Better sources lead to better flows: learning condition-dependent source distributions, rather than a fixed Gaussian, improves conditional generation, such as text-to-image synthesis.

Introducing Condition-dependent Source Flow Matching (CSFM)

While flow matching allows arbitrary source distributions, most existing approaches rely on a fixed Gaussian and rarely treat the source distribution itself as an optimization target. Condition-dependent Source Flow Matching (CSFM) demonstrates that a principled design of the source distribution is not only feasible but also beneficial at the scale of modern text-to-image systems, by learning a condition-dependent source distribution that better exploits rich conditioning information.

Main teaser

Flow matching does not require the source distribution to be standard Gaussian. We leverage this flexibility by learning a condition-dependent source distribution, which reduces objective variance and improves flow matching performance.
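To make the variance-reduction intuition concrete, here is a minimal numpy sketch of the linear-interpolant flow matching target. The "condition-dependent" source here is a hypothetical stand-in (a Gaussian centered near the target), not the paper's learned network; it only illustrates why placing the source near the target shrinks the velocity target and hence the objective variance.

```python
import numpy as np

rng = np.random.default_rng(0)

def fm_pair(x0, x1, t):
    """Linear interpolant and velocity target used in flow matching.

    x_t = (1 - t) * x0 + t * x1, with target velocity u = x1 - x0.
    Nothing here requires x0 ~ N(0, I): CSFM instead draws x0 from a
    learned, condition-dependent distribution.
    """
    x_t = (1.0 - t) * x0 + t * x1
    u = x1 - x0
    return x_t, u

# Toy comparison: fixed Gaussian source vs. a hypothetical condition-
# dependent source already centered near the target mode.
x1 = np.array([3.0, 3.0])                    # target sample
x0_fixed = rng.standard_normal(2)            # fixed N(0, I) source
x0_cond = x1 + 0.1 * rng.standard_normal(2)  # condition-dependent source

# The velocity target is much smaller (lower variance) when the source
# sits near the target.
_, u_fixed = fm_pair(x0_fixed, x1, t=0.5)
_, u_cond = fm_pair(x0_cond, x1, t=0.5)
print(np.linalg.norm(u_fixed) > np.linalg.norm(u_cond))
```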

CSFM

Using simple 2D toy experiments, we begin by studying what happens when the source distribution is learned naively. These experiments reveal clear failure modes, highlighting that learning the source distribution is non-trivial and requires careful design.

Toy experiments

We evaluate different source design choices on two 2D synthetic datasets with continuous conditions: Eight Gaussians (polar angle condition) and Two Moons (x-coordinate condition). We visualize transport trajectories, where × marks source points (X0) and dots mark generated target samples, with colors indicating the conditioning variable.

Learning a Conditional Source Distribution (B)

We explicitly learn the source distribution in a condition-dependent manner, allowing it to adapt to the conditioning signal and enabling end-to-end optimization of the source–target coupling for conditional generation tasks such as text-to-image synthesis. However, implementing the conditional source as a deterministic mapping severely restricts its support and leads to degenerate transport.

Conditional Gaussian for Sufficient Support (C)

To ensure sufficient support and smooth interpolation, we parameterize the source as a conditional Gaussian. However, despite this Gaussian parameterization, joint training with the flow model often drives the conditional variance toward zero.
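A conditional Gaussian source can be sampled with the standard reparameterization trick, which keeps the draw differentiable in the source parameters. The sketch below is an assumption-laden stand-in: a linear map from the condition replaces the learned source network, and the one-hot condition is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def conditional_gaussian_source(c, W_mu, W_logvar, n_samples=1):
    """Sample x0 ~ N(mu(c), diag(sigma(c)^2)) via reparameterization.

    A linear map from the condition c stands in for the source network;
    in CSFM this would be learned end to end. The Gaussian keeps the
    source's support full-dimensional, unlike a deterministic map
    c -> x0, which collapses support and degrades transport.
    """
    mu = W_mu @ c
    sigma = np.exp(0.5 * (W_logvar @ c))
    eps = rng.standard_normal((n_samples, mu.shape[0]))
    return mu + sigma * eps  # differentiable in (mu, sigma)

# Hypothetical 2D example: the condition is a one-hot class indicator.
W_mu = np.array([[2.0, -2.0], [2.0, 2.0]])
W_logvar = np.zeros((2, 2))
c = np.array([1.0, 0.0])

x0 = conditional_gaussian_source(c, W_mu, W_logvar, n_samples=1000)
print(x0.mean(axis=0))  # close to mu(c) = [2, 2]
```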

Limitations of KL-based Regularization (D)

A common approach to stabilizing conditional Gaussian sources is to regularize them toward a standard normal distribution using KL divergence. However, this constrains both the variance and the mean, preventing the source from relocating toward target modes and resulting in entangled transport paths with limited performance gains.

Variance-only Regularization (VarReg) (E)

To overcome this limitation, we introduce variance-only regularization, which controls the source variance while leaving the mean unconstrained. This allows the source distribution to move freely toward target modes while maintaining sufficient spread.
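The contrast between the two regularizers can be written in a few lines. The variance-only form below is one plausible instantiation (the KL with the mean term dropped); the paper's exact regularizer may differ, so treat this as a sketch of the idea rather than the implementation.

```python
import numpy as np

def kl_to_standard_normal(mu, logvar):
    """KL(N(mu, sigma^2) || N(0, I)): penalizes both mean and variance."""
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)

def var_reg(logvar):
    """Variance-only regularizer: pulls sigma^2 toward 1, leaves mu free.

    One plausible form (CSFM's exact choice may differ): the KL above
    with the mean term removed, 0.5 * sum(sigma^2 - 1 - log sigma^2).
    """
    return 0.5 * np.sum(np.exp(logvar) - 1.0 - logvar)

mu = np.array([5.0, -3.0])  # source relocated toward a target mode
logvar = np.zeros(2)        # unit variance

print(kl_to_standard_normal(mu, logvar))  # large: punishes the relocated mean
print(var_reg(logvar))                    # 0.0: relocating the mean is free
```

This makes the failure mode of KL regularization explicit: any movement of the mean toward a target mode is penalized, while the variance-only form charges nothing for it.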

Source–Target Directional Alignment

In practical conditional generation tasks such as text-to-image synthesis, the flow model is typically equipped with strong conditional modeling capacity. Consequently, the flow matching objective provides relatively weak learning signals for the source distribution, making it difficult to learn an informative source in practice. To address this, we explicitly encourage directional alignment between the learned source and target samples, guiding the source to better reflect the target structure in high-dimensional settings and facilitating more stable optimization.

Together, these components define Condition-dependent Source Flow Matching (CSFM), a framework that learns condition-dependent source distributions with sufficient support, controlled variance, and explicit source–target alignment—resulting in reduced path entanglement and simplified flow learning.

Advantages of CSFM

CSFM improves flow matching by reducing intrinsic variance through a condition-dependent source, leading to faster and more stable training. This results in higher sample quality with fewer training and sampling steps, including more robust few-step generation due to straighter transport paths. CSFM also consistently outperforms prior condition-aware source methods and remains effective when combined with guidance, demonstrating strong practicality in conditional generation settings.

Flow matching loss and gradient variance
CSFM shows a faster decrease in the flow matching loss and consistently lower gradient variance than standard FM, with particularly clear gains at small interpolation times near the source. As a secondary observation, removing the alignment loss results in higher FM loss and increased gradient variance, suggesting that alignment also contributes to improved flow matching optimization.
Training efficiency curves
The improved training dynamics translate into performance gains: CSFM converges faster in both FID and CLIP Score, including up to 3.01× faster convergence in FID and 2.48× faster convergence in CLIP Score.
Few-step FID comparison
CSFM degrades more gracefully as the number of sampling steps decreases, indicating reduced path variance and a straighter transport field. Under 1-Reflow, CSFM suffers substantially less degradation than standard FM when steps are reduced aggressively.
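The intuition behind few-step robustness can be checked with a toy Euler sampler: for a perfectly straight (constant) velocity field, even a single step integrates the flow exactly, so straighter transport fields lose less under aggressive step reduction.

```python
import numpy as np

def euler_sample(v, x0, n_steps):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with n_steps Euler steps."""
    x, dt = x0.copy(), 1.0 / n_steps
    for i in range(n_steps):
        x = x + dt * v(x, i * dt)
    return x

# Toy case: a perfectly straight field v(x, t) = x1 - x0 (constant in x
# and t). One Euler step already lands exactly on the target.
x0, x1 = np.array([0.0, 0.0]), np.array([3.0, 3.0])
v_straight = lambda x, t: x1 - x0
print(euler_sample(v_straight, x0, n_steps=1))  # exactly [3. 3.]
```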
Comparison with source learning methods
Compared to prior approaches that modify the training source distribution, such as CrossFlow and C2OT, CSFM achieves better FID and CLIP Score under the same evaluation setting.
Guidance comparison
Because the learned source already encodes conditional information, CSFM does not adopt classifier-free guidance and is instead evaluated with autoguidance. CSFM remains effective with autoguidance, achieving gains comparable to those in the no-guidance setting.

We scale CSFM to a 1.3B-parameter text-to-image model and find that learning a condition-dependent source remains effective at this scale. Qualitative results from the large model demonstrate that the benefits of learnable source distributions persist in high-capacity settings.

qualitative results
qualitative comparison results

Component-wise Analysis of CSFM

We validate that the insights from toy experiments extend to practical text-to-image generation by constructing an ImageNet-based dataset with descriptive captions and evaluating our method on it. We examine key design choices in realistic settings and further test robustness across different conditioning architectures and text encoders. Across these settings, the proposed design remains consistently effective, demonstrating that our analysis generalizes beyond simplified toy scenarios to practical text-to-image models.

Component analysis

We evaluate individual components on a captioned ImageNet dataset. Gray rows indicate fixed-Gaussian baselines; bold entries denote the default setting; † indicates a parameter-matched baseline.

Target Representation Matters

The effectiveness of learning a condition-dependent source depends strongly on the structure of the target representation. When the target space exhibits well-separated and discriminative structure with respect to the conditioning signal, source learning becomes more effective. Consistent with this, we observe substantially larger gains in both FID and CLIP Score in the RAE latent space than in the SD-VAE latent space.

This effect is illustrated by t-SNE visualizations, where structured target representations (RAE) lead to more organized and discriminative learned sources, while poorly structured representations (SD-VAE) result in entangled targets and sources that resemble a fixed Gaussian prior. These observations highlight that CSFM benefits most when applied to target representations with clear condition-dependent structure.

Source distribution comparison
Comparison between SD-VAE and RAE

Conclusion

In this work, we present Condition-dependent Source Flow Matching (CSFM), demonstrating that principled design of the source distribution can improve flow matching models by facilitating more favorable training dynamics and leading to consistent performance gains. Through extensive experiments and analyses, we elucidate the core mechanisms underlying our approach and show how condition-dependent source design enables more efficient and stable learning in complex conditional generation settings.

Citation

If you use this work or find it helpful, please consider citing: