Introducing Condition-dependent Source Flow Matching (CSFM)
While flow matching allows arbitrary source distributions, most existing approaches rely on a fixed Gaussian and rarely treat the source distribution itself as an optimization target. Condition-dependent Source Flow Matching (CSFM) demonstrates that a principled design of the source distribution is not only feasible but also beneficial at the scale of modern text-to-image systems: it learns a condition-dependent source distribution that better exploits rich conditioning information.
Flow matching does not require the source distribution to be a standard Gaussian. We exploit this flexibility by learning a condition-dependent source distribution, which reduces the variance of the flow matching objective and improves performance.
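As a concrete reference point, here is a minimal sketch of the conditional flow matching loss for a linear probability path, written in PyTorch. The `flow_net` callable and its signature are placeholders rather than the paper's actual architecture; the key point is that `x0` can be drawn from any source, including a learned condition-dependent one.

```python
import torch

def cfm_loss(flow_net, x0, x1, cond):
    """Conditional flow matching loss for the linear path
    x_t = (1 - t) * x0 + t * x1, whose target velocity is x1 - x0.
    x0 may come from any source: torch.randn_like(x1) for the usual
    fixed Gaussian, or samples from a learned conditional source."""
    t = torch.rand(x1.shape[0], 1, device=x1.device)  # per-sample time
    xt = (1 - t) * x0 + t * x1                        # point on the path
    v_pred = flow_net(xt, t, cond)                    # predicted velocity
    return ((v_pred - (x1 - x0)) ** 2).mean()
```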
CSFM
Using simple 2D toy experiments, we begin by studying what happens when the source distribution is learned naively. These experiments reveal clear failure modes, showing that learning the source distribution is non-trivial and requires careful design.
We evaluate different source design choices on two 2D synthetic datasets with continuous conditions: Eight Gaussians (polar angle condition) and Two Moons (x-coordinate condition). We visualize transport trajectories, where × denotes source points (X0) and • denotes generated target samples, with colors indicating the conditioning variable.
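For readers who want to reproduce the toy setup, the sketch below generates both datasets with continuous conditions. The exact scales, noise levels, and condition definitions here are our assumptions and may differ from the paper's.

```python
import numpy as np

def eight_gaussians(n, std=0.1):
    """Eight Gaussian modes on a circle; we take the polar angle of each
    sample's mode as the continuous condition (our reading of the setup)."""
    k = np.random.randint(0, 8, size=n)
    angle = 2 * np.pi * k / 8
    centers = 2.0 * np.stack([np.cos(angle), np.sin(angle)], axis=1)
    return centers + std * np.random.randn(n, 2), angle

def two_moons(n, noise=0.05):
    """Two interleaving half-circles; the condition is each sample's
    x-coordinate."""
    t = np.pi * np.random.rand(n)
    upper = np.random.rand(n) < 0.5
    x = np.where(upper, np.cos(t), 1.0 - np.cos(t))
    y = np.where(upper, np.sin(t), 0.5 - np.sin(t))
    pts = np.stack([x, y], axis=1) + noise * np.random.randn(n, 2)
    return pts, pts[:, 0]
```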
Learning a Conditional Source Distribution (B)
We explicitly learn the source distribution in a condition-dependent manner, allowing it to adapt to the conditioning signal and enabling end-to-end optimization of the source–target coupling for conditional generation tasks such as text-to-image synthesis. However, implementing the conditional source as a deterministic mapping severely restricts its support and leads to degenerate transport.
Conditional Gaussian for Sufficient Support (C)
To ensure sufficient support and smooth interpolation, we parameterize the source as a conditional Gaussian. In practice, however, joint training with the flow model often drives the conditional variance toward zero, collapsing the source back toward the degenerate deterministic case.
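A minimal sketch of such a conditional Gaussian source, assuming a small MLP that predicts mean and log-variance from the condition (the architecture is our assumption, not the paper's). Sampling uses the reparameterization trick so gradients from the flow matching loss reach the source parameters; unlike the deterministic mapping in (B), the noise term gives the source full support.

```python
import torch
import torch.nn as nn

class ConditionalGaussianSource(nn.Module):
    """Conditional Gaussian source N(mu(c), diag(sigma(c)^2))."""
    def __init__(self, cond_dim, data_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(cond_dim, hidden), nn.SiLU(),
            nn.Linear(hidden, 2 * data_dim),
        )

    def forward(self, cond):
        mu, logvar = self.net(cond).chunk(2, dim=-1)
        eps = torch.randn_like(mu)          # noise gives the source full support
        x0 = mu + torch.exp(0.5 * logvar) * eps  # reparameterized sample
        return x0, mu, logvar
```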
Limitations of KL-based Regularization (D)
A common approach to stabilizing conditional Gaussian sources is to regularize them toward a standard normal distribution using KL divergence. However, this constrains both the variance and the mean, preventing the source from relocating toward target modes and resulting in entangled transport paths with limited performance gains.
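For reference, the KL term between a diagonal Gaussian and the standard normal has a closed form, and its mean-dependent part is exactly what pins the source to the origin:

```python
import torch

def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, sigma^2) || N(0, I) ), summed over dimensions:
    0.5 * sum( mu^2 + sigma^2 - log sigma^2 - 1 ).
    The mu^2 term penalizes any shift of the mean, which is what keeps
    a KL-regularized source from relocating toward target modes."""
    return 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1).sum(dim=-1)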
Variance-only Regularization (VarReg) (E)
To overcome this limitation, we introduce variance-only regularization, which controls the source variance while leaving the mean unconstrained. This allows the source distribution to move freely toward target modes while maintaining sufficient spread.
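One natural instantiation of variance-only regularization simply drops the mean term from the KL above, pulling sigma^2 toward 1 while leaving the mean unconstrained. Treat this as a sketch: the exact form used by CSFM may differ.

```python
import torch

def variance_only_reg(logvar):
    """Variance-only regularizer (sketch): the KL above with its mean
    term dropped, so sigma^2 is pulled toward 1 while the mean is free
    to move toward target modes."""
    return 0.5 * (logvar.exp() - logvar - 1).sum(dim=-1)
```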
Source–Target Directional Alignment
In practical conditional generation tasks such as text-to-image synthesis, the flow model is typically equipped with strong conditional modeling capacity. Consequently, the flow matching objective provides relatively weak learning signals for the source distribution, making it difficult to learn an informative source in practice. To address this, we explicitly encourage directional alignment between the learned source and target samples, guiding the source to better reflect the target structure in high-dimensional settings and facilitating more stable optimization.
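As an illustration, a cosine-similarity penalty between the source mean and its paired target sample is one plausible way to realize such a directional alignment term; the paper's exact formulation may differ.

```python
import torch.nn.functional as F

def directional_alignment_loss(mu, x1):
    """Hypothetical directional-alignment term: push the source mean
    mu(c) to point in the same direction as its paired target sample x1."""
    return (1.0 - F.cosine_similarity(mu, x1, dim=-1)).mean()
```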
Together, these components define Condition-dependent Source Flow Matching (CSFM), a framework that learns condition-dependent source distributions with sufficient support, controlled variance, and explicit source–target alignment, resulting in reduced path entanglement and simplified flow learning.
Advantages of CSFM
CSFM improves flow matching by reducing the intrinsic variance of the objective through a condition-dependent source, leading to faster and more stable training. The result is higher sample quality with fewer training and sampling steps, and more robust few-step generation owing to straighter transport paths. CSFM also consistently outperforms prior condition-aware source methods and remains effective when combined with guidance, demonstrating strong practicality in conditional generation settings.
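To make the few-step claim concrete, here is a hypothetical Euler sampler that starts from the learned conditional source (reusing the `ConditionalGaussianSource` sketch above) instead of a fixed Gaussian; with straighter paths, even a handful of steps can suffice.

```python
import torch

@torch.no_grad()
def sample(flow_net, source, cond, steps=4):
    """Few-step Euler integration of the learned velocity field,
    starting from the conditional source."""
    x, _, _ = source(cond)                 # x0 ~ N(mu(c), diag(sigma(c)^2))
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((x.shape[0], 1), i * dt, device=x.device)
        x = x + dt * flow_net(x, t, cond)  # Euler step along predicted velocity
    return x
```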
We scale CSFM to a 1.3B-parameter text-to-image model and find that learning a condition-dependent source remains effective at this scale. Qualitative results from the large model demonstrate that the benefits of learnable source distributions persist in high-capacity settings.
Component-wise Analysis of CSFM
We validate that the insights from toy experiments extend to practical text-to-image generation by constructing an ImageNet-based dataset with descriptive captions and evaluating our method on it. We examine key design choices in realistic settings and further test robustness across different conditioning architectures and text encoders. Across these settings, the proposed design remains consistently effective, demonstrating that our analysis generalizes beyond simplified toy scenarios to practical text-to-image models.
We evaluate individual components on a captioned ImageNet dataset. Gray rows indicate fixed-Gaussian baselines; bold entries denote the default setting; † indicates a parameter-matched baseline.
Target Representation Matters
The effectiveness of learning a condition-dependent source depends strongly on the structure of the target representation. When the target space exhibits well-separated, discriminative structure with respect to the conditioning signal, source learning becomes more effective. Consistent with this, we observe substantially larger gains in both FID and CLIP Score in the RAE latent space than in the SD-VAE latent space.
This effect is illustrated by t-SNE visualizations, where structured target representations (RAE) lead to more organized and discriminative learned sources, while poorly structured representations (SD-VAE) result in entangled targets and sources that resemble a fixed Gaussian prior. These observations highlight that CSFM benefits most when applied to target representations with clear condition-dependent structure.
Conclusion
In this work, we present Condition-dependent Source Flow Matching (CSFM), demonstrating that principled design of the source distribution can improve flow matching models by facilitating more favorable training dynamics and leading to consistent performance gains. Through extensive experiments and analyses, we elucidate the core mechanisms underlying our approach and show how condition-dependent source design enables more efficient and stable learning in complex conditional generation settings.
Citation
If you use this work or find it helpful, please consider citing: