Abstract. Compositional generalization, the ability to generate novel combinations of known concepts, is a key ingredient for visual generative models. Yet, the mechanisms that enable or inhibit it are not fully understood. In this work, we conduct a systematic study of how various design choices promote or hinder compositional generalization in image and video generation. Through controlled experiments, we identify two key factors: (i) whether the training objective operates on a discrete or continuous distribution, and (ii) to what extent conditioning provides information about the constituent concepts during training. Building on these insights, we show that relaxing MaskGIT's discrete loss with an auxiliary continuous JEPA-based objective improves compositional performance in discrete models.
Evaluation Setup: To answer these questions, we conduct controlled experiments where models are trained on a subset of possible factor combinations (e.g., 4 out of 8 combinations of gender, hair color, and smile) and evaluated on held-out compositions. We use trained probes to verify whether generated samples correctly match the intended factors. We distinguish between level-1 compositions (one factor differs from training) and level-2 compositions (two factors differ), with level-2 serving as the strongest test of compositional generalization. Throughout our results, blue curves show performance on training compositions, pink curves show level-1 generalization, and red curves show level-2 generalization. Higher probe accuracy indicates the model successfully generates the intended factor combinations.
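As a concrete illustration of the level definition, the sketch below (with a hypothetical train split, not necessarily the one used in our experiments) assigns each factor combination its level as the minimum number of factors in which it differs from any training combination:

```python
import itertools

def composition_levels(train_set, factor_sizes):
    """Assign each factor combination a generalization level: the minimum
    Hamming distance to any training combination (0 = seen in training,
    1 = level-1 composition, 2 = level-2 composition)."""
    all_combos = itertools.product(*[range(n) for n in factor_sizes])
    levels = {}
    for combo in all_combos:
        levels[combo] = min(
            sum(a != b for a, b in zip(combo, t)) for t in train_set
        )
    return levels

# Three binary factors (e.g., gender, hair color, smile);
# train on 4 of the 8 possible combinations (illustrative split).
train = [(0, 0, 0), (0, 0, 1), (0, 1, 0), (1, 0, 0)]
levels = composition_levels(train, [2, 2, 2])
# (1, 1, 1) differs from every training combination in at least two
# factors, so it is a level-2 composition.
```

Probe accuracy is then reported separately for each level, with level-2 combinations providing the hardest test.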
DiT achieves comparable compositional generalization performance across both tokenizer types by the end of training. The training dynamics, however, differ: with a continuous tokenizer progress is more gradual and steady, whereas with a discrete tokenizer, compositional generalization emerges more abruptly.
Through systematically controlling for the main differences between DiT and MaskGIT, including the choice of tokenizer, masking strategies, and loss functions, we find that none of these factors critically impacts compositional generalization. The remaining distinguishing factor is the nature of the predicted outputs: DiT predicts continuous-valued quantities, while MaskGIT predicts discrete tokens. Based on our experiments, we conclude that this difference in output representation (and thereby in the respective objective) is the key factor underlying DiT's ability to achieve compositional generalization, which we do not observe in MaskGIT.
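To make the contrast concrete, here is a minimal NumPy sketch of the two kinds of objective; the function names, shapes, and the use of plain MSE as a stand-in for the denoising loss are illustrative, not the actual training code:

```python
import numpy as np

def discrete_token_loss(logits, target_ids):
    """MaskGIT-style objective: cross-entropy over a categorical
    distribution across the codebook, per masked position."""
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(target_ids)), target_ids].mean()

def continuous_latent_loss(pred, target):
    """DiT-style objective: regression on continuous-valued latents
    (plain MSE here, standing in for the diffusion denoising loss)."""
    return np.mean((pred - target) ** 2)

rng = np.random.default_rng(0)
logits = rng.normal(size=(16, 1024))   # 16 masked positions, 1024-way codebook
ids = rng.integers(0, 1024, size=16)
pred = rng.normal(size=(16, 8))        # continuous latent predictions
target = rng.normal(size=(16, 8))
print(discrete_token_loss(logits, ids), continuous_latent_loss(pred, target))
```

The hypothesis is not about loss magnitude but about what the gradient flows through: a categorical distribution over token ids versus a continuous-valued output space.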
Compositional generalization becomes less stable under quantized conditioning. When some factors present in the image are occasionally missing from the conditioning signal, compositional generalization typically fails (see (2b) Conditioning dropout). Combining quantization with incomplete conditioning, a setting common in practice, produces the strongest negative effect (see (2c) Discrete (quantized) conditioning and (2d) Average over seeds).
Overall, these results show that limited information—whether through quantization or incomplete conditioning—can impair compositional generalization, even when all factors are provided during generation.
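The two conditioning degradations can be mimicked directly on factor labels; the following sketch uses hypothetical helper functions (not our training pipeline) to show quantization into coarse bins and per-factor conditioning dropout:

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize_conditioning(c, n_bins=4):
    """Map a continuous factor value in [0, 1) onto a coarse bin id,
    discarding within-bin information (the 'quantized' setting)."""
    return np.floor(c * n_bins).astype(int)

def dropout_conditioning(factors, p_drop=0.3, missing=-1):
    """Randomly omit factors from the conditioning signal during training
    (the 'conditioning dropout' setting); -1 marks a missing factor."""
    mask = rng.random(len(factors)) < p_drop
    out = np.array(factors, dtype=int)
    out[mask] = missing
    return out

c = np.array([0.05, 0.40, 0.90])
print(quantize_conditioning(c))        # -> [0 1 3]: coarse bins only
print(dropout_conditioning([2, 0, 1])) # some entries may be replaced by -1
```

Either operation alone limits the information the model receives about the constituent concepts; applying both compounds the effect.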
In the previous section, we identified a key factor that can hinder compositional generalization in modern generative models: training objectives that operate over discrete, categorical distributions. Nevertheless, discrete training objectives, such as the one used in MaskGIT, offer practical advantages over alternatives like DiT, for example faster sampling. This raises an important question: can we retain these advantages while also improving compositional generalization?
An overview of MaskGIT combined with the JEPA-based training objective. We apply the JEPA loss at specific layers l to intermediate masked-token representations in the transformer, H_C^(l), and train a lightweight predictor to reconstruct target states H_T^(l), using MSE as the error metric and a stop-gradient signal to avoid representation collapse.
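A minimal sketch of this auxiliary loss, with a single linear map standing in for the lightweight predictor (the actual predictor may be a small MLP) and a placeholder stop-gradient; in a real autodiff framework the stop-gradient would be a detach on the target branch:

```python
import numpy as np

rng = np.random.default_rng(42)
d = 64                                  # hidden width of the transformer layer

# Lightweight predictor: a single linear map (hypothetical choice).
W = rng.normal(scale=0.02, size=(d, d))

def stop_gradient(x):
    """Placeholder for detach(): in an autodiff framework this blocks
    gradients into the target branch, helping avoid representation collapse."""
    return x.copy()

def jepa_aux_loss(h_context, h_target):
    """MSE between the predicted and (stop-gradded) target hidden states
    at layer l, averaged over tokens and channels."""
    pred = h_context @ W
    tgt = stop_gradient(h_target)
    return np.mean((pred - tgt) ** 2)

h_c = rng.normal(size=(16, d))          # masked-token states H_C^(l)
h_t = rng.normal(size=(16, d))          # target states H_T^(l)
print(jepa_aux_loss(h_c, h_t))          # non-negative scalar
```

This continuous regression term is added to the discrete MaskGIT loss, so the model keeps its fast discrete sampling while training also flows through a continuous objective.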
We also analyze how the model separates visual factors like color and shape. First, we find that some attention heads are polysemantic, attending to multiple features at once, which leads to entangled representations. Next, by examining the most influential neurons for each concept, we observe how internal circuits form and overlap. (c) and (d) show that adding the auxiliary JEPA objective reduces both polysemanticity and neuron overlap, leading to more distinct, factor-specific representations.
We validate our findings across datasets of increasing complexity and different modalities. We start with the simple synthetic Shapes2D, then move to Shapes3D, a much larger, fully factorial dataset with far more possible compositions and harder-to-disentangle factors. To test real-world images we use CelebA, and for video we evaluate on the synthetic CLEVRER and on CoVLA, the latter using a learned world model. Finally, we present first evidence that the trends extend to language by testing LLaMa-3 on Points24 with different reasoning mechanisms.
While our results on CelebA show that our findings hold for real-world data, we also wanted to test whether the same properties of compositional generalization persist when the complexity of the dataset itself increases, making the task harder. To do so, we test on Shapes3D, a synthetic dataset of rendered 3D scenes with six independently varying factors. Of these, we select three factors to control, yielding 240 different compositions (up from 8).
Pale orange marks level-1 compositions, red highlights the most novel level-2 compositions, and blue denotes seen combinations.
To test whether our findings hold in real-world video, we evaluate compositional generalization on CoVLA, a driving dataset with factors like time of day and turn direction. We use a learned world model to assess how well the model can generate or predict unseen combinations of these factors.
Our findings on visual models appear to extend to language. Evaluating LLaMa-3.2 on the Points24 dataset, we compared discrete Chain-of-Thought (CoT) reasoning with its continuous counterpart, Chain of Continuous Thought (COCONUT). COCONUT achieved substantially higher accuracy on compositional splits (12.39% vs. 4.82%), indicating that continuous reasoning objectives may also enhance compositional generalization in language. Further exploration of this cross-modal consistency remains an open direction.
@article{farid2025compositional,
title = {What Drives Compositional Generalization in Visual Generative Models?},
author = {Farid, Karim and Sahay, Rajat and Alnaggar, Yumna Ali and Schrodi, Simon and Fischer, Volker and Schmid, Cordelia and Brox, Thomas},
journal = {arXiv preprint arXiv:2510.03075},
year = {2025},
url = {https://arxiv.org/abs/2510.03075}
}