Abstract. Compositional generalization, the ability to generate novel combinations of known concepts, is a key ingredient for visual generative models. Yet, the mechanisms that enable or inhibit it are not fully understood. In this work, we conduct a systematic study of how various design choices promote or hinder compositional generalization in image and video generation. Through controlled experiments, we identify two key factors: (i) whether the training objective operates on a discrete or continuous distribution, and (ii) to what extent conditioning provides information about the constituent concepts during training. Building on these insights, we show that relaxing MaskGIT's discrete loss with an auxiliary continuous JEPA-based objective improves compositional performance in discrete models.
Evaluation Setup: To answer these questions, we conduct controlled experiments where models are trained on a subset of possible factor combinations (e.g., 4 out of 8 combinations of gender, hair color, and smile) and evaluated on held-out compositions. We use trained probes to verify whether generated samples correctly match the intended factors. We distinguish between level-1 compositions (one factor differs from training) and level-2 compositions (two factors differ), with level-2 serving as the strongest test of compositional generalization. Throughout our results, blue curves show performance on training compositions, pink curves show level-1 generalization, and red curves show level-2 generalization. Higher probe accuracy indicates the model successfully generates the intended factor combinations.
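As a concrete illustration of the level definition, the sketch below (with a hypothetical train split, not necessarily the one used in our experiments) assigns each factor combination its level as the minimum number of factors in which it differs from any training combination:

```python
import itertools

def composition_levels(train_set, factor_sizes):
    """Assign each factor combination a generalization level: the minimum
    Hamming distance to any training combination (0 = seen in training,
    1 = level-1 composition, 2 = level-2 composition)."""
    all_combos = itertools.product(*[range(n) for n in factor_sizes])
    levels = {}
    for combo in all_combos:
        levels[combo] = min(
            sum(a != b for a, b in zip(combo, t)) for t in train_set
        )
    return levels

# Three binary factors (e.g., gender, hair color, smile);
# train on 4 of the 8 possible combinations (illustrative split).
train = [(0, 0, 0), (0, 0, 1), (0, 1, 0), (1, 0, 0)]
levels = composition_levels(train, [2, 2, 2])
# (1, 1, 1) differs from every training combination in at least two
# factors, so it is a level-2 composition.
```

Probe accuracy is then reported separately for each level, with level-2 combinations providing the hardest test.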
DiT achieves comparable compositional generalization performance across both tokenizer types by the end of training. The training dynamics, however, differ: with a continuous tokenizer progress is more gradual and steady, whereas with a discrete tokenizer, compositional generalization emerges more abruptly.
Through systematically controlling for the main differences between DiT and MaskGIT, including the choice of tokenizer, masking strategies, and loss functions, we find that none of these factors critically impacts compositional generalization. The remaining distinguishing factor is the nature of the predicted outputs: DiT predicts continuous-valued quantities, while MaskGIT predicts discrete tokens. Based on our experiments, we conclude that this difference in output representation (and thereby in the respective objective) is the key factor underlying DiT's ability to achieve compositional generalization, which we do not observe in MaskGIT.
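To make the contrast concrete, here is a minimal NumPy sketch of the two kinds of objective; the function names, shapes, and the use of plain MSE as a stand-in for the denoising loss are illustrative, not the actual training code:

```python
import numpy as np

def discrete_token_loss(logits, target_ids):
    """MaskGIT-style objective: cross-entropy over a categorical
    distribution across the codebook, per masked position."""
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(target_ids)), target_ids].mean()

def continuous_latent_loss(pred, target):
    """DiT-style objective: regression on continuous-valued latents
    (plain MSE here, standing in for the diffusion denoising loss)."""
    return np.mean((pred - target) ** 2)

rng = np.random.default_rng(0)
logits = rng.normal(size=(16, 1024))   # 16 masked positions, 1024-way codebook
ids = rng.integers(0, 1024, size=16)
pred = rng.normal(size=(16, 8))        # continuous latent predictions
target = rng.normal(size=(16, 8))
print(discrete_token_loss(logits, ids), continuous_latent_loss(pred, target))
```

The hypothesis is not about loss magnitude but about what the gradient flows through: a categorical distribution over token ids versus a continuous-valued output space.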
Compositional generalization becomes less stable under quantized conditioning. When some factors present in the image are occasionally missing from the conditioning signal, compositional generalization typically fails (see (2b) Conditioning dropout). Combining quantization with incomplete conditioning, a setting common in practice, produces the strongest negative effect (see (2c) Discrete (quantized) conditioning and (2d) Average over seeds).
Overall, these results show that limited information—whether through quantization or incomplete conditioning—can impair compositional generalization, even when all factors are provided during generation.
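The two conditioning degradations can be mimicked directly on factor labels; the following sketch uses hypothetical helper functions (not our training pipeline) to show quantization into coarse bins and per-factor conditioning dropout:

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize_conditioning(c, n_bins=4):
    """Map a continuous factor value in [0, 1) onto a coarse bin id,
    discarding within-bin information (the 'quantized' setting)."""
    return np.floor(c * n_bins).astype(int)

def dropout_conditioning(factors, p_drop=0.3, missing=-1):
    """Randomly omit factors from the conditioning signal during training
    (the 'conditioning dropout' setting); -1 marks a missing factor."""
    mask = rng.random(len(factors)) < p_drop
    out = np.array(factors, dtype=int)
    out[mask] = missing
    return out

c = np.array([0.05, 0.40, 0.90])
print(quantize_conditioning(c))        # -> [0 1 3]: coarse bins only
print(dropout_conditioning([2, 0, 1])) # some entries may be replaced by -1
```

Either operation alone limits the information the model receives about the constituent concepts; applying both compounds the effect.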
In the previous section, we identified a key factor that can hinder compositional generalization in modern generative models: training objectives that operate over discrete, categorical distributions. Nevertheless, discrete training objectives, such as the one used in MaskGIT, offer practical advantages over alternatives like DiT, for example faster sampling. This raises an important question: can we retain these advantages while also improving compositional generalization?
An overview of MaskGIT combined with the JEPA-based training objective. We apply the JEPA loss at specific layers l to intermediate masked-token representations in the transformer, H_C^(l), and train a lightweight predictor to reconstruct target states H_T^(l), using MSE as the error metric and a stop-gradient signal to avoid representation collapse.
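A minimal sketch of this auxiliary loss, with a single linear map standing in for the lightweight predictor (the actual predictor may be a small MLP) and a placeholder stop-gradient; in a real autodiff framework the stop-gradient would be a detach on the target branch:

```python
import numpy as np

rng = np.random.default_rng(42)
d = 64                                  # hidden width of the transformer layer

# Lightweight predictor: a single linear map (hypothetical choice).
W = rng.normal(scale=0.02, size=(d, d))

def stop_gradient(x):
    """Placeholder for detach(): in an autodiff framework this blocks
    gradients into the target branch, helping avoid representation collapse."""
    return x.copy()

def jepa_aux_loss(h_context, h_target):
    """MSE between the predicted and (stop-gradded) target hidden states
    at layer l, averaged over tokens and channels."""
    pred = h_context @ W
    tgt = stop_gradient(h_target)
    return np.mean((pred - tgt) ** 2)

h_c = rng.normal(size=(16, d))          # masked-token states H_C^(l)
h_t = rng.normal(size=(16, d))          # target states H_T^(l)
print(jepa_aux_loss(h_c, h_t))          # non-negative scalar
```

This continuous regression term is added to the discrete MaskGIT loss, so the model keeps its fast discrete sampling while training also flows through a continuous objective.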
We also analyze how the model separates visual factors like color and shape. First, we find that some attention heads are polysemantic, attending to multiple features at once, which leads to entangled representations. Next, by examining the most influential neurons for each concept, we observe how internal circuits form and overlap. (c) and (d) show that adding the auxiliary JEPA objective reduces both polysemanticity and neuron overlap, leading to more distinct, factor-specific representations.
We validate our findings across datasets of increasing complexity and different modalities. We start with the simple synthetic Shapes2D, then move to Shapes3D, a much larger, fully factorial dataset with far more possible compositions and harder-to-disentangle factors. To test real-world images we use CelebA, and for video we evaluate on the synthetic CLEVRER and on CoVLA, the latter using a learned world model. Finally, we present first evidence that the trends extend to language by testing LLaMa-3 on Points24 with different reasoning mechanisms.
While our results on CelebA show that our findings hold for real-world data, we also wanted to test whether the same properties of compositional generalization persist when the complexity of the dataset itself increases, making the task harder. To do so, we test on Shapes3D, a synthetic dataset of rendered 3D scenes with six independently varying factors. Of these, we select three factors to control, yielding 240 different compositions (up from 8).
Pale orange marks level-1 compositions, red highlights the most novel level-2 compositions, and blue denotes seen combinations.
To test whether our findings hold in real-world video, we evaluate compositional generalization on CoVLA, a driving dataset with factors like time of day and turn direction. We use a learned world model to assess how well the model can generate or predict unseen combinations of these factors.
Our findings on visual models appear to extend to language. Evaluating LLaMa-3.2 on the Points24 dataset, we compared discrete Chain-of-Thought (CoT) reasoning with its continuous counterpart, Chain of Continuous Thought (COCONUT). COCONUT achieved substantially higher accuracy on compositional splits (12.39% vs. 4.82%), indicating that continuous reasoning objectives may also enhance compositional generalization in language. Further exploration of this cross-modal consistency remains an open direction.
@article{farid2025compositional,
title = {What Drives Compositional Generalization in Visual Generative Models?},
author = {Farid, Karim and Sahay, Rajat and Alnaggar, Yumna Ali and Schrodi, Simon and Fischer, Volker and Schmid, Cordelia and Brox, Thomas},
journal = {arXiv preprint arXiv:2510.03075},
year = {2025},
url = {https://arxiv.org/abs/2510.03075}
}