
How Long Can Unified Multimodal Models Generate Images Reliably? Taming Long-Horizon Interleaved Image Generation via Context Curation

Haoyu Chen, Qing Liu, Yuqian Zhou, He Zhang, Zhaowen Wang, Mengwei Ren, Jingjing Ren, Xiang Wang, Zhe Lin, Lei Zhu
Not explicitly reported in the paper
arXiv (2026)
MM Memory Benchmark

📝 Paper Summary

Unified Multimodal Models (UMMs) Long-context Generation
UniLongGen stabilizes long-horizon multimodal generation by dynamically purging interfering visual history from memory, selecting what to keep by layer-specific attention relevance. It targets one specific problem: active visual pollution.
Core Problem
Unified multimodal models suffer rapid quality collapse when generating long interleaved sequences of images and text, failing after approximately 20 images regardless of available memory.
Why it matters:
  • Prevents the creation of coherent long-form visual narratives, storyboards, and iterative designs that require dozens of consistent turns
  • Reveals a structural vulnerability in autoregressive models where dense visual history actively corrupts generation (pollution) rather than just fading away (dilution)
  • Standard long-context solutions (token compression/retrieval) fail because they often preserve the 'heavy-tailed' outliers that cause the corruption
Concrete Example: In a 40-image narrative generation, a standard model maintains quality for the first 17 images but collapses into unrecognizable noise by image 30, even though the token count stays within the model's theoretical limit (e.g., 150k tokens).
Key Novelty
UniLongGen (Training-free Context Curation)
  • Identifies the 'Event Bottleneck': generation fails based on the number of discrete visual events (~20 images), not raw token count
  • Distinguishes 'passive dilution' (text history) from 'active pollution' (visual history), in which historical image tokens hijack attention
  • Implements a layer-split visibility policy: early layers attend to text-filtered history (for grounding), while late layers attend to image-filtered history (for synthesis)
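The layer-split visibility policy can be illustrated with a minimal sketch. It reflects one possible reading of the bullets above and is not the authors' implementation. Everything specific in it is a labeled assumption: the `Segment` type, the per-segment `relevance` score (standing in for attention relevance), the `split_layer` boundary, the `keep_k` budget, and the interpretation of "text-filtered" and "image-filtered" history as "only text segments" and "only image segments".

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    """One unit of interleaved history (hypothetical structure)."""
    kind: str         # "text" or "image"
    relevance: float  # hypothetical attention-relevance score for this segment


def visible_segments(history: List[Segment], layer: int,
                     split_layer: int, keep_k: int) -> List[int]:
    """Return the indices of history segments a given layer may attend to.

    Assumption: layers below `split_layer` see only text history (grounding),
    layers at or above it see only image history (synthesis). Within the
    chosen modality, only the `keep_k` most relevant segments stay visible.
    """
    wanted = "text" if layer < split_layer else "image"
    candidates = [i for i, seg in enumerate(history) if seg.kind == wanted]
    # Keep the most relevant segments, then restore chronological order.
    top = sorted(candidates, key=lambda i: history[i].relevance, reverse=True)
    return sorted(top[:keep_k])


if __name__ == "__main__":
    history = [
        Segment("text", 0.9),
        Segment("image", 0.1),
        Segment("text", 0.2),
        Segment("image", 0.8),
        Segment("image", 0.5),
    ]
    print(visible_segments(history, layer=2, split_layer=16, keep_k=1))
    print(visible_segments(history, layer=20, split_layer=16, keep_k=2))
```

In a real model the returned indices would be expanded into a token-level attention mask per layer. The sketch stops at segment selection because the summary does not describe the masking details.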
Evaluation Highlights
  • Identifies an 'Event Bottleneck': generation consistently collapses after roughly 20 to 25 images across different resolutions
  • Demonstrates that 150k tokens of text history allow high fidelity, while 150k tokens of image history (representing ~30 images) lead to total collapse
  • UniLongGen is reported to outperform baselines in long-horizon fidelity and consistency (qualitative result; specific metrics are not given in the text)
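A rough back-of-the-envelope check shows why the bottleneck is described in terms of events rather than tokens. The sketch below uses only the approximate figures quoted above (150k tokens of image history for about 30 images). It assumes a uniform per-image token cost, which would not hold exactly across resolutions, and the function names are illustrative.

```python
# Figures taken from the summary; both are approximate.
IMAGE_HISTORY_TOKENS = 150_000
IMAGES_IN_THAT_HISTORY = 30

# Assumption: every image costs the same number of tokens.
TOKENS_PER_IMAGE = IMAGE_HISTORY_TOKENS // IMAGES_IN_THAT_HISTORY


def image_history_tokens(num_images: int) -> int:
    """Estimated tokens consumed by `num_images` images of history."""
    return num_images * TOKENS_PER_IMAGE


def within_token_budget(num_images: int,
                        budget: int = IMAGE_HISTORY_TOKENS) -> bool:
    """True if the estimated image history still fits in the token budget."""
    return image_history_tokens(num_images) <= budget


if __name__ == "__main__":
    for n in (17, 20, 25, 30):
        print(n, image_history_tokens(n), within_token_budget(n))
```

Under these assumptions, 20 to 25 images correspond to roughly 100k to 125k tokens, still below the 150k-token budget at which text-only history is reported to remain high fidelity. This is consistent with the claim that collapse tracks the number of visual events rather than raw token count.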
Breakthrough Assessment
8/10
Provides a fundamental mechanistic explanation ('active pollution') for a common failure mode in multimodal models and proposes a training-free solution derived directly from these insights.