
Long-VITA: Scaling Large Multi-modal Models to 1 Million Tokens with Leading Short-Context Accuracy

Yunhang Shen, Chaoyou Fu, Shaoqi Dong, Xiong Wang, Peixian Chen, Mengdan Zhang, Haoyu Cao, Ke Li, Xiawu Zheng, Yan Zhang, Yiyi Zhou, Rongrong Ji, Xing Sun
Tencent Youtu Lab, Nanjing University, East China Normal University, Xiamen University
arXiv.org (2025)
MM Pretraining Benchmark

📝 Paper Summary

Long-context Multi-modal Models, Vision-Language Alignment, Efficient Long-context Inference
Long-VITA scales open-source multi-modal models to 1 million tokens of video, image, and text input using a four-stage training pipeline and optimized inference techniques, achieving strong performance on both short- and long-context tasks without proprietary data.
Core Problem
Most open-source vision-language models struggle with long-context inputs (like long videos or many images) compared to proprietary models like Gemini 1.5 Pro, and existing solutions often degrade performance on standard short-context tasks.
Why it matters:
  • Proprietary models can process 1 hour of video or 1M tokens, but open-source equivalents lag significantly, limiting public research and application
  • Existing long-context methods often focus solely on video, neglecting heavy multi-image scenarios or sacrificing static image quality
  • Token compression techniques used to handle long sequences often discard information and degrade performance
Concrete Example: When processing a long video or a comic book with hundreds of pages, standard models exhaust their context window or fail to recall specific details. Current open-source models usually cap at much shorter lengths, or use compression that blurs the fine-grained visual details needed for accurate Q&A.
Key Novelty
Phased Long-Context Scaling with Logits-Masked Inference
  • A four-stage training pipeline that starts with standard alignment and general knowledge, then progressively extends context length (128K -> 1M) using specialized long-context data (comics, movie summaries)
  • A logits-masked language modeling head used during inference, which reduces memory usage by computing logits only for the position that predicts the next token, so that very long contexts fit on limited hardware
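The memory saving from the logits-masked head can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions, not the authors' implementation: the function names (`lm_head_full`, `lm_head_last_only`, `logits_bytes`), the toy sizes, and the 4-byte logit width are all hypothetical choices made for the example.

```python
import random


def matvec(weight, vector):
    """Multiply a (vocab, hidden) weight matrix by one hidden-state vector."""
    return [sum(w * h for w, h in zip(row, vector)) for row in weight]


def lm_head_full(hidden_states, weight):
    """Baseline: compute vocabulary logits for every position in the sequence."""
    return [matvec(weight, h) for h in hidden_states]


def lm_head_last_only(hidden_states, weight):
    """Masked variant: compute logits only for the final position,
    which is the only one needed to predict the next token."""
    return matvec(weight, hidden_states[-1])


def logits_bytes(num_positions, vocab_size, bytes_per_value=4):
    """Size of a logits buffer; bytes_per_value=4 assumes 32-bit floats."""
    return num_positions * vocab_size * bytes_per_value


# Toy sizes (assumptions for illustration only).
SEQ_LEN, HIDDEN_DIM, VOCAB_SIZE = 32, 8, 50

rng = random.Random(0)
hidden_states = [[rng.gauss(0, 1) for _ in range(HIDDEN_DIM)] for _ in range(SEQ_LEN)]
weight = [[rng.gauss(0, 1) for _ in range(HIDDEN_DIM)] for _ in range(VOCAB_SIZE)]

full_logits = lm_head_full(hidden_states, weight)
last_logits = lm_head_last_only(hidden_states, weight)

# Both paths agree on the next-token logits; only the buffer size differs.
print(last_logits == full_logits[-1])
print(logits_bytes(SEQ_LEN, VOCAB_SIZE), "bytes vs", logits_bytes(1, VOCAB_SIZE), "bytes")
```

The logits buffer shrinks by a factor equal to the sequence length. As a rough illustration with a hypothetical vocabulary of 150,000 entries, full logits for 1,000,000 positions at 4 bytes each would take about 600 GB, while a single position takes about 600 KB.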
Evaluation Highlights
  • Extends context length to 1 million tokens, supporting processing of over 4K video frames
  • Achieves 4x context length extension and 2x prefill speedup on a single node with 8 GPUs using optimized inference designs
  • Outperforms the proprietary GPT-4V on LongVideoBench (51.8 vs. roughly 50; the GPT-4V figure is an estimate read from charts and context) and matches state-of-the-art open models on short-context benchmarks such as MMBench (81.5) and MMMU (57.4) [Long-VITA-16K]
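The relationship between context length and frame count can be sketched with simple budget arithmetic. The helper `max_frames` below is hypothetical, and both the figure of 256 visual tokens per frame and the reading of "1 million" as 1,048,576 tokens are assumptions for illustration; the summary above does not state either.

```python
def max_frames(context_tokens, tokens_per_frame, reserved_text_tokens=0):
    """Number of whole frames that fit in a context window.

    reserved_text_tokens is the part of the window set aside for the
    prompt and the answer (an assumed parameter, not from the paper).
    """
    if tokens_per_frame <= 0:
        raise ValueError("tokens_per_frame must be positive")
    usable = max(context_tokens - reserved_text_tokens, 0)
    return usable // tokens_per_frame


# Assumed values: 1,048,576-token window, 256 tokens per frame.
print(max_frames(1_048_576, 256))
print(max_frames(1_048_576, 256, reserved_text_tokens=1024))
```

Under these assumptions the window holds on the order of four thousand frames, which matches the scale of the "4K video frames" figure. A lower per-frame token count would raise the number proportionally.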
Breakthrough Assessment
8/10
Strong engineering contribution scaling open-source multimodal context to 1M tokens. The release of training recipes, datasets (Comic-9K), and memory optimizations makes it a significant resource, though the architecture itself relies on established components.