Long-VITA: Scaling Large Multi-modal Models to 1 Million Tokens with Leading Short-Context Accuracy

📝 Paper Summary

Long-context Multi-modal Models, Vision-Language Alignment, Efficient Long-context Inference

Long-VITA scales open-source multi-modal models to process 1 million tokens (video/image/text) using a four-stage training pipeline and optimized inference techniques, achieving strong performance on both short and long-context tasks without proprietary data.

Core Problem

Most open-source vision-language models struggle with long-context inputs (like long videos or many images) compared to proprietary models like Gemini 1.5 Pro, and existing solutions often degrade performance on standard short-context tasks.

Why it matters:

Proprietary models can process 1 hour of video or 1M tokens, but open-source equivalents lag significantly, limiting public research and application
Existing long-context methods often focus solely on video, neglecting heavy multi-image scenarios or sacrificing static image quality
Token compression techniques used to handle long sequences often discard information and degrade performance

Concrete Example: When processing a long video or a comic book with hundreds of pages, standard models run out of context window or fail to recall specific details. Current open-source models usually cap at much shorter lengths or use compression that blurs fine-grained visual details needed for accurate Q&A.

Key Novelty

Phased Long-Context Scaling with Logits-Masked Inference

A four-stage training pipeline that starts with standard alignment and general knowledge, then progressively extends context length (128K -> 1M) using specialized long-context data (comics, movie summaries)
Introduction of a logits-masked language modeling head during inference that reduces memory usage by only computing logits for the specific next-token prediction, enabling massive context processing on limited hardware
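To make the memory saving concrete, the following back-of-the-envelope sketch compares the size of the logits tensor with and without masking. The sequence length (10^6) and vocabulary size (10^5) are the orders of magnitude quoted in this summary; the 2-byte element size (bf16/fp16) is an assumption, and the function name is hypothetical.

```python
def logits_bytes(num_positions: int, vocab_size: int, bytes_per_elem: int = 2) -> int:
    """Size in bytes of a [num_positions, vocab_size] logits tensor."""
    return num_positions * vocab_size * bytes_per_elem


SEQ_LEN = 10**6  # order of magnitude of a 1M-token context
VOCAB = 10**5    # order of magnitude of the vocabulary size

full = logits_bytes(SEQ_LEN, VOCAB)  # logits computed at every position
masked = logits_bytes(1, VOCAB)      # logits computed at the last position only

print(f"full:   {full / 1e9:.0f} GB")
print(f"masked: {masked / 1e3:.0f} KB")
```

Under these assumptions the full tensor would be far larger than a single accelerator's memory, while the masked version is negligible, which is why the optimization matters mainly at very long context lengths.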

Evaluation Highlights

Extends context length to 1 million tokens, supporting processing of over 4K video frames
Achieves 4x context length extension and 2x prefill speedup on a single node with 8 GPUs using optimized inference designs
Outperforms proprietary GPT-4V on LongVideoBench (51.8 vs. roughly 50, the latter estimated from charts and context) and matches state-of-the-art open models on short-context benchmarks such as MMBench (81.5) and MMMU (57.4) [Long-VITA-16K]
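The relationship between the token budget and the number of video frames can be sketched with simple arithmetic. The figure of 256 visual tokens per frame is an assumption made here for illustration (this summary does not state it), as is reading "1M" as 1024 x 1024 tokens; the helper name is hypothetical.

```python
def max_frames(context_tokens: int, tokens_per_frame: int, reserved_text_tokens: int = 0) -> int:
    """Number of whole frames that fit after reserving room for text tokens."""
    return (context_tokens - reserved_text_tokens) // tokens_per_frame


CONTEXT_TOKENS = 1024 * 1024  # assumption: "1M" read as 1024 * 1024
TOKENS_PER_FRAME = 256        # assumption: not stated in this summary

frames = max_frames(CONTEXT_TOKENS, TOKENS_PER_FRAME)
frames_with_prompt = max_frames(CONTEXT_TOKENS, TOKENS_PER_FRAME, reserved_text_tokens=1024)
print(frames, frames_with_prompt)
```

With these assumed values the budget works out to a little over four thousand frames, in line with the "over 4K video frames" claim above.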

Breakthrough Assessment

8/10

Strong engineering contribution scaling open-source multimodal context to 1M tokens. The release of training recipes, datasets (Comic-9K), and memory optimizations makes it a significant resource, though the architecture itself relies on established components.

⚙️ Technical Details

Problem Definition

Setting: Multi-modal autoregressive generation where the model takes images, videos, and text as input and generates text responses

Inputs: Interleaved sequence of text tokens and visual embeddings (from images or video frames), potentially up to 1 million tokens

Outputs: Text response (e.g., answer to a question, summary, or caption)

Pipeline Flow

Vision Encoder (InternViT-300M)
Vision-Language Projector (MLP)
Large Language Model (Qwen2.5-14B-Instruct)

System Modules

Vision Encoder (Input Processing)

Extract visual features from images and video frames

Model or implementation: InternViT-300M

Vision-Language Projector (Input Processing)

Project visual features into the LLM's word embedding space

Model or implementation: 2-layer MLP

Large Language Model

Process multimodal tokens and generate text response

Model or implementation: Qwen2.5-14B-Instruct

Novel Architectural Elements

Logits-Masked Language Modeling Head: Masks out the hidden states not needed for next-token prediction before the final projection layer, so the model avoids materializing a logits matrix on the order of 10^6 x 10^5 (sequence length x vocabulary size)
Context-Parallelism Distributed Inference: Splits the long context across devices during inference and computes attention in a distributed fashion, so input length is in principle bounded by the number of devices rather than by a single device's memory
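The idea behind the logits-masked head can be illustrated with a toy example in plain Python. This is a minimal sketch of the general technique (project only the final hidden state through the output layer), not the authors' implementation; all names and the toy sizes are hypothetical.

```python
from typing import List

Vector = List[float]


def project(hidden: Vector, lm_head: List[Vector]) -> Vector:
    """Logits for one position: one dot product per vocabulary row."""
    return [sum(h * w for h, w in zip(hidden, row)) for row in lm_head]


def full_logits(hidden_states: List[Vector], lm_head: List[Vector]) -> List[Vector]:
    """Baseline: logits for every position, shape [seq_len, vocab]."""
    return [project(h, lm_head) for h in hidden_states]


def masked_logits(hidden_states: List[Vector], lm_head: List[Vector]) -> Vector:
    """Masked variant: project only the final hidden state, shape [vocab]."""
    return project(hidden_states[-1], lm_head)


# Toy sizes: 4 positions, hidden size 3, vocabulary of 5 tokens.
hidden_states = [[0.1 * (t + 1), -0.2 * t, 0.5] for t in range(4)]
lm_head = [[0.3 * v, 1.0, -0.1 * v] for v in range(5)]

last = masked_logits(hidden_states, lm_head)
next_token = max(range(len(last)), key=last.__getitem__)  # greedy choice
```

The masked result is identical to the last row of the full computation, so next-token prediction is unchanged while the projection cost and output size no longer scale with sequence length.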

Modeling

Base Model: Qwen2.5-14B-Instruct (LLM) + InternViT-300M (Vision)

Training Method: Four-stage supervised fine-tuning pipeline

Objective Functions:

Purpose: Minimize the difference between generated tokens and ground truth text.

Formally: Standard autoregressive language modeling loss (Cross-Entropy).
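Written out, the standard next-token cross-entropy objective takes the form below. The symbols are notation introduced here for illustration rather than taken from the paper: x_t is the t-th target token, x_{<t} the preceding tokens, v the visual inputs, \theta the trainable parameters, and \mathcal{A} the set of positions that contribute to the loss (in supervised fine-tuning, typically the response tokens).

```latex
\mathcal{L}(\theta) = -\frac{1}{|\mathcal{A}|} \sum_{t \in \mathcal{A}} \log p_\theta\left(x_t \mid x_{<t}, v\right)
```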

Trainable Parameters: Vision projector only (Stage 1); full model, including the vision encoder and LLM (Stages 2, 3, and 4)

Training Data:

Stage 1 (Alignment): 32K length, Caption data only (Projector training)
Stage 2 (General Knowledge): 16K length, 40B tokens, Mix of VQA, Caption, OCR, Text-only (Full training)
Stage 3 (Long-Seq 128K): 128K length, 8B tokens, Long text + Comic-9K + Video
Stage 4 (Long-Seq 1M): 1M length, 4B tokens, Long text + MovieNet-Summary
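The four-stage schedule above can be captured as a small configuration structure, which makes the token budgets easy to compare. This is a sketch only: the class and field names are hypothetical, "K" is assumed to mean 1024 tokens and "1M" to mean 1024 x 1024, and the Stage 1 token budget is left as None because this summary gives no figure for it.

```python
from dataclasses import dataclass
from typing import Optional

K = 1024  # assumption: "K" means 1024 tokens


@dataclass(frozen=True)
class Stage:
    name: str
    seq_len: int                 # maximum sequence length in tokens
    token_budget: Optional[int]  # training tokens; None where no figure is given
    data: str


STAGES = [
    Stage("alignment", 32 * K, None, "caption data only"),
    Stage("general_knowledge", 16 * K, 40 * 10**9, "VQA, caption, OCR, text-only"),
    Stage("long_seq_128k", 128 * K, 8 * 10**9, "long text, Comic-9K, video"),
    Stage("long_seq_1m", K * K, 4 * 10**9, "long text, MovieNet-Summary"),
]

# Total over the stages whose budget is listed, and the long-context share of it.
total_known = sum(s.token_budget for s in STAGES if s.token_budget is not None)
long_context_share = sum(s.token_budget for s in STAGES[2:]) / total_known
print(f"{total_known / 1e9:.0f}B tokens listed, {long_context_share:.0%} in long-context stages")
```

Of the budgets listed, the two long-context stages account for a minority of the training tokens, with most spent on the general-knowledge stage.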

Key Hyperparameters:

learning_rate: 1.0e-5 (LLM/Projector in later stages), 1.0e-6 (Vision)
batch_size: 528 (Stage 1/2), 64 (Stage 3), 8 (Stage 4)
weight_decay: 0.0
adam_beta2: 0.999
gradient_clip: 1.0
rotary_base: 1,000,000
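The effect of the large rotary base can be illustrated by computing the wavelength of the slowest-rotating RoPE dimension pair. The sketch below uses the standard RoPE frequency formula; the head dimension of 128 is an assumption for illustration, and the function names are hypothetical. The base of 10,000 is included as the value used in the original RoPE formulation.

```python
import math
from typing import List


def rope_inv_freq(head_dim: int, base: float) -> List[float]:
    """Rotation frequency per dimension pair: base ** (-2i / head_dim)."""
    return [base ** (-2 * i / head_dim) for i in range(head_dim // 2)]


def longest_wavelength(head_dim: int, base: float) -> float:
    """Positions needed for the slowest-rotating pair to complete one turn."""
    return 2 * math.pi / rope_inv_freq(head_dim, base)[-1]


HEAD_DIM = 128  # assumption for illustration

for base in (10_000, 1_000_000):
    wavelength = longest_wavelength(HEAD_DIM, base)
    print(f"base={base:>9,}: longest wavelength ~ {wavelength:,.0f} positions")
```

Under these assumptions, a base of 1,000,000 gives a longest wavelength well beyond 1M positions, whereas a base of 10,000 gives one far below it, so the larger base keeps distant positions distinguishable within a 1M-token context.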

Compute: Training used 64 NPUs (Stage 1) to 128 NPUs (Stage 4). Inference tested on 8x80G GPUs and 8x96G GPUs.

Comparison to Prior Work

vs. LongVILA: Long-VITA supports 1M tokens (vs 256K) and includes image-heavy comic data, not just video
vs. LLaVA-OneVision: Long-VITA explicitly targets massive context (1M) via multi-stage training and specialized inference optimizations (logits masking)
vs. Kangaroo [not cited in paper]: Kangaroo uses curriculum training for long video; Long-VITA uses a 4-stage curriculum including comics and text-heavy data for broader multimodal long-context

Limitations

Extremely long context (1M) training requires significant computational resources (128 NPUs)
Performance on some short-context benchmarks (e.g., MMMU) drops slightly in the 1M model compared to the 16K model
Inference requires specialized hardware setup (multi-GPU) for the full 1M context capability

Reproducibility

Code: https://github.com/Tencent/VITA

Publicly available: Long-VITA model weights, code, Comic-9K dataset, MovieNet-Summary dataset. Training relies entirely on open-source datasets (listed in Table 1). No proprietary data used.

📊 Experiments & Results

Evaluation Setup

Evaluation across diverse benchmarks covering short-context (images), medium-context, and long-context (video/interleaved) tasks.

Benchmarks:

MMBench (General multimodal capability)
MMMU (Multimodal multi-discipline understanding)
LongVideoBench (Long-context video understanding)
Video-MME (Video understanding across short, medium, and long videos)
Comic-9K (Multi-image summarization) [New]

Metrics:

Accuracy
Score (custom per benchmark)
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Performance on standard short-context multimodal benchmarks shows Long-VITA remains competitive with state-of-the-art open models despite being optimized for long context.
MMBench (MMB)	Accuracy	80.9	81.5	+0.6
MMMU	Accuracy	51.2	57.0	+5.8
Long-context video understanding results demonstrate scaling capabilities.
LongVideoBench	Score	57.1	60.9	+3.8
Video-MME	Overall Score	63.7	66.4	+2.7
Video-MME (Long split)	Score	54.8	58.8	+4.0
Inference efficiency tests show significant gains from the Logits-Masked Head.
Internal Inference Test (1.6M tokens)	Max Sequence Length	103,000	420,000	+317,000
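The Δ column is the "This Paper" value minus the "Baseline" value. The short check below recomputes each difference from the rows of the table and also derives the ratio between the two maximum sequence lengths; the variable names are hypothetical and the numbers are copied from the table as given.

```python
# (benchmark, baseline, this paper, reported delta), copied from the table
rows = [
    ("MMBench", 80.9, 81.5, 0.6),
    ("MMMU", 51.2, 57.0, 5.8),
    ("LongVideoBench", 57.1, 60.9, 3.8),
    ("Video-MME", 63.7, 66.4, 2.7),
    ("Video-MME (Long split)", 54.8, 58.8, 4.0),
    ("Max sequence length", 103_000, 420_000, 317_000),
]

# Round to one decimal place to avoid floating-point noise in the subtraction.
mismatches = [name for name, base, ours, delta in rows if round(ours - base, 1) != delta]

# Ratio of the maximum sequence lengths with and without the optimization.
extension = 420_000 / 103_000
print(mismatches, f"{extension:.2f}x")
```

All reported deltas agree with the recomputed differences, and the sequence-length ratio comes to roughly 4x, consistent with the context-length extension quoted under Evaluation Highlights.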

Main Takeaways

Long-VITA achieves a balance between short-context precision and long-context capacity, with the 128K and 16K models often performing best on standard benchmarks.
The Logits-Masked Language Modeling Head is a crucial engineering optimization that enables processing 1M+ tokens on a single 8-GPU node by drastically reducing memory overhead.
Training on diverse long-context data (comics, long videos) prevents degradation usually seen when extending context, allowing the model to handle 4,000+ video frames.
Performance on extremely long contexts (1M model) sees some regression on short tasks (e.g., MMMU) compared to the 16K/128K versions, suggesting a trade-off at extreme scales.

📚 Prerequisite Knowledge

Prerequisites

Transformer architecture and Large Language Models (LLMs)
Vision-Language Pre-training (alignment and instruction tuning)
Distributed training strategies (Data/Tensor/Pipeline/Context Parallelism)
Flash Attention and memory optimization techniques

Key Terms

LMM: Large Multi-Modal Model—an AI model capable of processing and generating content across multiple modalities like text, images, and video

Context Parallelism: A distributed training/inference technique where the sequence of tokens is split across multiple GPUs to handle contexts too long to fit in a single GPU's memory
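The partitioning step of context parallelism can be sketched as follows. This covers only how a sequence might be divided into contiguous shards, one per device; a real implementation must also exchange keys and values between devices so that attention sees the whole context, which is omitted here. The function name is hypothetical.

```python
from typing import List, Tuple


def shard_bounds(seq_len: int, world_size: int) -> List[Tuple[int, int]]:
    """Contiguous [start, end) token ranges, one per device, sizes differing by at most one."""
    base, extra = divmod(seq_len, world_size)
    bounds = []
    start = 0
    for rank in range(world_size):
        size = base + (1 if rank < extra else 0)  # first `extra` ranks take one more token
        bounds.append((start, start + size))
        start += size
    return bounds


bounds = shard_bounds(seq_len=1_000_000, world_size=8)
print(bounds[0], bounds[-1])
```

With eight devices, each one holds the activations and cache for one eighth of the tokens, which is what lets the total context exceed a single device's memory.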

Logits-Masked Language Modeling Head: An optimization where the final classification layer (head) only computes predictions for relevant positions (e.g., the last token) rather than the entire sequence, saving significant memory

Prefill: The initial phase of LLM inference where the model processes the input prompt (all history tokens) to generate the Key-Value cache before generating new tokens
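The size of the Key-Value cache built during prefill can be estimated with simple arithmetic, which shows why very long contexts call for multiple devices. The layer count, number of key-value heads, head dimension, and 2-byte element size below are illustrative assumptions for a 14B-class model with grouped-query attention, not figures taken from the paper; the function name is hypothetical.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int, bytes_per_elem: int = 2) -> int:
    """Bytes of cache per token: one key and one value vector per layer and KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


# Assumed configuration, for illustration only.
NUM_LAYERS = 48
NUM_KV_HEADS = 8
HEAD_DIM = 128

per_token = kv_bytes_per_token(NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM)
total = per_token * 1_000_000  # cache for a 1M-token prompt
per_device = total / 8         # if split evenly across 8 devices

print(f"{per_token} bytes/token, {total / 1e9:.1f} GB total, {per_device / 1e9:.1f} GB per device")
```

Under these assumptions the cache alone approaches 200 GB for a 1M-token prompt, more than a single 80 GB device can hold but manageable when divided across eight.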

SFT: Supervised Fine-Tuning—training a model on labeled examples (instructions and outputs) to improve its ability to follow user commands

RoPE: Rotary Positional Embeddings—a method for encoding token positions in Transformers that generalizes better to sequence lengths not seen during training

MME: A comprehensive evaluation benchmark for multimodal large language models

Hallucination: When a model generates incorrect or nonsensical information not supported by the input (e.g., describing an object not present in the image)

VQA: Visual Question Answering—the task of answering natural language questions about an image