InfiniteVL: Synergizing Linear and Sparse Attention for Highly-Efficient, Unlimited-Input Vision-Language Models

📝 Paper Summary

Long-context Vision-Language Models (VLMs)
Efficient Attention Mechanisms
Linear Complexity Architectures

InfiniteVL combines Gated DeltaNet for linear-complexity long-term memory with Sliding Window Attention for local detail, enabling infinite context processing with constant memory usage and high throughput.

Core Problem

Existing VLMs struggle with a trade-off: window-based models lose long-term context, while linear attention models often lose fine-grained visual details needed for OCR and document tasks.

Why it matters:

The quadratic attention cost and ever-growing KV cache of Transformers make it impractical to process long videos or continuous agent interactions on edge devices.
Pure linear-complexity models (e.g., Mamba or RWKV variants) have historically underperformed full-attention models on information-intensive tasks such as OCR.
Resource constraints on edge devices call for a constant memory footprint to prevent Out-of-Memory errors during long streaming sessions.

Concrete Example: In streaming video understanding, a standard Transformer-based VLM (e.g., Qwen2.5-VL-3B) decays from ~10 FPS to <1 FPS after 200 frames and crashes (OOM) at frame 294 due to KV cache growth. InfiniteVL maintains a stable 24 FPS indefinitely.
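The memory behavior behind this example can be illustrated with a back-of-the-envelope calculation. The sketch below is not taken from the paper: the layer count, head counts, head dimension, and tokens per frame are hypothetical values chosen for illustration, and it ignores weights, activations, and the bounded cache of the sliding-window layers. It only contrasts a cache that grows with every token against a state whose size is fixed.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes needed to keep keys and values for every token in every layer."""
    # The leading factor 2 accounts for storing both K and V.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * num_tokens


def recurrent_state_bytes(num_layers, num_heads, key_dim, value_dim, bytes_per_elem=2):
    """Bytes for one fixed (key_dim x value_dim) state per head; no length term."""
    return num_layers * num_heads * key_dim * value_dim * bytes_per_elem


# Hypothetical dimensions, for illustration only (not the paper's configuration).
LAYERS, KV_HEADS, HEADS, HEAD_DIM = 36, 2, 16, 128
TOKENS_PER_FRAME = 256

for frames in (1, 100, 300):
    tokens = frames * TOKENS_PER_FRAME
    kv = kv_cache_bytes(tokens, LAYERS, KV_HEADS, HEAD_DIM)
    state = recurrent_state_bytes(LAYERS, HEADS, HEAD_DIM, HEAD_DIM)
    print(f"{frames:>4} frames: KV cache {kv / 2**20:8.1f} MiB | fixed state {state / 2**20:6.1f} MiB")
```

Under these assumptions the KV cache scales in direct proportion to the number of frames seen, while the recurrent state stays the same size no matter how long the stream runs.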

Key Novelty

Hybrid Gated DeltaNet + Sliding Window Attention (InfiniteVL)

Interleaves Gated DeltaNet layers (efficient, compressed global memory state) with Sliding Window Attention layers (precise local context) to balance long-range retention and fine-grained perception.
Uses a 3-stage training pipeline (Distillation → Instruction SFT → Long-sequence SFT) to transfer knowledge from Transformers to the linear architecture efficiently.

Architecture

The InfiniteVL architecture pipeline and the internal structure of the Hybrid Block.

Evaluation Highlights

Achieves >3.6x inference speedup vs. Transformer baselines at 50K context length while maintaining a constant memory footprint.
Sustains stable 24 FPS throughput in streaming video, whereas baselines crash after ~300 frames.
Approaches the performance of leading Transformer VLMs (e.g., Qwen2.5-VL-3B) on OCR and Document Understanding benchmarks, areas where linear models typically struggle (OCRBench 703 vs. a baseline of 727; DocVQA 86.5 vs. 89.0).

Breakthrough Assessment

8/10

Successfully bridges the gap between linear attention efficiency and Transformer-level performance on detail-oriented tasks, solving the 'detail vs. context' trade-off while enabling constant-memory streaming.

⚙️ Technical Details

Problem Definition

Setting: Multimodal sequence modeling where input X (images/video + text) maps to text output Y, with potentially unlimited sequence length L.

Inputs: Multimodal token sequence X (interleaved visual and text tokens)

Outputs: Textual response Y (e.g., answer, caption)

Pipeline Flow

Vision Encoder (processes images/video)
Projection MLP (maps visual tokens to LLM space)
Hybrid LLM Backbone (processes multimodal sequence)

System Modules

Vision Encoder (Input Processing)

Extract visual features from images or video frames

Model or implementation: Qwen2.5-VL Vision Encoder

Projection MLP (Input Processing)

Project visual tokens to the dimension of the language model

Model or implementation: Lightweight MLP

Hybrid LLM Backbone

Autoregressive generation combining local and global context

Model or implementation: 9 Hybrid Blocks (1 SWA layer + 3 Gated DeltaNet layers per block)
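The layer layout described here can be written out explicitly. This is a minimal sketch of the interleaving pattern only; the function name and the string labels are hypothetical and do not come from the authors' code.

```python
def build_layer_schedule(num_blocks=9, gdn_per_block=3):
    """Return the layer types in order: each block is 1 SWA layer then N Gated DeltaNet layers."""
    schedule = []
    for _ in range(num_blocks):
        schedule.append("swa")                                 # precise local context
        schedule.extend(["gated_deltanet"] * gdn_per_block)    # compressed global memory
    return schedule


schedule = build_layer_schedule()
print(len(schedule))   # 36
print(schedule[:4])    # ['swa', 'gated_deltanet', 'gated_deltanet', 'gated_deltanet']
```

With 9 blocks of 4 layers each, the backbone has 36 layers in total, 9 of them attention layers and 27 of them linear layers.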

Novel Architectural Elements

Interleaved layer design: 1 Sliding Window Attention (SWA) layer followed by 3 Gated DeltaNet layers in each block (repeated 9 times).
Integration of 1D convolution (window size 4) and output gate within the Gated DeltaNet module for enhanced expressiveness.
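To make the constant-memory property concrete, here is a minimal single-head NumPy sketch of a gated delta-rule update as it is commonly written for Gated DeltaNet: the state is decayed, the component stored along the current key is partially erased, and a new key-value association is written. The scalar gates `alpha` and `beta` are fixed constants here (in a real layer they are predicted per token), and the 1D convolution and output gate mentioned above are omitted. Treat it as an illustration under those assumptions, not as the paper's implementation.

```python
import numpy as np


def gated_delta_step(S, q, k, v, alpha, beta):
    """One recurrent step. S: (d_v, d_k) memory; q, k: (d_k,); v: (d_v,)."""
    k = k / np.linalg.norm(k)  # unit-norm key keeps the erase step well behaved
    # Decay old memory and erase what is stored along k: alpha * S (I - beta k k^T)
    S = alpha * (S - beta * np.outer(S @ k, k))
    # Write the new association: + beta v k^T
    S = S + beta * np.outer(v, k)
    return S, S @ q  # updated state and the read-out for query q


rng = np.random.default_rng(0)
d_k, d_v = 8, 4
S = np.zeros((d_v, d_k))
for _ in range(1000):  # 1000 tokens, yet the state never grows
    q, k, v = rng.normal(size=d_k), rng.normal(size=d_k), rng.normal(size=d_v)
    S, o = gated_delta_step(S, q, k, v, alpha=0.99, beta=0.5)
print(S.shape)  # (4, 8)
```

The state has the same shape after 1000 tokens as after one, which is what lets the model stream indefinitely without a growing cache.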

Modeling

Base Model: Initialized from Qwen2.5-VL-3B (reuses its Vision Encoder and a subset of its weights)

Training Method: Three-stage pipeline: Distillation Pretraining → Instruction SFT → Long-sequence SFT

Objective Functions:

Purpose: Distill layer-wise representations from teacher to student.
Formally: MSE loss between student and teacher hidden states for corresponding layers.

Purpose: Distill output probability distributions.
Formally: KL divergence between the teacher and student output distributions (computed from their logits).

Purpose: Standard next-token prediction for fine-tuning.
Formally: Cross-entropy loss on target tokens.
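The three objectives can be sketched in NumPy as follows. The summary does not state the KL direction, any temperature, or how the terms are weighted, so the choice of KL(teacher || student) with no temperature is an assumption made for this example, and all function names are hypothetical.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def hidden_mse(student_h, teacher_h):
    """Layer-wise distillation: mean squared error between hidden states."""
    return float(np.mean((student_h - teacher_h) ** 2))


def kl_teacher_student(teacher_logits, student_logits):
    """Output distillation: KL(teacher || student), averaged over positions (assumed direction)."""
    log_p = log_softmax(teacher_logits)
    log_q = log_softmax(student_logits)
    return float(np.mean(np.sum(np.exp(log_p) * (log_p - log_q), axis=-1)))


def cross_entropy(logits, targets):
    """Next-token prediction: logits (T, V), integer targets (T,)."""
    log_q = log_softmax(logits)
    return float(-np.mean(log_q[np.arange(len(targets)), targets]))


rng = np.random.default_rng(0)
teacher = rng.normal(size=(5, 10))                   # 5 positions, vocabulary of 10
student = teacher + 0.1 * rng.normal(size=(5, 10))   # a student close to the teacher
print(hidden_mse(student, teacher), kl_teacher_student(teacher, student))
print(cross_entropy(student, rng.integers(0, 10, size=5)))
```

Both distillation losses are zero when the student reproduces the teacher exactly, which is the sense in which they pull the linear-attention student toward the Transformer teacher.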

Adaptation: LoRA used specifically in Stage III (Long-sequence SFT)

Training Data:

Stage I: 1M Captions & QA pairs (max len 8192)
Stage II: 8M Multimodal QA pairs (max len 8192)
Stage III: 200K Long-video QA (max len 32K) + 800K SFT mix

Key Hyperparameters:

learning_rate: Stage I: 2e-4; Stage II: 5e-5; Stage III: 2e-5
batch_size: Stage I: 64; Stage II: 256; Stage III: 64
weight_decay: 0.01
optimizer: AdamW
precision: bfloat16 (BF16)

Compute: Trained on NVIDIA H20 GPUs. Inference tested on NVIDIA RTX 4090.

Comparison to Prior Work

vs. Qwen2.5-VL (Transformer): InfiniteVL uses constant memory (vs. linear growth) and linear complexity (vs. quadratic), enabling infinite streaming.
vs. Pure Linear Models (Mamba/RWKV): InfiniteVL incorporates Sliding Window Attention layers to recover local fine-grained details (OCR, charts) where pure linear models typically fail.
vs. Window-only models: InfiniteVL retains global context via Gated DeltaNet states, avoiding the 'amnesia' of windowed approaches.

Limitations

Initially underperforms full Transformers on short sequences before sufficient fine-tuning.
Requires a complex three-stage training recipe to work effectively.
Hybrid architecture is non-standard, potentially complicating deployment compared to pure Transformers.

Reproducibility

Training data sources are open-source (FineVision, LLaVA-OneVision, etc.). Code URL is not explicitly provided in the text snippet. Model architecture details (layer counts, head counts) are specified.

📊 Experiments & Results

Evaluation Setup

Evaluated on standard multimodal benchmarks and custom long-context video tasks.

Benchmarks:

OCRBench (OCR)
DocVQA (Document Understanding)
Video-MME (Long Video Understanding)
LongVideoBench (Long Video Understanding)
MMBench (General Multimodal)

Metrics:

Accuracy
Score (benchmark specific)
Frames Per Second (FPS)
Memory Usage (GB)

Statistical methodology: Not explicitly reported in the paper

Key Results

Comparison against similar-sized Transformer and Linear baselines on standard multimodal benchmarks:

Benchmark	Metric	Baseline	This Paper	Δ
OCRBench	Score	727	703	-24
DocVQA	Score	89.0	86.5	-2.5

Efficiency metrics demonstrating the advantages of the linear architecture:

Setting	Metric	Baseline	This Paper	Δ
Streaming Inference (RTX 4090)	Throughput (FPS)	1	24	+23
Long Context Inference (300K tokens)	Memory (GB)	OOM	9	N/A

Experiment Figures

Performance on video benchmarks vs. frame count, and inference efficiency (latency/memory) vs. sequence length.

Main Takeaways

InfiniteVL effectively bridges the gap between efficient linear models and high-performing Transformers, especially on detail-heavy tasks like OCR.
The hybrid architecture provides a 'best of both worlds' solution: constant memory/latency for long contexts (from DeltaNet) and high local precision (from Window Attention).
Streaming video understanding is a standout capability, with the model maintaining real-time speeds indefinitely where Transformers fail rapidly.

📚 Prerequisite Knowledge

Prerequisites

Transformer architecture (Self-Attention mechanism)
Linear Attention and State Space Models (SSMs)
Vision-Language Model pretraining pipelines

Key Terms

Gated DeltaNet: A linear attention variant that uses a recurrent update rule with gating and Householder-like rotations to maintain a fixed-size compressed memory state instead of a growing KV cache.

SWA: Sliding Window Attention—attention mechanism restricted to a fixed local window of recent tokens, reducing complexity from quadratic to linear but losing distant context.
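A sliding-window attention mask can be built in a few lines. The sketch below is generic rather than specific to InfiniteVL, and the window size of 3 is chosen only to keep the printed matrix small.

```python
import numpy as np


def sliding_window_mask(seq_len, window):
    """mask[i, j] is True when query i may attend to key j (causal, last `window` tokens only)."""
    i = np.arange(seq_len)[:, None]   # query positions as a column
    j = np.arange(seq_len)[None, :]   # key positions as a row
    return (j <= i) & (j > i - window)


mask = sliding_window_mask(seq_len=6, window=3)
print(mask.astype(int))
```

Each row contains at most `window` ones, so the per-token cost and the cache that must be kept are bounded by the window size rather than by the sequence length.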

KV cache: Key-Value cache—storage of intermediate attention states in Transformers; grows linearly with sequence length, causing memory bottlenecks.

RoPE: Rotary Positional Embeddings—a method for encoding position information in Transformer models.

Distillation: Training a smaller or more efficient 'student' model to mimic the outputs of a larger 'teacher' model.

SFT: Supervised Fine-Tuning—training on labeled instruction-following data.

LoRA: Low-Rank Adaptation—parameter-efficient fine-tuning method that freezes main weights and trains small rank-decomposition matrices.
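As an illustration of the idea, the following NumPy sketch wraps a frozen weight matrix with a trainable low-rank update. The rank, scaling factor `alpha`, and initialization scale are hypothetical values; the summary does not report the LoRA settings used in Stage III.

```python
import numpy as np


class LoRALinear:
    """Frozen linear layer plus a trainable low-rank update: W + (alpha / rank) * B @ A."""

    def __init__(self, weight, rank=4, alpha=8.0, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = weight.shape
        self.weight = weight                                  # frozen base weights
        self.A = rng.normal(scale=0.01, size=(rank, d_in))    # trainable, small random init
        self.B = np.zeros((d_out, rank))                      # trainable, zero init
        self.scale = alpha / rank

    def __call__(self, x):
        # x: (n, d_in). The low-rank path adds scale * x A^T B^T to the frozen output.
        return x @ self.weight.T + self.scale * (x @ self.A.T) @ self.B.T


rng = np.random.default_rng(1)
W = rng.normal(size=(64, 128))
layer = LoRALinear(W)
x = rng.normal(size=(2, 128))
print(layer(x).shape)                           # (2, 64)
print(layer.A.size + layer.B.size, W.size)      # 768 8192
```

Because `B` starts at zero, the adapted layer initially reproduces the frozen layer exactly, and only the 768 adapter parameters (versus 8192 frozen ones in this toy case) would be updated during fine-tuning.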

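Householder rotation: The Householder-like linear transformation used in Gated DeltaNet to reorient the memory matrix, preventing low-rank collapse during updates. Strictly speaking it is a reflection-style map of the form I − β k kᵀ (for a key vector k and write strength β) rather than a pure rotation.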