DocVLM: Make Your VLM an Efficient Reader

📝 Paper Summary

Document Understanding · Vision-Language Models (VLMs) · Efficient Multimodal Learning

DocVLM integrates a compressed OCR modality into Vision-Language Models to achieve high-performance document understanding with significantly fewer visual tokens, enabling efficient processing of high-resolution and multi-page documents.

Core Problem

Standard Vision-Language Models struggle with document understanding because high-resolution images require an excessive number of visual tokens, while low-resolution inputs lose critical text details.

Why it matters:

Document analysis requires high-resolution inputs to read dense text, which is computationally expensive because transformer self-attention cost grows quadratically with the number of tokens.
Existing methods that reduce image resolution or token counts (like downsampling) suffer significant performance drops in text-heavy tasks.
Feeding raw OCR text directly into the prompt loses spatial layout context and creates prohibitively long sequences for multi-page documents.
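
A rough back-of-the-envelope sketch of the cost argument above. It assumes 256 visual tokens per 448x448 tile (the InternVL2 figure quoted in this summary) and counts only the pairwise attention term, which scales with the square of the sequence length. The function names and the 12-tile setting are illustrative assumptions, not taken from the paper.

```python
def visual_tokens(num_tiles: int, tokens_per_tile: int = 256) -> int:
    """Visual tokens produced by `num_tiles` tiles (256 per tile is assumed)."""
    return num_tiles * tokens_per_tile


def relative_attention_cost(n_tokens: int, baseline_tokens: int) -> float:
    """Ratio of pairwise self-attention interactions, which scale as n^2."""
    return (n_tokens / baseline_tokens) ** 2


base = visual_tokens(1)
for tiles in (1, 4, 12):  # 12 tiles is a hypothetical high-resolution setting
    n = visual_tokens(tiles)
    print(f"{tiles:>2} tiles -> {n:>4} tokens, "
          f"{relative_attention_cost(n, base):.0f}x attention cost")
```

Real inference cost also includes terms that grow linearly with sequence length (for example the feed-forward layers), so the ratio is an upper-end intuition rather than a measured speedup.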

Concrete Example: When InternVL2 is limited to a single 448x448 tile (256 visual tokens), its DocVQA score drops to 56.0 ANLS because the text becomes illegible. DocVLM raises this to 86.6 ANLS under the same visual budget by injecting compressed OCR queries.

Key Novelty

Instruction-Aware OCR Compression

Uses a separate OCR encoder to process text and bounding box data, then compresses this variable-length sequence into a fixed set of learnable queries (typically 64) via cross-attention.
These compressed queries are injected into the VLM alongside visual tokens, providing high-fidelity text/layout information without the computational cost of high-resolution image tokens.

Architecture

Figure: The DocVLM inference pipeline, integrating OCR compression with a standard VLM.

Evaluation Highlights

+30.6 ANLS points on DocVQA (56.0 → 86.6) when integrated with InternVL2 under a strict 256 visual token limit.
Achieves a state-of-the-art 86.3 ANLS on MP-DocVQA (multi-page) using 80% fewer tokens than standard high-resolution approaches.
Outperforms the Qwen2-VL baseline on TextVQA (82.8% vs 79.4%) while restricting DocVLM's visual input to 576 tokens.

Breakthrough Assessment

8/10

Highly practical solution for the resolution-efficiency trade-off in VLMs. The ability to compress OCR data into just 64 tokens while beating full-resolution baselines is a significant efficiency breakthrough for document tasks.

⚙️ Technical Details

Problem Definition

Setting: Visual Document Understanding (VDU), where a model must answer questions about a document image containing dense text and complex layout.

Inputs: Document image I, User instruction/question Q, Extracted OCR data (text + bounding boxes).

Outputs: Textual answer A.

Pipeline Flow

Input Processing: Image → Visual Encoder; OCR Data → OCR Encoder
Compression: OCR Features + Learnable Queries → Compressed OCR Tokens
Integration: Compressed OCR Tokens + Visual Tokens → LLM → Answer
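
The integration step can be pictured as a concatenation along the sequence axis. The sketch below is a minimal illustration with toy vectors: the ordering [OCR | visual | instruction], the embedding width, and the 12-token instruction are assumptions of this example, since the summary does not specify them.

```python
from typing import List

Vector = List[float]


def build_llm_input(ocr_tokens: List[Vector],
                    visual_tokens: List[Vector],
                    text_embeds: List[Vector]) -> List[Vector]:
    """Concatenate the three token streams along the sequence axis.

    The order [OCR | visual | instruction] is an assumption of this sketch.
    """
    streams = [ocr_tokens, visual_tokens, text_embeds]
    dims = {len(vec) for stream in streams for vec in stream}
    if len(dims) != 1:
        raise ValueError("all tokens must share the LLM embedding width")
    return ocr_tokens + visual_tokens + text_embeds


D_MODEL = 8  # toy embedding width; real LLMs use thousands of dimensions
ocr_tokens = [[0.0] * D_MODEL for _ in range(64)]      # compressed OCR queries
visual_tokens = [[1.0] * D_MODEL for _ in range(256)]  # one 448x448 tile
text_embeds = [[2.0] * D_MODEL for _ in range(12)]     # hypothetical instruction

llm_input = build_llm_input(ocr_tokens, visual_tokens, text_embeds)
print(len(llm_input))  # 64 + 256 + 12 = 332
```

The point of the design is visible in the lengths: the OCR branch contributes a constant 64 positions no matter how much text the page contains.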

System Modules

OCR System (Input Processing)

Extract text and 2D bounding box coordinates from the document image

Model or implementation: External OCR engine (implied, not specified)
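
As an illustration of what the OCR stage hands to the OCR encoder, the sketch below models each word as text plus a pixel bounding box and rescales the box to a 0-1000 grid. The grid size follows a convention common in layout-aware models and is an assumption here; the summary does not say which OCR engine or coordinate scheme DocVLM uses, and `OcrWord` and `normalize_box` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1)


@dataclass(frozen=True)
class OcrWord:
    text: str
    box: Box  # pixel coordinates on the page image


def normalize_box(box: Box, page_width: int, page_height: int,
                  scale: int = 1000) -> Box:
    """Map pixel coordinates to an integer grid so layout is resolution-independent."""
    x0, y0, x1, y1 = box
    return (
        x0 * scale // page_width,
        y0 * scale // page_height,
        x1 * scale // page_width,
        y1 * scale // page_height,
    )


words = [
    OcrWord("Invoice", (100, 50, 300, 90)),
    OcrWord("Total:", (100, 700, 220, 740)),
]
for word in words:
    print(word.text, normalize_box(word.box, page_width=1000, page_height=1400))
```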

Visual Encoder (Input Processing)

Process the image to generate visual embeddings

Model or implementation: Varies (e.g., SigLIP for LLaVA, Qwen2-VL's own vision encoder for Qwen2-VL)

OCR Encoder (OCR Integration)

Encode OCR text and layout information

Model or implementation: DocFormerV2 (encoder only, visual branch removed)

Query Compression Mechanism (OCR Integration)

Compress variable-length OCR features into a fixed small set of tokens

Model or implementation: Cross-attention layer
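
To show why the output length is fixed regardless of how much text is on the page, here is a minimal single-head cross-attention sketch in plain Python. It is a simplification under stated assumptions: keys and values are the raw OCR features (no learned projections), the queries are random stand-ins for learned parameters, and the instruction conditioning mentioned in this summary is omitted.

```python
import math
import random
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def compress(queries: List[Vector], ocr_feats: List[Vector]) -> List[Vector]:
    """Each query attends over all OCR features and returns one output vector."""
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for q in queries:
        weights = softmax([dot(q, k) * scale for k in ocr_feats])
        out.append([sum(w * v[j] for w, v in zip(weights, ocr_feats))
                    for j in range(dim)])
    return out


random.seed(0)
DIM, NUM_QUERIES = 16, 64
# Stand-ins for the learnable queries; in training these would be optimized.
queries = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_QUERIES)]

for n_words in (30, 500):  # short and long documents
    ocr_feats = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(n_words)]
    compressed = compress(queries, ocr_feats)
    print(n_words, "OCR features ->", len(compressed), "tokens")
```

Both documents come out as 64 vectors because the output length is set by the number of queries, not by the number of OCR features.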

LLM

Generate the final answer using combined visual and OCR context

Model or implementation: Varies (e.g., Qwen2-7B, LLaMA-3-8B)

Novel Architectural Elements

Parallel OCR branch that injects compressed layout/text features directly into the LLM input space
Instruction-aware compression mechanism using learnable queries to distill OCR data into a fixed set of tokens (64 by default)

Modeling

Base Model: Evaluated on LLaVA-OneVision, InternVL2 (1B/2B/8B/26B), and Qwen2-VL-7B

Training Method: Two-stage alignment training while keeping the base VLM frozen

Objective Functions:

Purpose: Standard autoregressive language modeling.

Formally: Next-token prediction loss on the answer text.
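
A minimal sketch of this objective, assuming the loss is averaged over answer positions only (our reading of "on the answer text"): the mean of -log p(a_t | earlier tokens, question, visual tokens, OCR tokens) over the answer tokens a_t. The function name and the toy inputs are illustrative.

```python
import math
from typing import Sequence


def answer_nll(logits: Sequence[Sequence[float]],
               targets: Sequence[int],
               answer_mask: Sequence[bool]) -> float:
    """Mean negative log-likelihood over positions where answer_mask is True.

    logits[t] are the model's scores for the token at position t, assumed to be
    already shifted so they are conditioned only on earlier tokens.
    """
    total, count = 0.0, 0
    for row, target, is_answer in zip(logits, targets, answer_mask):
        if not is_answer:
            continue  # prompt, visual, and OCR positions do not contribute
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[target]
        count += 1
    if count == 0:
        raise ValueError("answer_mask selects no positions")
    return total / count


# Toy example: vocabulary of 4 tokens, 3 positions, last two are answer tokens.
logits = [[2.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
targets = [0, 1, 2]
mask = [False, True, True]
loss = answer_nll(logits, targets, mask)
print(round(loss, 4))  # uniform logits over 4 tokens give log(4), about 1.3863
```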

Adaptation: Train only OCR Encoder, Learnable Queries, and Projection Layer

Trainable Parameters: OCR Encoder (344M) + Projection/Queries (small)

Training Data:

Stage 1 (OCR-LLM Alignment): Text-centric tasks (DocVQA, InfoVQA, ST-VQA, TextVQA, OCR-VQA, ChartQA, TextCaps, TAT-DQA). No images fed to VLM.
Stage 2 (Vision Alignment): Adds visual datasets (COCO Caption, VQA-V2) and feeds images to VLM to align modalities.
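
The two stages can be summarized as a small configuration sketch. The dataset lists come from this summary; the module names, the `StageConfig` type, and the reading that Stage 2 keeps the Stage 1 datasets and the same trainable set are assumptions of this example.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class StageConfig:
    name: str
    datasets: Tuple[str, ...]
    feed_images: bool            # whether images are passed to the VLM
    trainable: Tuple[str, ...]   # modules that receive gradient updates
    frozen: Tuple[str, ...]      # base VLM modules kept fixed


TEXT_CENTRIC = ("DocVQA", "InfoVQA", "ST-VQA", "TextVQA",
                "OCR-VQA", "ChartQA", "TextCaps", "TAT-DQA")
DOCVLM_MODULES = ("ocr_encoder", "learnable_queries", "projection")
BASE_VLM = ("visual_encoder", "llm")

STAGES = (
    StageConfig("ocr_llm_alignment", TEXT_CENTRIC,
                feed_images=False, trainable=DOCVLM_MODULES, frozen=BASE_VLM),
    StageConfig("vision_alignment", TEXT_CENTRIC + ("COCO Caption", "VQA-V2"),
                feed_images=True, trainable=DOCVLM_MODULES, frozen=BASE_VLM),
)


def is_trainable(stage: StageConfig, module: str) -> bool:
    return module in stage.trainable


for stage in STAGES:
    print(stage.name, "| images:", stage.feed_images,
          "| datasets:", len(stage.datasets))
```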

Key Hyperparameters:

num_learnable_queries: 64
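visual_token_limit_internvl: 256 (1 tile) or 1280 (4 tiles; at 256 tokens per tile, 1280 corresponds to five 448x448 views, presumably the four tiles plus a global thumbnail)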
visual_token_limit_qwen: 256 or 576

Compute: Not reported in the paper

Comparison to Prior Work

vs. Standard VLMs (Qwen2-VL, InternVL2): DocVLM uses significantly fewer visual tokens (e.g., 256 vs thousands) by offloading text reading to the compressed OCR module.
vs. Text-Prompting OCR (e.g., prompted LLaVA): DocVLM compresses OCR into fixed vectors (64 tokens) preserving layout, whereas text prompting creates massive token sequences that increase latency.
vs. TokenPacker/DocCompressor: DocVLM compresses extracted OCR metadata, whereas these methods compress visual features (ViT outputs), which often degrades text legibility.

Limitations

Relies on the quality of an external OCR system; poor OCR will degrade performance.
Trained only on single-page data (though shown to generalize to multi-page).
Requires an additional encoder (DocFormerV2), adding some parameter overhead (344M) compared to pure vision-only inference.

Reproducibility

The text does not mention a code release. DocFormerV2 and the base VLMs are open-source. OCR system details (e.g., which OCR engine was used) are not specified in the main text.

📊 Experiments & Results

Evaluation Setup

Document Visual Question Answering across single and multi-page documents.

Benchmarks:

DocVQA (Document VQA)
TextVQA (Scene Text VQA)
InfoVQA (Infographic VQA)
MP-DocVQA (Multi-page Document VQA)
DUDE (Multi-page Document Understanding)

Metrics:

ANLS
VQAScore (for TextVQA)
CIDEr (for TextCaps)

Statistical methodology: Not explicitly reported in the paper
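
For readers unfamiliar with ANLS, the main metric listed above, the following is a minimal sketch of how it is commonly computed: per question, take the best normalized Levenshtein similarity over the accepted answers, zero it out when the normalized distance reaches a threshold, then average. The 0.5 threshold and the lowercase/strip normalization follow the usual DocVQA convention and are assumptions here, since the summary does not state them.

```python
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def anls(predictions: Sequence[str],
         references: Sequence[Sequence[str]],
         threshold: float = 0.5) -> float:
    """Average Normalized Levenshtein Similarity in [0, 1]."""
    scores = []
    for pred, gts in zip(predictions, references):
        p = pred.strip().lower()
        best = 0.0
        for gt in gts:
            g = gt.strip().lower()
            denom = max(len(p), len(g))
            nl = levenshtein(p, g) / denom if denom else 0.0
            best = max(best, 1.0 - nl if nl < threshold else 0.0)
        scores.append(best)
    return sum(scores) / len(scores)


preds = ["$1,250.00", "acme corp"]
refs = [["$1,250.00"], ["Acme Corporation"]]
print(anls(preds, refs))  # (1.0 + 0.5625) / 2 = 0.78125
```

The scores in this summary are reported on a 0-100 scale, which corresponds to multiplying the value above by 100.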

Key Results

Impact of DocVLM on InternVL2-8B under strict visual token constraints (1 tile / 256 tokens):

Benchmark	Metric	Baseline	This Paper	Δ
DocVQA	ANLS	56.0	86.6	+30.6
InfoVQA	ANLS	35.3	63.7	+28.4

State-of-the-art comparison using Qwen2-VL-7B with DocVLM (576 visual tokens):

Benchmark	Metric	Baseline	This Paper	Δ
DocVQA	ANLS	91.4	92.8	+1.4
TextVQA	VQAScore	79.4	82.8	+3.4

Zero-shot generalization to multi-page documents (MP-DocVQA):

Benchmark	Metric	Baseline	This Paper	Δ
MP-DocVQA	ANLS	80.3	86.3	+6.0

Experiment Figures

Figure: Performance vs. visual token count trade-off for DocVLM vs. baselines (InternVL2, LLaVA, Qwen2-VL) on DocVQA.

Main Takeaways

DocVLM allows VLMs to operate in extremely low visual token regimes (e.g., 256 tokens) with performance comparable to or better than full-resolution models.
The method is model-agnostic, showing consistent gains across LLaVA, InternVL, and Qwen architectures.
The compressed OCR representation (64 queries) is sufficient to capture dense text and layout, outperforming raw OCR text insertion while being far more token-efficient.
Scales naturally to multi-page documents where context length is a bottleneck; 'Page-wise Encoding' (compressing each page to 64 tokens) yields SOTA results.
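
The page-wise encoding idea in the last takeaway can be sketched as a loop that compresses each page independently and concatenates the results, so the OCR token count grows as 64 per page rather than with the amount of text. `stub_compress` is a hypothetical placeholder for the OCR encoder plus query compression, and the page sizes are made up for illustration.

```python
from typing import Callable, List, Sequence

Token = List[float]


def encode_pages(pages: Sequence[Sequence[str]],
                 compress_page: Callable[[Sequence[str]], List[Token]],
                 queries_per_page: int = 64) -> List[Token]:
    """Compress each page on its own and concatenate the results in page order."""
    tokens: List[Token] = []
    for page in pages:
        page_tokens = compress_page(page)
        if len(page_tokens) != queries_per_page:
            raise ValueError("compressor must return a fixed number of tokens")
        tokens.extend(page_tokens)
    return tokens


def stub_compress(page_words: Sequence[str]) -> List[Token]:
    # Placeholder: returns 64 dummy vectors tagged with the page's word count.
    return [[float(len(page_words))] for _ in range(64)]


# Three pages with very different amounts of text (hypothetical sizes).
pages = [["w"] * 120, ["w"] * 800, ["w"] * 45]
doc_tokens = encode_pages(pages, stub_compress)
print(len(doc_tokens))  # 3 pages x 64 tokens = 192
```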

📚 Prerequisite Knowledge

Prerequisites

Vision-Language Models (architecture and tokenization)
Optical Character Recognition (OCR) systems
Transformer attention mechanisms (cross-attention)
Token compression techniques

Key Terms

OCR: Optical Character Recognition—technology that converts images of typed, handwritten, or printed text into machine-encoded text

VLM: Vision-Language Model—a model that combines computer vision and natural language processing to understand and generate content based on image and text inputs

LLM: Large Language Model—a deep learning algorithm that can recognize, summarize, translate, predict, and generate text

ANLS: Average Normalized Levenshtein Similarity—a metric commonly used in Visual Question Answering to measure the similarity between the predicted answer and the ground truth

CIDEr: Consensus-based Image Description Evaluation—a metric used to evaluate image captioning quality

learnable queries: Fixed vectors that act as 'slots' to aggregate information from a larger input source via attention mechanisms

prompt tuning: A technique where a small number of trainable parameters are added to the input prompt while keeping the rest of the model frozen

DocVQA: Document Visual Question Answering—a dataset for evaluating VQA on document images

DUDE: Document Understanding Dataset and Evaluation—a benchmark for multi-page document understanding