GLM-OCR Technical Report

📝 Paper Summary

Document Understanding · Multimodal Large Language Models (MLLMs) · Optical Character Recognition (OCR)

GLM-OCR is a compact 0.9B multimodal model that achieves state-of-the-art document understanding by integrating explicit layout analysis with a multi-token prediction mechanism for high-speed, structured text generation.

Core Problem

Existing Multimodal LLMs are too computationally heavy and slow for production deployment, while traditional OCR pipelines lack semantic understanding and struggle with complex, non-standard layouts.

Why it matters:

Real-world production systems (e.g., financial reporting) require high throughput and low latency, which large autoregressive models cannot provide
Standard token-by-token generation is inefficient for OCR, which is a deterministic task with strong local dependencies (e.g., table syntax)
Small-scale models typically suffer from hallucinations and repetition when processing complex layouts without explicit structural guidance

Concrete Example: When processing a complex table, a standard small VLM might hallucinate the reading order or generate malformed table markup because it does not plan the structure ahead. GLM-OCR avoids this by first cropping the table region (via layout analysis) and then using Multi-Token Prediction to generate coherent structural tags (e.g., `<td>...</td>`) in blocks, reducing syntax errors.

Key Novelty

Compact Layout-Aware MTP Architecture

Integrates a standalone layout analysis module (PP-DocLayout-V3) before the generative model to decompose complex pages into simpler regions, preventing reading order confusion
Employs Multi-Token Prediction (MTP) during both training and inference, allowing the model to predict multiple tokens per step (5.2 effective tokens/step) to boost speed and structural consistency
Unifies Document Parsing (transcription) and Key Information Extraction (KIE) into a single conditional generation framework handled by a compact 0.9B model

Architecture

The overall GLM-OCR system architecture, illustrating the Vision Encoder, LLM Decoder, and the Multi-Token Prediction heads.

Evaluation Highlights

Achieves 94.6 on OmniDocBench v1.5, ranking first among all evaluated models including larger general VLMs and specialized peers like PaddleOCR-VL-1.5
Attains 93.7 on Nanonets-KIE, setting a new state-of-the-art for open-source models and outperforming GPT-5.2 (87.5)
Delivers ~50% inference throughput improvement (5.2 tokens/step) via Multi-Token Prediction compared to standard autoregressive decoding

Breakthrough Assessment

8/10

Significant engineering achievement in efficiency/performance trade-offs. Successfully adapts Multi-Token Prediction to OCR and beats much larger models with a sub-1B parameter footprint.

⚙️ Technical Details

Problem Definition

Setting: Multimodal Document Understanding encompassing Document Parsing (image-to-markdown/JSON) and Key Information Extraction (image-to-fields)

Inputs: Document image $I$ and optional task prompt $P$

Outputs: Structured text sequence $Y$ (Markdown or JSON)

Pipeline Flow

Stage 1: Layout Analysis (PP-DocLayout-V3)
Stage 2: Parallel Region Recognition (GLM-OCR Core)
Stage 3: Merge & Post-Process
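
The three stages can be sketched as a small orchestration routine. This is a minimal illustration, not the project's code: `Region`, `analyze_layout`, `recognize_region`, and `parse_page` are hypothetical names, the two model calls are replaced by stubs that return fixed values, and a thread pool is only one possible way to realize the "parallel" part of Stage 2.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class Region:
    order: int    # reading-order index assigned by the layout stage
    kind: str     # e.g. "paragraph", "table", "formula"
    bbox: tuple   # (x0, y0, x1, y1) crop coordinates on the page


def analyze_layout(page):
    """Stage 1 stub: a real system would run the layout model on the page."""
    return [
        Region(order=1, kind="table", bbox=(0, 100, 800, 400)),
        Region(order=0, kind="paragraph", bbox=(0, 0, 800, 100)),
    ]


def recognize_region(page, region):
    """Stage 2 stub: a real system would crop the page and run the recognizer."""
    return f"[{region.kind} text from {region.bbox}]"


def parse_page(page, max_workers=4):
    regions = analyze_layout(page)
    # Regions are independent of each other, so they can be recognized concurrently.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        texts = list(pool.map(lambda r: recognize_region(page, r), regions))
    # Stage 3: restore reading order, then join the regional outputs.
    ordered = sorted(zip(regions, texts), key=lambda pair: pair[0].order)
    return "\n\n".join(text for _, text in ordered)


result = parse_page(page=None)
```

Because each region is recognized on its own, a failure or slowdown in one crop does not change the prompt or context of the others; the price is that the final reading order depends entirely on the layout stage.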

System Modules

Layout Analyzer

Detect and classify document regions (paragraphs, tables, formulas) to decompose complex layouts

Model or implementation: PP-DocLayout-V3

GLM-OCR Core

Generate structured text from visual regions using multi-token prediction

Model or implementation: 0.9B Total (0.4B CogViT Encoder + 0.5B GLM Decoder)

Merge & Post Process

Reassemble regional outputs into valid Markdown/JSON based on reading order

Model or implementation: Rule-based module

Novel Architectural Elements

Integration of Multi-Token Prediction (MTP) with shared parameters in the decoder specifically optimized for OCR's deterministic nature
Hybrid pipeline combining explicit layout detection model with a generative VLM to handle complex page structures via decomposition
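
For readers who want the MTP idea in symbols, one common way to write a multi-token prediction objective is sketched below. It is an illustration under assumptions, not the paper's stated loss: the token indexing $Y = (y_1, \dots, y_T)$, the per-offset weights $\lambda_j$, and the head notation $p_\theta^{(j)}$ are introduced only for this sketch, while $I$ and $P$ are the image and prompt from the Problem Definition.

$$
\mathcal{L}_{\mathrm{MTP}}(\theta) = -\sum_{t=0}^{T-1} \; \sum_{j=1}^{\min(k,\, T-t)} \lambda_j \, \log p_\theta^{(j)}\!\left(y_{t+j} \mid y_{\le t}, I, P\right)
$$

Here $p_\theta^{(j)}$ denotes the distribution produced for offset $j$, and $y_{\le 0}$ is the empty prefix. Setting $k = 1$ and $\lambda_1 = 1$ recovers the usual next-token cross-entropy, which is why MTP can be read as a strict extension of standard autoregressive training.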

Modeling

Base Model: GLM-OCR (0.9B parameters total)

Training Method: Multi-stage training: Alignment -> SFT -> RL (GRPO)

Objective Functions:

Purpose: Align vision and language features.
Formally: Joint optimization of image-text matching and autoregressive generation

Purpose: Enable fast inference.
Formally: Multi-Token Prediction (MTP) loss where auxiliary heads predict future tokens at offsets $t+1, \dots, t+k$

Purpose: Refine structured generation.
Formally: GRPO (Group Relative Policy Optimization) using accuracy and format-validity rewards
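
The core of GRPO is that each sampled output is scored relative to the other samples drawn for the same input. The sketch below shows that step with an accuracy reward and a format-validity reward. Everything specific in it is an assumption made for illustration: the reward functions, the JSON field names, summing the two rewards with equal weight, and using the population standard deviation with a small `eps`.

```python
import json
from statistics import mean, pstdev


def format_reward(output: str) -> float:
    """1.0 if the output parses as JSON, else 0.0 (stand-in format-validity check)."""
    try:
        json.loads(output)
        return 1.0
    except json.JSONDecodeError:
        return 0.0


def accuracy_reward(output: str, reference: dict) -> float:
    """Fraction of reference fields reproduced exactly; 0.0 for unparseable output."""
    try:
        predicted = json.loads(output)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(predicted, dict) or not reference:
        return 0.0
    hits = sum(predicted.get(key) == value for key, value in reference.items())
    return hits / len(reference)


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward by the mean and std of its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of three samples for one receipt image.
reference = {"total": "42.00", "currency": "USD"}
samples = [
    '{"total": "42.00", "currency": "USD"}',  # fully correct
    '{"total": "42.00", "currency": "EUR"}',  # valid JSON, one wrong field
    '{"total": "42.00", "currency": ',        # truncated, invalid JSON
]
rewards = [accuracy_reward(s, reference) + format_reward(s) for s in samples]
advantages = group_relative_advantages(rewards)
```

In this toy group the truncated output receives a negative advantage and the two parseable outputs receive positive ones, so the policy update would push probability mass away from malformed structure without needing a separate value model.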

Adaptation: Full fine-tuning of the 0.9B model

Trainable Parameters: 0.9B

Training Data:

Stage 1: Tens of billions of image-text pairs (MIM + CLIP)
Stage 2: Mixed data including document parsing, grounding, VQA, and curated OCR datasets

Key Hyperparameters:

MTP_tokens_per_step: 10 (trained to predict)
MTP_average_generation: 5.2 (realized at inference)
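
The gap between 10 predicted tokens and 5.2 realized tokens per step is what one would expect if only part of each multi-token draft is kept. The summary does not state the acceptance rule, so the sketch below assumes a speculative-decoding-style rule: keep the longest drafted prefix that agrees with the verified sequence. `accepted_length`, `decode_steps`, and `toy_draft` are hypothetical helpers, and the drafter is a toy that gets structural tags right and content tokens wrong.

```python
def accepted_length(drafted, reference):
    """Length of the longest prefix of `drafted` that agrees with `reference`."""
    n = 0
    for d, r in zip(drafted, reference):
        if d != r:
            break
        n += 1
    return n


def toy_draft(reference, pos, k):
    """Toy drafter: the first token is always right; later tokens are right
    only when they are structural tags (content tokens are guessed wrong)."""
    window = reference[pos:pos + k]
    return [
        tok if i == 0 or tok.startswith("<") else "?"
        for i, tok in enumerate(window)
    ]


def decode_steps(reference, draft_fn, k=10):
    """Count the steps needed to emit `reference` when each step drafts up to k tokens."""
    pos, steps = 0, 0
    while pos < len(reference):
        drafted = draft_fn(reference, pos, k)
        kept = accepted_length(drafted, reference[pos:pos + k])
        pos += max(kept, 1)  # every step yields at least one token
        steps += 1
    return steps


row = ["<tr>", "<td>", "12", "</td>", "<td>", "7", "</td>", "</tr>"]
steps = decode_steps(row, toy_draft, k=10)
tokens_per_step = len(row) / steps
```

Tokens per step is not the same thing as wall-clock speedup. The summary reports 5.2 tokens per step alongside a throughput gain of about 50%, which suggests that each multi-token step carries extra cost compared with a plain decoding step.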

Compute: Supports deployment on vLLM, SGLang, Ollama; suitable for edge devices due to 0.9B size

Comparison to Prior Work

vs. PaddleOCR-VL-1.5: GLM-OCR uses MTP for faster inference and achieves slightly higher parsing accuracy (94.6 vs 94.5)
vs. General MLLMs (e.g., Gemini-3-Pro): GLM-OCR is far smaller (0.9B parameters versus models with billions of parameters) yet competitive or superior on OCR benchmarks
vs. GOT-OCR-2.0: GLM-OCR integrates a dedicated layout analysis module rather than relying purely on end-to-end generation [not cited in paper]

Limitations

The explicit layout analysis step can become a point of failure: if the layout model mis-detects regions on novel document types, those errors propagate to recognition
Smaller parameter count (0.9B) may limit general world knowledge compared to large MLLMs, though sufficient for OCR
Performance on PubTabNet (85.2) trails the best competitor MinerU2.5 (88.4)

Reproducibility

Code: https://github.com/zai-org/GLM-OCR

Fine-tuning is supported via LLaMA-Factory. Model weights are implied to be available (open-weight model).

📊 Experiments & Results

Evaluation Setup

Comprehensive evaluation on both public academic benchmarks and in-house industrial scenarios covering parsing, KIE, and recognition.

Benchmarks:

OmniDocBench v1.5 (Document Parsing)
Nanonets-KIE (Key Information Extraction)
OCRBench (Text) (Text Recognition)
PubTabNet (Table Recognition)

Metrics:

Accuracy / Score (custom per benchmark)
Edit Distance
Inference Throughput
Statistical methodology: Not explicitly reported in the paper
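
Edit distance is the most mechanical of the metrics listed above, so a short reference implementation may help. This is a minimal sketch of character-level Levenshtein distance with one common normalization (dividing by the longer string's length). The exact normalization and tokenization used by each benchmark are not stated here, so treat both as assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of `a`
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or keep if equal)
            ))
        prev = curr
    return prev[-1]


def normalized_edit_distance(pred: str, ref: str) -> float:
    """Edit distance scaled to [0, 1]; 0.0 means an exact match."""
    if not pred and not ref:
        return 0.0
    return levenshtein(pred, ref) / max(len(pred), len(ref))
```

Lower is better for this metric, which is the opposite direction from the accuracy-style scores in the same list.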

Key Results

Document Parsing benchmarks show GLM-OCR achieving top-tier performance despite its small size, particularly excelling in general document structure recovery.

Benchmark	Metric	Baseline	This Paper	Δ
OmniDocBench v1.5	Overall Score	94.5	94.6	+0.1
OCRBench (Text)	Score	75.3	94.0	+18.7
PubTabNet	Score	88.4	85.2	-3.2

In Key Information Extraction (KIE), GLM-OCR outperforms open-source baselines and even surpasses some proprietary models.

Benchmark	Metric	Baseline	This Paper	Δ
Nanonets-KIE	Score	87.5	93.7	+6.2
Receipt KIE (In-house)	Score	83.5	94.5	+11.0

Experiment Figures

Comparison of OmniDocBench scores across various models.

Main Takeaways

GLM-OCR proves that a compact 0.9B model, when specialized with layout analysis and MTP, can outperform significantly larger general-purpose MLLMs on document tasks.
The Multi-Token Prediction mechanism is highly effective for OCR, delivering a ~50% increase in throughput without sacrificing accuracy.
The model generalizes well to real-world noisy scenarios, showing strong results in seal recognition (90.5) and handwritten text (87.0), areas where traditional OCR often fails.

📚 Prerequisite Knowledge

Prerequisites

Transformer architecture (Vision Encoders and Language Decoders)
Autoregressive generation vs. Speculative decoding
Basic OCR concepts (Layout Analysis, Text Recognition)

Key Terms

MTP: Multi-Token Prediction—a decoding strategy where the model predicts multiple future tokens simultaneously per step to speed up inference

KIE: Key Information Extraction—identifying and extracting specific entities (e.g., 'Total Amount') from documents into structured formats like JSON

GRPO: Group Relative Policy Optimization—a reinforcement learning algorithm used here to align the model's structured outputs with correctness rewards

CogViT: A Vision Transformer variant used as the visual encoder in the GLM framework

MIM: Masked Image Modeling—a pre-training objective where parts of an image are masked and the model must reconstruct them

SFT: Supervised Fine-Tuning—training the model on labeled task data after pre-training