olmOCR 2: Unit Test Rewards for Document OCR

📝 Paper Summary

Optical Character Recognition (OCR) Document Understanding

olmOCR 2 improves document parsing by training a vision language model with reinforcement learning using synthetic, binary unit tests—like checking specific table values—as reward signals.

Core Problem

Standard OCR evaluation metrics like edit distance fail to account for valid variations in complex elements like tables or math formulas, and continuous scores do not correlate well with practical correctness.

Why it matters:

Floating elements like tables or figures lack a single definitive ground truth representation, meaning valid outputs are often penalized by rigid string matching
Continuous scoring functions often weight trivial errors (like caption placement) equally with critical content errors, failing to capture human-centric notions of correctness
Existing benchmarks struggle to reliably evaluate the conversion of math-heavy or multi-column scientific documents into linear text

Concrete Example: A math formula can be represented in LaTeX in multiple ways that render visually identical results. Edit distance penalizes a valid but different LaTeX string, whereas a unit test checking the visual rendering (via KaTeX) correctly identifies it as a match.

Key Novelty

Unit Test Rewards for RLVR (Reinforcement Learning with Verifiable Rewards)

Instead of training against a static text ground truth, the system generates executable 'unit tests' for each training document (e.g., 'Does the phrase X appear?', 'Is value Y in table cell Z?')
These binary pass/fail tests serve as the reward signal for reinforcement learning, allowing the model to optimize for functional correctness rather than strict string matching
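
A minimal sketch of what such pass/fail checks could look like, assuming simple whitespace- and case-insensitive substring matching. The function names and the normalization rule are illustrative assumptions, not the paper's implementation.

```python
def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so trivial formatting differences do not fail a test."""
    return " ".join(text.split()).lower()


def text_present(output: str, phrase: str) -> bool:
    """Pass if the phrase appears anywhere in the OCR output."""
    return normalize(phrase) in normalize(output)


def text_absent(output: str, phrase: str) -> bool:
    """Pass if the phrase (e.g. a header or footer that should be dropped) does not appear."""
    return not text_present(output, phrase)


def reading_order(output: str, first: str, second: str) -> bool:
    """Pass if both phrases appear and the first occurrence of `first` precedes that of `second`."""
    out = normalize(output)
    i = out.find(normalize(first))
    j = out.find(normalize(second))
    return i != -1 and j != -1 and i < j
```

Each check returns a plain boolean, so a document's tests can be scored without comparing the whole output against a single reference string.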

Evaluation Highlights

Achieves a +14.2 point overall improvement on olmOCR-Bench over the initial olmOCR release (February 2025)
Shows the largest gains in math formula conversion, table parsing, and multi-column layout handling relative to previous versions
Validates the effectiveness of dynamic temperature scaling, which prevents repetition loops while retaining the quality benefits of lower-temperature sampling

Breakthrough Assessment

8/10

Significant methodology shift from supervised text matching to RL-based functional verification for OCR. Addresses a fundamental flaw in OCR metrics and achieves SOTA results.

⚙️ Technical Details

Problem Definition

Setting: Convert rasterized document images (PDF pages) into clean, naturally ordered plain text/Markdown

Inputs: Image of a document page (1288px longest edge)
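
The resize arithmetic implied by the 1288px longest edge can be sketched as follows. The function name is hypothetical, and the sketch assumes the aspect ratio is preserved and that smaller pages are scaled up to the same target, which the summary does not state.

```python
def resize_to_longest_edge(width: int, height: int, target: int = 1288) -> tuple[int, int]:
    """Return (new_width, new_height) so the longest edge equals `target`,
    keeping the aspect ratio. Assumption: pages are scaled in both directions."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    scale = target / max(width, height)
    # Round to whole pixels; never let a very thin page collapse to 0 px.
    return max(1, round(width * scale)), max(1, round(height * scale))
```

For example, a US Letter page rasterized at 300 DPI (2550 x 3300 px) maps to 995 x 1288 px.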

Outputs: Linearized text representation (Markdown/YAML) including parsed tables and equations

Pipeline Flow

Input Processing (Resizing & Prompting)
Inference (VLM with Dynamic Temperature)
Output Formatting (YAML)

System Modules

Input Preprocessor

Prepare image and prompt for the model

Model or implementation: N/A

OCR Generation

Generate the textual representation of the document

Model or implementation: olmOCR-2-7B-1025 (Souped Qwen2.5-VL-7B)

Output Parser

Format the generation into final structured text

Model or implementation: Rule-based
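
As an illustration of a rule-based parser for this step, the sketch below splits a generation into metadata and body text. It assumes the model emits a YAML front-matter block delimited by `---` lines with flat `key: value` pairs; the delimiter convention and the field names in the usage example are assumptions, not details confirmed by the summary.

```python
def split_front_matter(generation: str) -> tuple[dict[str, str], str]:
    """Split a generation into (metadata, body).

    Assumes an optional front-matter block delimited by '---' lines that
    contains only flat 'key: value' pairs (no nesting, no lists).
    """
    text = generation.strip()
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text  # no front matter: everything is body text
    try:
        end = next(i for i in range(1, len(lines)) if lines[i].strip() == "---")
    except StopIteration:
        return {}, text  # unterminated block: treat the whole output as body
    meta: dict[str, str] = {}
    for line in lines[1:end]:
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    body = "\n".join(lines[end + 1:]).strip()
    return meta, body
```

Values are kept as strings here; a production parser would more likely use a full YAML library and validate field types.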

Novel Architectural Elements

Integration of dynamic temperature scaling directly into the inference loop to handle VLM repetition issues without sacrificing low-temperature quality
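
The summary does not spell out the mechanism, so the following is only one plausible reading: start at a low temperature and retry at a higher one when the output ends in a repetition loop. The loop detector, the temperature schedule, and the `generate` callable are all hypothetical.

```python
from typing import Callable, Sequence


def looks_repetitive(text: str, max_period: int = 40, min_span: int = 120) -> bool:
    """Heuristic: True if the text ends with one short unit repeated back-to-back
    for at least `min_span` characters."""
    for period in range(1, max_period + 1):
        repeats = -(-min_span // period)  # ceiling division
        span = period * repeats
        if len(text) >= span and text[-span:] == text[-period:] * repeats:
            return True
    return False


def generate_with_dynamic_temperature(
    generate: Callable[[float], str],
    temperatures: Sequence[float] = (0.1, 0.2, 0.4, 0.8),  # assumed schedule
) -> tuple[str, float]:
    """Try each temperature in order and return the first non-repetitive output.
    Falls back to the last attempt if every temperature loops."""
    if not temperatures:
        raise ValueError("at least one temperature is required")
    output = ""
    for temperature in temperatures:
        output = generate(temperature)
        if not looks_repetitive(output):
            return output, temperature
    return output, temperatures[-1]
```

Most pages would succeed on the first, lowest-temperature attempt in this scheme, so the extra sampling cost is paid only on pages that trigger a loop.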

Modeling

Base Model: Qwen2.5-VL-7B-Instruct

Training Method: Reinforcement Learning with Verifiable Rewards (RLVR) using Group Relative Policy Optimization (GRPO)

Objective Functions:

Purpose: Maximize functional correctness of OCR output.
Formally: Reward = fraction (0.0 to 1.0) of passing unit tests per document.

Purpose: Ensure valid formatting.
Formally: Binary reward for presence of EOS token.

Purpose: Ensure metadata capture.
Formally: Reward (0-1) for correct extraction of document metadata (e.g., language, rotation).
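
The three reward terms can be sketched as small scoring functions. The summary does not say how the terms are weighted or combined, so the sketch computes them separately; scoring metadata as the fraction of matching fields is an assumption, as are the function names.

```python
from typing import Mapping, Sequence


def unit_test_reward(results: Sequence[bool]) -> float:
    """Fraction of unit tests that passed for one completion (0.0 to 1.0)."""
    if not results:
        return 0.0
    return sum(results) / len(results)


def eos_reward(ended_with_eos: bool) -> float:
    """Binary reward: 1.0 if generation terminated with an EOS token."""
    return 1.0 if ended_with_eos else 0.0


def metadata_reward(predicted: Mapping[str, str], expected: Mapping[str, str]) -> float:
    """Assumed scoring: fraction of expected metadata fields predicted exactly."""
    if not expected:
        return 0.0
    hits = sum(1 for key, value in expected.items() if predicted.get(key) == value)
    return hits / len(expected)
```

For example, a completion that passes 3 of 4 tests receives a unit-test reward of 0.75.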

Adaptation: Full fine-tuning followed by model souping (averaging weights of 6 runs)
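
Model souping amounts to an element-wise mean over checkpoints. The sketch below uses plain Python lists as stand-ins for weight tensors and assumes a uniform average over runs with identical parameter names; the function name is illustrative.

```python
from typing import Mapping, Sequence


def soup(checkpoints: Sequence[Mapping[str, Sequence[float]]]) -> dict[str, list[float]]:
    """Uniformly average parameters across checkpoints that share the same keys."""
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    names = checkpoints[0].keys()
    for ckpt in checkpoints:
        if ckpt.keys() != names:
            raise ValueError("checkpoints must have identical parameter names")
    n = len(checkpoints)
    return {
        name: [sum(values) / n for values in zip(*(ckpt[name] for ckpt in checkpoints))]
        for name in names
    }
```

With real models the same loop would run over tensors in each run's state dictionary; with 6 runs, every souped weight is the mean of 6 values.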

Training Data:

olmOCR2-synthmix-1025: 2,186 PDF pages with 30,381 synthetic unit tests
Tests include: Text Presence, Text Absence, Reading Order, Table Accuracy, Math Formula Accuracy
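
A Table Accuracy test might check a cell's value relative to its neighbors rather than matching the table's exact markup. The sketch below assumes the model emits Markdown pipe tables without escaped pipes; the neighbor-based test format and function names are hypothetical.

```python
from typing import Optional


def parse_markdown_table(markdown: str) -> list[list[str]]:
    """Parse pipe-table rows into lists of cell strings, skipping the separator row."""
    rows = []
    for line in markdown.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue
        inner = line[1:-1] if line.endswith("|") else line[1:]
        cells = [cell.strip() for cell in inner.split("|")]
        if all(cell and set(cell) <= set("-:") for cell in cells):
            continue  # separator row such as | --- | :--: |
        rows.append(cells)
    return rows


def cell_has_neighbor(rows: list[list[str]], value: str, *,
                      left: Optional[str] = None, above: Optional[str] = None) -> bool:
    """Pass if some cell equals `value` and its left/above neighbors match when given."""
    for r, row in enumerate(rows):
        for c, cell in enumerate(row):
            if cell != value:
                continue
            left_ok = left is None or (c > 0 and row[c - 1] == left)
            above_ok = above is None or (
                r > 0 and c < len(rows[r - 1]) and rows[r - 1][c] == above
            )
            if left_ok and above_ok:
                return True
    return False
```

Checking relationships between cells keeps the test indifferent to column padding or alignment markers in the output.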

Key Hyperparameters:

kl_beta: 0.01
epochs: 1 (SFT) + 1 (RL)
completions_per_document: 28
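
To show how the completions sampled for one document feed GRPO, the sketch below computes group-relative advantages by centering each reward on the group mean. Dividing by the group standard deviation follows the common GRPO formulation and is an assumption about this paper's trainer; the function name and `eps` value are illustrative.

```python
from statistics import mean, pstdev
from typing import Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> list[float]:
    """Center rewards on the group mean and scale by the group standard deviation.
    `eps` avoids division by zero when every completion gets the same reward."""
    baseline = mean(rewards)
    spread = pstdev(rewards)
    return [(reward - baseline) / (spread + eps) for reward in rewards]
```

If all 28 completions for a page earn the same reward, every advantage is zero and that page contributes no policy-gradient signal, so pages whose tests separate good completions from bad ones drive the learning.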

Compute: 8xH100 GPU node

Comparison to Prior Work

vs. Infinity Parser: Uses binary unit tests (checking semantics/rendering) as rewards instead of edit distance and structural consistency metrics
vs. MinerU/Marker: End-to-end VLM approach rather than a pipeline of specialized models
vs. GPT-4o: Distills capabilities into a 7B model using synthetic data generated by a larger closed model

Limitations

Dependency on closed-source VLM (Claude) for generating synthetic ground truth and unit tests
Dynamic temperature scaling adds complexity to the inference loop
Training requires generating many completions (28 per document), which is compute-intensive
Blank page handling required specific bug fixes in data loading to prevent hallucinations

Reproducibility

Code: https://github.com/allenai/olmocr

Publicly available: model weights, training data (olmOCR-mix-1025, olmOCR2-synthmix-1025), and code. Data generation pipeline uses Claude-Sonnet-4-20250514 (closed source dependency).

📊 Experiments & Results

Evaluation Setup

OCR conversion of diverse PDF documents

Benchmarks:

olmOCR-Bench (Document-to-Text Conversion) [New]

Metrics:

Unit Test Pass Rate
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
olmOCR-Bench	Overall Score	Not reported in the paper	Not reported in the paper	+14.2

Experiment Figures

The synthetic data generation pipeline for creating training data and unit tests

Example of the reward signal used during RL training

Main Takeaways

Reinforcement learning with unit test rewards yields state-of-the-art performance, outperforming the initial olmOCR release by 14.2 points on olmOCR-Bench.
The use of binary unit tests is particularly effective for improving extraction of structured data like equations, tables, and multi-column layouts compared to text-matching baselines.
Switching output format from JSON to YAML significantly reduced generation retry rates and improved inference efficiency.
Standardizing the prompt order (text then image) between training and inference proved critical for optimal performance.

📚 Prerequisite Knowledge

Prerequisites

Vision Language Models (VLMs)
Reinforcement Learning (PPO/GRPO)
Optical Character Recognition (OCR)

Key Terms

RLVR: Reinforcement Learning with Verifiable Rewards—training models using objective, binary feedback (pass/fail) rather than a learned reward model

GRPO: Group Relative Policy Optimization—an RL algorithm that optimizes a policy by comparing a group of outputs for the same input, estimating baselines from the group average

Unit Tests: Specific, executable checks generated from ground truth (e.g., 'text exists', 'math renders correctly') used as binary reward signals

Souping: Model Souping—averaging the weights of multiple fine-tuned models (often trained with different seeds) to improve robustness and performance

KaTeX: A fast math typesetting library for the web; used here to verify if OCR'd LaTeX equations render visually identical to the ground truth

VLM: Vision Language Model—a multimodal AI model capable of processing both image and text inputs

SFT: Supervised Fine-Tuning—the initial phase of training a model on labeled examples before applying reinforcement learning