
olmOCR 2: Unit Test Rewards for Document OCR

Jake Poznanski, Luca Soldaini, Kyle Lo
Allen Institute for AI
arXiv.org (2025)
MM RL Benchmark

📝 Paper Summary

Optical Character Recognition (OCR) Document Understanding
olmOCR 2 improves document parsing by training a vision-language model with reinforcement learning, using synthetic binary unit tests (such as checks on specific table values) as reward signals.
Core Problem
Standard OCR evaluation metrics like edit distance fail to account for valid variations in complex elements like tables or math formulas, and continuous scores do not correlate well with practical correctness.
Why it matters:
  • Floating elements like tables or figures lack a single definitive ground truth representation, meaning valid outputs are often penalized by rigid string matching
  • Continuous scoring functions often weight trivial errors (like caption placement) equally with critical content errors, failing to capture human-centric notions of correctness
  • Existing benchmarks struggle to reliably evaluate the conversion of math-heavy or multi-column scientific documents into linear text
Concrete Example: The same math formula can be written in LaTeX in several ways that render identically. Edit distance penalizes a valid but different LaTeX string, whereas a unit test that compares the visual rendering (via KaTeX) correctly identifies a match.
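To make the mismatch concrete, the sketch below computes a plain Levenshtein distance between two LaTeX spellings of the same fraction. The two strings and the `levenshtein` helper are illustrative choices, not taken from the paper, and the rendering comparison itself (KaTeX) is not reproduced here.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete a character from a
                curr[j - 1] + 1,     # insert a character into a
                prev[j - 1] + cost,  # substitute (free if characters match)
            ))
        prev = curr
    return prev[-1]


# Two spellings of "one half" that are assumed to render as the same fraction.
reference = r"\frac{1}{2}"
candidate = r"{1 \over 2}"

# A nonzero distance means string matching penalizes the candidate,
# even though a rendering-based check would accept it.
distance = levenshtein(reference, candidate)
print(distance)
```

A string metric has no way to know that the two spellings are interchangeable, which is the gap a rendering-based unit test is meant to close.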
Key Novelty
Unit Test Rewards for RLVR (Reinforcement Learning with Verifiable Rewards)
  • Instead of training against a static text ground truth, the system generates executable 'unit tests' for each training document (e.g., 'Does the phrase X appear?', 'Is value Y in table cell Z?')
  • These binary pass/fail tests serve as the reward signal for reinforcement learning, allowing the model to optimize for functional correctness rather than strict string matching
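A minimal sketch of how such tests could be turned into a reward is shown below. Everything here is hypothetical: the `UnitTest` class, the `text_present` and `table_cell_equals` helpers, the use of Markdown tables, and the choice to aggregate results as the fraction of tests passed are assumptions made for illustration, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class UnitTest:
    name: str
    check: Callable[[str], bool]  # takes the model's output, returns pass/fail


def parse_markdown_table(text: str) -> List[List[str]]:
    """Collect the rows of pipe-delimited lines, skipping the |---|---| separator."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue
        cells = [c.strip() for c in line.strip("|").split("|")]
        if all(c and set(c) <= set("-: ") for c in cells):
            continue  # separator row
        rows.append(cells)
    return rows


def text_present(phrase: str) -> UnitTest:
    return UnitTest(f"text_present:{phrase}", lambda out: phrase in out)


def table_cell_equals(row: int, col: int, value: str) -> UnitTest:
    def check(out: str) -> bool:
        rows = parse_markdown_table(out)
        # Out-of-range coordinates count as a failed test, not an error.
        return row < len(rows) and col < len(rows[row]) and rows[row][col] == value
    return UnitTest(f"table_cell:{row},{col}={value}", check)


def reward(output: str, tests: List[UnitTest]) -> float:
    """Assumed aggregation: fraction of binary tests that pass."""
    if not tests:
        return 0.0
    return sum(t.check(output) for t in tests) / len(tests)


model_output = (
    "Revenue grew in 2024.\n\n"
    "| Year | Revenue |\n|---|---|\n| 2023 | 10 |\n| 2024 | 14 |\n"
)
tests = [
    text_present("Revenue grew"),
    table_cell_equals(2, 1, "14"),  # passes
    table_cell_equals(1, 1, "11"),  # fails: the cell holds "10"
]
score = reward(model_output, tests)
print(score)
```

Because each check inspects only one property of the output, a page can pass regardless of how the rest of the text is formatted, which is what lets the reward tolerate valid variations.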
Evaluation Highlights
  • Achieves a +14.2 point overall improvement on olmOCR-Bench compared to the initial olmOCR release (February 2025)
  • Shows the largest gains over previous versions in math formula conversion, table parsing, and multi-column layout handling
  • Shows that dynamic temperature scaling prevents repetition loops while retaining the quality benefits of lower-temperature sampling
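The summary does not describe how dynamic temperature scaling works. One plausible reading, sketched below purely as an assumption, is to sample at a low temperature first and retry at higher temperatures only when the output appears stuck in a loop. The n-gram loop detector, the temperature ladder, and the `generate` callable are all hypothetical.

```python
from typing import Callable, Dict, Sequence, Tuple


def has_repetition_loop(text: str, ngram: int = 5, max_repeats: int = 4) -> bool:
    """Heuristic: flag text in which some word n-gram occurs more than max_repeats times."""
    words = text.split()
    counts: Dict[Tuple[str, ...], int] = {}
    for i in range(len(words) - ngram + 1):
        key = tuple(words[i:i + ngram])
        counts[key] = counts.get(key, 0) + 1
        if counts[key] > max_repeats:
            return True
    return False


def generate_with_temperature_ladder(
    generate: Callable[[float], str],
    temperatures: Sequence[float] = (0.1, 0.3, 0.6, 0.9),
) -> Tuple[str, float]:
    """Try the lowest temperature first; escalate only if the output loops."""
    last = ""
    for t in temperatures:
        last = generate(t)
        if not has_repetition_loop(last):
            return last, t
    # Every attempt looped: return the final attempt as a fallback.
    return last, temperatures[-1]


def fake_generate(temperature: float) -> str:
    """Stand-in for a model call: loops at low temperature, clean otherwise."""
    if temperature < 0.5:
        return "the same line again and " * 20
    return "A clean transcription of the page."


text, used_temperature = generate_with_temperature_ladder(fake_generate)
print(used_temperature, text)
```

Under this reading, most pages would be decoded at the lowest temperature, and only pages that trigger the loop detector would pay for extra sampling.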
Breakthrough Assessment
8/10
A significant methodological shift from supervised text matching to RL-based functional verification for OCR. It addresses a fundamental flaw in OCR metrics and achieves state-of-the-art results.