LightOnOCR: A 1B End-to-End Multilingual Vision-Language Model for State-of-the-Art OCR

Said Taghadouini, Adrien Cavailles, Baptiste Aubertin
LightOn
arXiv.org (2026)
MM Pretraining RL Benchmark

📝 Paper Summary

Document Parsing, Optical Character Recognition (OCR), Vision-Language Models (VLMs)
LightOnOCR is a compact 1B-parameter vision-language model that achieves state-of-the-art OCR by training end-to-end on high-resolution renders and refining specific failure modes (math, loops, localization) via Reinforcement Learning with Verifiable Rewards.
Core Problem
Traditional OCR relies on brittle multi-stage pipelines (detection, layout analysis, recognition) that are hard to adapt, while existing end-to-end models are often costly to run and struggle with scientific notation and dense layouts.
Why it matters:
  • Pipelines couple multiple components, making improvements costly (e.g., fixing table extraction requires re-annotating intermediate representations)
  • Scientific PDFs mix dense typography, math, and noisy scans, which general-purpose OCR systems frequently misinterpret
  • End-to-end VLMs offer continuous improvement via fine-tuning but typically require massive parameter counts to handle complex documents effectively
Concrete Example: Scientific PDFs often contain complex math formulas mixed with text. Standard pipelines might fragment the formula or misread symbols. LightOnOCR-2 uses specific TeX-based supervision to generate clean LaTeX code directly from pixels, avoiding the fragmentation issues of detection-based systems.
Key Novelty
Compact End-to-End OCR with RLVR Refinement
  • Integrates OCR and image localization into a single 1B-parameter model using a high-resolution (1540px) vision encoder and a specialized decoder
  • Applies Reinforcement Learning with Verifiable Rewards (RLVR) to fix non-differentiable failure modes like repetition loops, math rendering errors, and bounding box inaccuracies
  • Uses a 'resume strategy' for bounding box training: introducing coordinate supervision midway through pretraining to add localization capabilities without degrading text recognition
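To make the idea of a verifiable reward concrete, here is a minimal sketch of a rule-based check for repetition loops. The summary does not specify the paper's actual reward functions, so the function names, the 4-gram window, and the 0.5 threshold below are all illustrative assumptions rather than the authors' method.

```python
from collections import Counter


def repeated_ngram_fraction(text: str, n: int = 4) -> float:
    """Fraction of word n-grams that duplicate another n-gram in the text."""
    words = text.split()
    if len(words) < n:
        return 0.0
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    counts = Counter(ngrams)
    # Every occurrence beyond the first counts as a duplicate.
    duplicates = sum(c - 1 for c in counts.values())
    return duplicates / len(ngrams)


def loop_reward(text: str, n: int = 4, threshold: float = 0.5) -> float:
    """Hypothetical binary reward: 1.0 if the output looks loop-free, else 0.0."""
    return 1.0 if repeated_ngram_fraction(text, n) < threshold else 0.0


clean = "the quick brown fox jumps over the lazy dog"
looped = " ".join(["total"] * 10)
print(loop_reward(clean), loop_reward(looped))
```

Because the check is a deterministic function of the generated text, it can score sampled outputs during RL even though it provides no gradient of its own, which is the sense in which such failure modes are non-differentiable.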
Evaluation Highlights
  • Achieves state-of-the-art results on OlmOCR-Bench, outperforming 9B-scale baselines despite being 9x smaller
  • Scales the training mix to 43M pages (2.5x larger than v1) and uses high-resolution 1540px input, which improves legibility for dense scientific content
  • Introduces LightOnOCR-bbox-bench, a new benchmark for document image localization, reporting F1 and IoU metrics
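For readers unfamiliar with the localization metrics, the sketch below shows how IoU and a box-level F1 are commonly computed. The benchmark's actual matching protocol and IoU threshold are not given in this summary; the greedy one-to-one matching and the 0.5 threshold here are assumptions, and the function names are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def detection_f1(pred: List[Box], gold: List[Box], threshold: float = 0.5) -> float:
    """F1 under greedy one-to-one matching at an assumed IoU threshold."""
    unmatched = list(gold)
    true_positives = 0
    for p in pred:
        # Pick the best still-unmatched ground-truth box for this prediction.
        best = max(unmatched, key=lambda g: iou(p, g), default=None)
        if best is not None and iou(p, best) >= threshold:
            unmatched.remove(best)
            true_positives += 1
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


gold_boxes = [(0.0, 0.0, 2.0, 2.0)]
pred_boxes = [(0.0, 0.0, 2.0, 2.0), (10.0, 10.0, 12.0, 12.0)]
print(iou((0, 0, 2, 2), (1, 1, 3, 3)), detection_f1(pred_boxes, gold_boxes))
```

In this toy case one of two predictions matches the single ground-truth box, so precision is 0.5 and recall is 1.0.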
Breakthrough Assessment
8/10
Strong engineering contribution demonstrating that compact (1B) models can beat much larger ones in OCR via better data curation (nvpdftex) and targeted RLVR, efficiently unifying text extraction and localization.