
olmOCR: Unlocking Trillions of Tokens in PDFs with Vision Language Models

Jake Poznanski, Jon Borchardt, Jason Dunkelberger, Regan Huff, Dan Lin, Aman Rangapur, Christopher Wilhelm, Kyle Lo, Luca Soldaini
Allen Institute for AI, Seattle, USA
arXiv.org (2025)
MM · Benchmark · Pretraining

📝 Paper Summary

Document Parsing / OCR · Data Curation for LLMs · Vision Language Models (VLMs)
olmOCR is an open-source toolkit that converts PDFs into high-quality, linearized plain text using a fine-tuned 7B VLM, achieving state-of-the-art accuracy at significantly lower cost than proprietary APIs.
Core Problem
PDFs encode where text is placed on the page rather than its logical structure, so traditional tools struggle to extract coherent text for language model training, and proprietary VLMs that handle it well are too expensive to run at scale.
Why it matters:
  • Training LMs on trillions of tokens requires high-quality data; noisy extraction from PDFs leads to training instabilities and poor downstream performance
  • Traditional tools (e.g., Tesseract) struggle with complex layouts, while SOTA proprietary models (e.g., GPT-4o) are prohibitively expensive ($6,000+ per million pages) at pretraining scale
Concrete Example: When standard tools process a multi-column scientific paper with floating figures, the output often has garbled reading order (e.g., text read straight across columns) or missing formulas, whereas olmOCR linearizes the same page correctly into Markdown.
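The reading-order failure described above can be reproduced with a toy example. The sketch below is only an illustration of the failure mode, not olmOCR's method (olmOCR uses a VLM, not coordinate sorting). The `Fragment` type, the coordinates, and the fixed midpoint column split are all assumptions made for this example.

```python
from typing import NamedTuple


class Fragment(NamedTuple):
    """A positioned text fragment, as a PDF might store it (hypothetical type)."""
    x: float  # left edge of the fragment
    y: float  # distance from the top of the page
    text: str


def naive_order(frags):
    """Read top-to-bottom, then left-to-right, ignoring column boundaries."""
    return [f.text for f in sorted(frags, key=lambda f: (f.y, f.x))]


def column_order(frags, page_width):
    """Read the whole left column first, then the right column.

    Splitting at the page midpoint is a simplifying assumption.
    """
    mid = page_width / 2
    return [f.text for f in sorted(frags, key=lambda f: (f.x >= mid, f.y, f.x))]


# A two-column page with two lines per column.
page = [
    Fragment(50, 100, "Left 1"),
    Fragment(350, 100, "Right 1"),
    Fragment(50, 120, "Left 2"),
    Fragment(350, 120, "Right 2"),
]

print(naive_order(page))                   # ['Left 1', 'Right 1', 'Left 2', 'Right 2']
print(column_order(page, page_width=600))  # ['Left 1', 'Left 2', 'Right 1', 'Right 2']
```

The naive ordering interleaves the two columns, which is the "reading across columns" error. Real pages add floating figures, footnotes, and irregular columns, which is why a fixed rule like the midpoint split does not generalize.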
Key Novelty
Document-Anchored Distillation for PDF Linearization
  • Uses 'document-anchoring' (augmenting VLM prompts with raw, noisy text extracted from PDF metadata) to help a teacher model (GPT-4o) generate high-quality linearized ground truth
  • Distills this capability into a smaller, efficient open-source model (Qwen2-VL-7B) specialized for document processing, replacing the need for expensive API calls
  • Introduces a unit-test-based benchmark (olmOCR-Bench) that validates extraction using deterministic binary rules (e.g., 'does this exact formula exist?') rather than fuzzy text matching
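Two of the ideas above, document-anchoring and binary unit-test checks, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the toolkit's actual code: the `TextBlock` type, the prompt wording, the `RAW_TEXT_START`/`RAW_TEXT_END` markers, and the function names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TextBlock:
    """Raw text with its page position (hypothetical structure)."""
    x: float
    y: float
    text: str


def build_anchored_prompt(blocks: list[TextBlock], max_chars: int = 2000) -> str:
    """Attach raw, possibly noisy PDF text to the instruction sent with the page image.

    The prompt wording and markers are assumptions for illustration only.
    """
    lines = [f"[{b.x:.0f}x{b.y:.0f}] {b.text}" for b in blocks]
    anchor = "\n".join(lines)[:max_chars]  # cap the anchor text length
    return (
        "Below is raw text extracted from the PDF for the attached page image.\n"
        "Return the page as plain text in natural reading order.\n"
        "RAW_TEXT_START\n" + anchor + "\nRAW_TEXT_END"
    )


def normalize(s: str) -> str:
    """Collapse whitespace and lowercase so checks ignore trivial differences."""
    return " ".join(s.split()).lower()


def text_present(output: str, expected: str) -> bool:
    """Binary check: does the expected string appear in the model output?"""
    return normalize(expected) in normalize(output)


def text_before(output: str, first: str, second: str) -> bool:
    """Binary check: do both strings appear, with `first` ahead of `second`?"""
    out = normalize(output)
    i, j = out.find(normalize(first)), out.find(normalize(second))
    return i != -1 and j != -1 and i < j


prompt = build_anchored_prompt(
    [TextBlock(72, 700, "Abstract"), TextBlock(72, 680, "We study OCR.")]
)
print(prompt)
print(text_present("E = mc^2  holds", "e = mc^2 holds"))        # True
print(text_before("Methods then Intro", "intro", "methods"))     # False
```

Each check returns a plain pass or fail, so a benchmark score is simply the fraction of checks passed, with no similarity threshold to tune.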
Evaluation Highlights
  • Processes 1 million PDF pages for about $176, compared to about $6,240 with GPT-4o (a roughly 35x cost reduction)
  • olmOCR-Bench includes 7,010 unit test cases across 1,402 documents, covering challenging elements like math formulas, tables, and tiny text
  • Outperforms Qwen2.5-VL-7B and proprietary models like GPT-4o on the proposed benchmark (specific accuracy percentages are not reported in this summary)
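The cost figures above can be checked with simple arithmetic. The sketch below uses only the two per-million-page prices quoted in this summary; the function names and the 10-million-page scenario are assumptions added for illustration.

```python
# Prices per one million pages, as quoted in this summary (USD).
OLMOCR_USD_PER_MILLION_PAGES = 176
GPT4O_USD_PER_MILLION_PAGES = 6240


def cost_ratio(baseline: float, candidate: float) -> float:
    """How many times more expensive the baseline is than the candidate."""
    return baseline / candidate


def cost_for_pages(pages: int, usd_per_million: float) -> float:
    """Linear extrapolation of cost to an arbitrary page count (an assumption)."""
    return pages / 1_000_000 * usd_per_million


ratio = cost_ratio(GPT4O_USD_PER_MILLION_PAGES, OLMOCR_USD_PER_MILLION_PAGES)
print(f"{ratio:.1f}x")  # 35.5x

# Hypothetical 10-million-page corpus.
print(cost_for_pages(10_000_000, OLMOCR_USD_PER_MILLION_PAGES))  # 1760.0
print(cost_for_pages(10_000_000, GPT4O_USD_PER_MILLION_PAGES))   # 62400.0
```

The computed ratio of about 35.5 is consistent with the "~35x" figure quoted above.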
Breakthrough Assessment
9/10
Solves a critical bottleneck in LLM data curation (high-quality PDF-to-text) with a fully open-source pipeline that rivals proprietary models in quality while drastically reducing cost.