olmOCR: Unlocking Trillions of Tokens in PDFs with Vision Language Models

📝 Paper Summary

Document Parsing / OCR · Data Curation for LLMs · Vision Language Models (VLMs)

olmOCR is an open-source toolkit that converts PDFs into high-quality, linearized plain text using a fine-tuned 7B VLM, achieving state-of-the-art accuracy at significantly lower cost than proprietary APIs.

Core Problem

PDFs encode visual placement rather than logical text structure, making it difficult to extract coherent text for language model training using traditional tools or expensive proprietary VLMs.

Why it matters:

Training LMs on trillions of tokens requires high-quality data; noisy extraction from PDFs leads to training instabilities and poor downstream performance
Traditional tools (Tesseract) struggle with complex layouts, while SOTA proprietary models (GPT-4o) are prohibitively expensive ($6,000+ per million pages) for pretraining scales

Concrete Example: A PDF page containing a multi-column scientific paper with floating figures often results in garbled reading order (e.g., reading across columns) or missing formulas when processed by standard tools, whereas olmOCR linearizes it correctly into Markdown.

Key Novelty

Document-Anchored Distillation for PDF Linearization

Uses 'document-anchoring'—augmenting VLM prompts with raw, noisy text extracted from PDF metadata—to help a teacher model (GPT-4o) generate high-quality linearized ground truth
Distills this capability into a smaller, efficient open-source model (Qwen2-VL-7B) specialized for document processing, replacing the need for expensive API calls
Introduces a 'unit-test' based benchmark (olmOCR-Bench) that validates extraction using deterministic binary rules (e.g., 'does this exact formula exist?') rather than fuzzy text matching

Evaluation Highlights

Processes 1 million PDF pages for $176 USD, compared to ~$6,240 USD for GPT-4o (a ~35x cost reduction)
olmOCR-Bench includes 7,010 unit test cases across 1,402 documents, covering challenging elements like math formulas, tables, and tiny text
Outperforms Qwen-2.5-VL-7B and proprietary models like GPT-4o on the proposed benchmark (specific accuracy percentages not provided in snippet)
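
The cost comparison above is simple arithmetic. A minimal sketch that uses only the two per-million-page figures quoted in this summary; the function name and the assumption that cost scales linearly with page count are illustrative, not taken from the paper:

```python
# Per-million-page costs quoted in this summary (USD).
OLMOCR_COST_PER_M = 176
GPT4O_COST_PER_M = 6240


def cost_for_pages(num_pages: int, cost_per_million: float) -> float:
    """Linear estimate; assumes price scales proportionally with page count."""
    return num_pages / 1_000_000 * cost_per_million


ratio = GPT4O_COST_PER_M / OLMOCR_COST_PER_M
savings_per_million = GPT4O_COST_PER_M - OLMOCR_COST_PER_M

print(f"cost ratio: {ratio:.1f}x")
print(f"savings per 1M pages: ${savings_per_million}")
print(f"olmOCR cost for 100M pages: ${cost_for_pages(100_000_000, OLMOCR_COST_PER_M):,.0f}")
```

The ratio rounds to roughly 35x, matching the reduction stated above.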

Breakthrough Assessment

9/10

Solves a critical bottleneck in LLM data curation (high-quality PDF-to-text) with a fully open-source pipeline that rivals proprietary models in quality while drastically reducing cost.

⚙️ Technical Details

Problem Definition

Setting: End-to-end conversion of visual document pages (PDF/images) into linearized plain text (Markdown)

Inputs: Raster image of a PDF page (and optionally raw text anchors)

Outputs: Linearized plain text string formatted in Markdown, preserving reading order and structures like tables/math

Pipeline Flow

PDF Rasterization (convert page to image)
Document Anchoring (extract raw text hints)
VLM Inference (olmOCR-7B processes image+hints)
Structured Output (JSON/Markdown generation)
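
The four stages above can be read as a composition of functions. The sketch below is a hypothetical skeleton, not the olmocr API: every name (`PageInput`, `run_pipeline`, the `fake_*` stand-ins) is an assumption made for illustration, and the stand-in stages exist only so the example runs without a model or a PDF library.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class PageInput:
    image: bytes       # rasterized page (stage 1)
    anchor_text: str   # raw text hints pulled from the PDF (stage 2)


def run_pipeline(
    pdf_page: bytes,
    rasterize: Callable[[bytes], bytes],
    extract_anchors: Callable[[bytes], str],
    vlm_generate: Callable[[PageInput], str],
) -> str:
    """Rasterize, extract anchors, then let the VLM produce the final text."""
    image = rasterize(pdf_page)
    anchors = extract_anchors(pdf_page)
    # Stages 3 and 4: the model sees image + hints and returns Markdown.
    return vlm_generate(PageInput(image=image, anchor_text=anchors))


# Hypothetical stand-ins for the real components.
def fake_rasterize(page: bytes) -> bytes:
    return b"<image bytes>"


def fake_extract_anchors(page: bytes) -> str:
    return page.decode("utf-8", errors="ignore")


def fake_vlm(page_input: PageInput) -> str:
    return "# " + page_input.anchor_text.strip()


markdown = run_pipeline(b"Title of page\n", fake_rasterize, fake_extract_anchors, fake_vlm)
print(markdown)
```

Passing the stages in as callables keeps each one replaceable, which mirrors how the summary lists them as separate modules.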

System Modules

PDF Rasterizer (Input Processing)

Converts PDF pages into high-resolution images for the VLM

Model or implementation: Standard rendering tools (e.g., poppler/pdf2image)
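
The hyperparameters later in this summary list a maximum image dimension of 1024 pixels. One plausible reading is that the page is rendered so that its longest side is 1024 pixels with the aspect ratio preserved. The helper below sketches that computation; the function name and the rounding choice are assumptions, not the toolkit's code.

```python
def fit_longest_side(width: int, height: int, max_dim: int = 1024) -> tuple[int, int]:
    """Scale (width, height) so the longest side equals max_dim, keeping aspect ratio."""
    if width <= 0 or height <= 0:
        raise ValueError("dimensions must be positive")
    scale = max_dim / max(width, height)
    # Round to whole pixels and never return a zero-sized side.
    return max(1, round(width * scale)), max(1, round(height * scale))


# US Letter at 200 DPI: 8.5in x 11in -> 1700 x 2200 pixels.
print(fit_longest_side(1700, 2200))
```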

Anchor Extractor (Input Processing)

Extracts noisy text and layout metadata from the PDF to guide the VLM

Model or implementation: pypdf library

Content Extractor

Generates clean, linearized text from the visual input and text anchors

Model or implementation: olmOCR-7B-0225-preview (Qwen2-VL-7B-Instruct based)

Novel Architectural Elements

Integration of raw PDF text extraction (document anchors) directly into the VLM prompt as a guidance signal during inference; the fusion of symbolic and visual data happens at the prompt level, with no change to the model architecture
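
A minimal sketch of how positioned text spans might be serialized into a prompt. The prompt wording, the `[x,y]` line format, the `<anchors>` delimiters, and the character budget are all hypothetical choices made for illustration; the summary does not specify the format olmOCR uses.

```python
from dataclasses import dataclass


@dataclass
class TextSpan:
    x: float   # horizontal position in PDF points
    y: float   # vertical position in PDF points
    text: str  # raw (possibly noisy) text from the PDF


def build_anchored_prompt(spans, page_width, page_height, max_chars=2000):
    """Serialize spans into a text block, stopping once the budget is used up."""
    header = f"Page dimensions: {page_width:.1f}x{page_height:.1f}"
    lines = [header]
    used = len(header)
    for span in spans:
        line = f"[{span.x:.0f},{span.y:.0f}] {span.text}"
        if used + len(line) + 1 > max_chars:  # +1 for the newline
            break
        lines.append(line)
        used += len(line) + 1
    anchors = "\n".join(lines)
    return (
        "Transcribe the attached page image as Markdown in natural reading order. "
        "Raw text extracted from the PDF is given as a hint and may be noisy.\n"
        f"<anchors>\n{anchors}\n</anchors>"
    )


spans = [TextSpan(72, 700, "olmOCR"), TextSpan(72, 680, "Abstract")]
print(build_anchored_prompt(spans, 612, 792))
```

Capping the anchor block matters because the hints share the model's input window (8192 tokens in the configuration listed below) with the image tokens.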

Modeling

Base Model: Qwen2-VL-7B-Instruct

Training Method: Supervised Fine-Tuning (SFT) on synthetic data

Adaptation: Full fine-tuning

Training Data:

olmOCR-mix-0225: 260,000 PDF pages paired with GPT-4o generated text
Data sourced from 240M crawled web PDFs and Internet Archive books
Filtered for English, non-spam, and valid content

Key Hyperparameters:

learning_rate: 1e-6
batch_size: 4 (effective)
optimizer: AdamW
scheduler: Cosine annealing
training_steps: 10,000 (approx 1.2 epochs)
max_image_dimension: 1024 pixels
max_input_length: 8192 tokens
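
For reference, a cosine-annealing schedule with the listed peak learning rate and step count can be sketched as follows. The absence of warmup and the minimum learning rate of 0 are assumptions; the summary does not state either.

```python
import math

PEAK_LR = 1e-6        # learning_rate from the list above
TOTAL_STEPS = 10_000  # training_steps from the list above


def cosine_lr(step, peak_lr=PEAK_LR, total_steps=TOTAL_STEPS, min_lr=0.0):
    """Cosine annealing from peak_lr at step 0 down to min_lr at total_steps."""
    step = min(max(step, 0), total_steps)  # clamp to the schedule range
    progress = step / total_steps
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for s in (0, 2_500, 5_000, 7_500, 10_000):
    print(f"step {s:>6}: lr = {cosine_lr(s):.3e}")
```

The rate starts at the peak, passes through half the peak at the midpoint, and reaches the minimum at the final step.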

Compute: Training: 16 node hours on 8x NVIDIA H100 (80GB) GPUs (365 total node hours including experiments). Inference: Scalable via vLLM/SGLang.

Comparison to Prior Work

vs. Nougat: General-purpose (not just scientific), uses document anchoring for better fidelity
vs. GPT-4o: Significantly cheaper ($176 vs $6240 per 1M pages) and open weights
vs. Marker: Single end-to-end VLM approach rather than a pipeline of heuristic components
vs. GOT-OCR 2.0: Focuses on massive-scale batch processing and diverse document types (books, scans, forms)

Limitations

Reliance on GPT-4o for training data generation means the model inherits GPT-4o's biases or errors
Document anchoring requires PDFs with extractable internal text; purely image-based PDFs (scans without text layer) rely solely on vision capabilities
Evaluation is strictly binary (pass/fail) which may miss nuance in partial extractions

Reproducibility

Code: publicly available at https://github.com/allenai/olmocr

Artifacts released: Training data (olmOCR-mix-0225), Model (olmOCR-7B-0225-preview), Benchmark (olmOCR-Bench), and efficient inference pipeline code. Filtering heuristics code is also provided.

📊 Experiments & Results

Evaluation Setup

Evaluation on olmOCR-Bench, a suite of unit-tests for PDF extraction.

Benchmarks:

olmOCR-Bench (PDF Content Extraction & Linearization) [New]

Metrics:

Percentage of unit tests passed (Pass Rate)
Cost per million pages ($)
Statistical methodology: Not explicitly reported in the paper
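
The pass-rate metric is the share of unit tests that pass, expressed as a percentage. A minimal sketch, with hypothetical category labels and made-up outcomes used purely to show the computation (these are not results from the paper):

```python
from collections import defaultdict


def pass_rate(outcomes):
    """Percentage of passed tests; outcomes is an iterable of booleans."""
    outcomes = list(outcomes)
    if not outcomes:
        raise ValueError("no test outcomes given")
    return 100.0 * sum(outcomes) / len(outcomes)


def pass_rate_by_category(results):
    """results: iterable of (category, passed) pairs."""
    grouped = defaultdict(list)
    for category, passed in results:
        grouped[category].append(passed)
    return {category: pass_rate(values) for category, values in grouped.items()}


# Illustrative outcomes only.
results = [("math", True), ("math", False), ("tables", True), ("tables", True)]
overall = pass_rate(passed for _, passed in results)
by_category = pass_rate_by_category(results)
print(overall, by_category)
```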

Key Results

Cost analysis demonstrates extreme efficiency improvements over using proprietary models for large-scale processing.

Benchmark	Metric	Baseline	This Paper	Δ
N/A (Cost Analysis)	Cost per 1M pages ($)	6240	176	-6064
N/A	Training Pages	0	260000	+260000
olmOCR-Bench	Test Cases	0	7010	+7010

Experiment Figures

Illustration of raw PDF internal storage vs. logical structure

Main Takeaways

olmOCR achieves state-of-the-art performance on the proposed benchmark, outperforming general-purpose VLMs like GPT-4o and Qwen-2.5-VL (quantitative accuracy scores not provided in text snippet).
The 'document-anchoring' technique (injecting raw PDF text hints) measurably improves generation quality compared to vision-only approaches.
Training on linearized PDF data (olmOCR-peS2o) leads to observable downstream improvements in language model pretraining performance compared to baselines.
Cost reduction allows for processing typically inaccessible data scales (trillions of tokens) within reasonable academic/research budgets.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Vision Language Models (VLMs)
Familiarity with OCR (Optical Character Recognition) challenges
Basic knowledge of PDF file structure and rendering

Key Terms

linearization: The process of converting a 2D document layout (with columns, sidebars, floating figures) into a coherent 1D string of text that follows natural reading order
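
To see why linearization is hard, consider a two-column page. The toy example below contrasts a naive top-to-bottom sort, which reads across columns, with a column-aware sort. It illustrates the problem only: olmOCR relies on a VLM rather than a geometric heuristic like this, and the `Block` type and both functions are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Block:
    x: float   # left edge of the text block
    y: float   # top edge, measured downward from the top of the page
    text: str


def naive_order(blocks):
    """Top-to-bottom, then left-to-right: interleaves the two columns."""
    return [b.text for b in sorted(blocks, key=lambda b: (b.y, b.x))]


def two_column_order(blocks, page_width):
    """Read the entire left column first, then the right column."""
    mid = page_width / 2
    left = sorted((b for b in blocks if b.x < mid), key=lambda b: b.y)
    right = sorted((b for b in blocks if b.x >= mid), key=lambda b: b.y)
    return [b.text for b in left + right]


blocks = [
    Block(50, 100, "L1"), Block(320, 100, "R1"),
    Block(50, 200, "L2"), Block(320, 200, "R2"),
]
print(naive_order(blocks))
print(two_column_order(blocks, page_width=612))
```

The heuristic breaks as soon as a figure spans both columns or the column count changes mid-page, which is the kind of layout the summary says traditional tools struggle with.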

document-anchoring: A prompting technique where noisy text extracted from the PDF file's internal metadata is provided to the VLM alongside the page image to improve OCR accuracy and reduce hallucinations

VLM: Vision Language Model—a multimodal model capable of processing both images and text

OCR: Optical Character Recognition—the conversion of images of typed or handwritten text into machine-encoded text

pypdf: A Python library used to extract internal structure and metadata from PDF files

SGLang: A high-performance inference engine for large language models and VLMs, used here for efficient batch processing

unit-test: In this paper, a deterministic pass/fail check used for evaluation (e.g., 'Does the output contain the string X?', 'Is string A before string B?')
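
The two example checks in this definition can be sketched as plain string predicates. The function names are hypothetical and the logic is a simplified stand-in, not the benchmark's implementation.

```python
def text_present(output: str, needle: str) -> bool:
    """Pass if the exact string appears anywhere in the model output."""
    return needle in output


def text_before(output: str, first: str, second: str) -> bool:
    """Pass if `second` occurs somewhere after the first occurrence of `first`."""
    i = output.find(first)
    if i == -1:
        return False
    return output.find(second, i + len(first)) != -1


output = "# Title\n\nIntro paragraph.\n\nConclusion."
print(text_present(output, "Intro paragraph."))
print(text_before(output, "Intro", "Conclusion"))
print(text_before(output, "Conclusion", "Intro"))
```

Each check returns a boolean, so a benchmark score is the fraction of checks that return true.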

NFC format: Normalization Form C—a Unicode normalization standard used to ensure consistent text representation
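
A short illustration of why normalization matters for exact-match checks, using Python's standard `unicodedata` module. The helper name `same_text` is an assumption for this example.

```python
import unicodedata

composed = "caf\u00e9"     # "é" as a single code point
decomposed = "cafe\u0301"  # "e" followed by a combining acute accent


def same_text(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(composed == decomposed)           # raw comparison of code points
print(same_text(composed, decomposed))  # comparison after NFC normalization
```

The two strings render identically but differ in their code points, so a raw comparison fails while the normalized comparison succeeds.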

LaTeX: A typesetting system commonly used for scientific and mathematical documents; used here as a reference format for math formula tests