OLMoTrace: Tracing Language Model Outputs Back to Trillions of Training Tokens

📝 Paper Summary

Data attribution, Fact checking, Interpretability

OLMoTrace is a real-time system that traces language model outputs back to verbatim matches in multi-trillion-token training corpora using a fast parallel suffix array search.

Core Problem

Tracing LM outputs to training data is critical for understanding behavior, but existing methods like influence functions are too computationally expensive to scale to multi-trillion-token corpora in real time.

Why it matters:

Users need to understand why models generate specific responses, especially in high-stakes scenarios like fact-checking or detecting hallucination
Current behavior tracing methods are too slow or resource-heavy for interactive use with modern massive datasets (trillions of tokens)
Fully open models provide data access, but lack efficient tools to navigate that data verbatim at scale

Concrete Example: If a model outputs "The Space Needle was built for the 1962 World's Fair", a user might want to know whether this is a hallucination or a memorized fact. Without an index, scanning 3.2 billion documents for this exact string is impractical at interactive speed; OLMoTrace finds the original web source within seconds.

Key Novelty

Real-time Verbatim Training Data Tracing via Suffix Arrays

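Indexes the entire multi-trillion-token training corpus with a suffix array (via infini-gram), so that checking whether a span exists takes a binary search over the index rather than a scan of the corpus
Uses a novel parallel algorithm to find all maximal matching spans in a generated response: for each suffix of the response, a single suffix array query yields the longest prefix that occurs in the corpus, and these queries run independently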
Filters matches by 'span unigram probability' rather than just length to identify unique, statistically significant phrases

Architecture

The five-step inference pipeline of OLMoTrace

Evaluation Highlights

Achieves average inference latency of 4.46 seconds per query on OLMo-2-32B-Instruct responses (~450 tokens), enabling real-time user interaction
Retrieves documents with high relevance to the generation: 14% of retrieved documents are classified as 'high relevance' (BM25 score ≥ 0.7)
Strong agreement between human and GPT-4o relevance judgments of the retrieved documents (Spearman correlation of 0.73)

Breakthrough Assessment

8/10

First system to demonstrate real-time verbatim tracing against multi-trillion-token corpora. While limited to verbatim matches (not semantic influence), the engineering scale and speed enable a new class of interactive analysis tools.

⚙️ Technical Details

Problem Definition

Setting: Given an LM output sequence S and a massive training corpus T, find all maximal contiguous spans (substrings) of S that appear verbatim in T.

Inputs: Language model output response S (tokenized)

Outputs: List of maximal matching text spans and their source documents from the training data
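To make the definition concrete, the following is a brute-force reference sketch, not the system's algorithm. It assumes token lists for both the response and the corpus, uses half-open (start, end) indices, and the function names are hypothetical. It scans the corpus directly, which is only workable for toy inputs.

```python
def contains(corpus, span):
    """Return True if `span` occurs contiguously in `corpus` (token lists)."""
    n, m = len(corpus), len(span)
    return any(corpus[i:i + m] == span for i in range(n - m + 1))


def maximal_spans(response, corpus):
    """Return half-open (start, end) spans of `response` that occur in
    `corpus` and cannot be extended left or right while still occurring."""
    spans = []
    prev_end = 0
    for start in range(len(response)):
        # Grow the match from `start` until the next token breaks it.
        end = start
        while end < len(response) and contains(corpus, response[start:end + 1]):
            end += 1
        # Keep the span only if it is non-empty and not nested inside the
        # match found from an earlier start (otherwise it extends left).
        if end > start and end > prev_end:
            spans.append((start, end))
        prev_end = max(prev_end, end)
    return spans


corpus = "the space needle was built in seattle".split()
response = "the space needle is in seattle".split()
spans = maximal_spans(response, corpus)
# spans == [(0, 3), (4, 6)]: "the space needle" and "in seattle"
```

The nested-span check relies on the fact that the end of the longest match never moves backwards as the start position advances.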

Pipeline Flow

Step 1: Maximal Span Search (Parallel Suffix Array Query)
Step 2: Span Filtering (Uniqueness Scoring)
Step 3: Document Retrieval (Snippet Extraction)
Step 4: Merging & Presentation
Step 5: Reranking (BM25)
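The summary does not spell out how Step 4 merges spans. One plausible reading, assumed here, is that spans overlapping in the response are combined into a single highlighted region for display. A minimal sketch under that assumption, with a hypothetical function name and half-open token intervals:

```python
def merge_spans(spans):
    """Merge overlapping half-open (start, end) intervals.

    Spans that merely touch, such as (0, 4) and (4, 6), are kept separate;
    that choice is an assumption of this sketch.
    """
    merged = []
    for start, end in sorted(spans):
        if merged and start < merged[-1][1]:
            # Overlaps the previous region: extend it if needed.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


regions = merge_spans([(5, 9), (0, 4), (3, 6), (12, 15)])
# regions == [(0, 9), (12, 15)]
```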

System Modules

Maximal Span Finder (Search & Retrieval)

Identify all text spans in the output that exist verbatim in the training corpus

Model or implementation: Custom parallel algorithm over infini-gram index
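As a rough illustration of the kind of lookup a suffix array index supports, the sketch below builds a toy in-memory suffix array over a token list and finds the range of suffixes that start with a query. This is not the infini-gram API; the function names are hypothetical, and the naive construction only suits tiny inputs.

```python
def build_suffix_array(tokens):
    """Start positions of all suffixes, sorted lexicographically (naive)."""
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def find_range(query, tokens, sa):
    """Return the half-open range [lo, hi) of `sa` whose suffixes start
    with `query`. An empty range (lo == hi) means the query is absent, and
    `lo` is then the position where it would be inserted."""
    m = len(query)

    def prefix(rank):
        return tokens[sa[rank]:sa[rank] + m]

    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose prefix is >= query
        mid = (lo + hi) // 2
        if prefix(mid) < query:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(sa)
    while lo < hi:  # first suffix whose prefix is > query
        mid = (lo + hi) // 2
        if prefix(mid) <= query:
            lo = mid + 1
        else:
            hi = mid
    return start, lo


tokens = "a b c a b d".split()
sa = build_suffix_array(tokens)
lo, hi = find_range(["a", "b"], tokens, sa)
positions = sorted(sa[lo:hi])  # [0, 3]: where "a b" occurs in the corpus
```

Each lookup is two binary searches, so its cost grows with the logarithm of the corpus size, which is what makes the approach viable for very large corpora.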

Span Filter (Filtering & Ranking)

Select the most interesting spans to show the user

Model or implementation: Statistical heuristic
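The heuristic can be sketched as follows, assuming unigram probabilities are estimated from raw token counts and working in log space to avoid underflow. The function names, the fallback value for unseen tokens, and the cutoff `k` are assumptions of this sketch, not values from the paper.

```python
import math
from collections import Counter


def unigram_log_probs(corpus_tokens):
    """Log of each token's relative frequency in the corpus."""
    counts = Counter(corpus_tokens)
    total = sum(counts.values())
    return {tok: math.log(c / total) for tok, c in counts.items()}


def span_log_prob(span, log_probs, unseen_log_prob=math.log(1e-9)):
    """Log of the product of unigram probabilities of the span's tokens."""
    return sum(log_probs.get(tok, unseen_log_prob) for tok in span)


def most_distinctive(spans, log_probs, k):
    """Keep the k spans with the lowest span unigram probability."""
    return sorted(spans, key=lambda s: span_log_prob(s, log_probs))[:k]


corpus = "the cat sat on the mat the end".split()
log_probs = unigram_log_probs(corpus)
spans = [["the", "the"], ["cat", "mat"], ["the", "cat"]]
top = most_distinctive(spans, log_probs, k=1)
# top == [["cat", "mat"]]: rare tokens make the span more distinctive
```

Because every extra token multiplies the probability by a factor below one, this score favors spans that are both long and made of rare tokens, rather than length alone.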

Document Retriever (Search & Retrieval)

Fetch actual document text for the selected spans

Model or implementation: infini-gram 'Find' query

Relevance Ranker (Filtering & Ranking)

Order retrieved documents by topical relevance to the user query/response

Model or implementation: BM25 scoring
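For readers unfamiliar with BM25, here is a minimal sketch of the standard Okapi formula over tokenized documents. The parameter values k1=1.5 and b=0.75 are common defaults, not settings reported for OLMoTrace, and treating the retrieved documents themselves as the collection for document frequencies is an assumption of this sketch.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document in `docs` against `query` tokens."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in set(query):
            if term not in tf:
                continue
            # Smoothed IDF that stays non-negative.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term frequency saturates and is normalized by document length.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "space needle seattle world fair".split(),
    "pasta recipe with tomato sauce".split(),
    "seattle weather rain".split(),
]
scores = bm25_scores("space needle seattle".split(), docs)
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
# ranking == [0, 2, 1]
```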

Novel Architectural Elements

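Longest-matching-prefix discovery with one suffix array lookup per response suffix: when the suffix is absent from the corpus, the lookup returns a zero-length range, and the suffixes adjacent to that position reveal the longest prefix that does occur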
Parallelized architecture running independent suffix queries across distributed index shards on high-speed SSDs
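The neighbor-inspection idea can be illustrated on a toy in-memory suffix array: binary-search for where the query would be inserted, then compare it with the suffixes on either side of that position. This is a sketch of the general technique with hypothetical function names, not the production implementation, which runs over sharded on-disk indexes.

```python
def build_suffix_array(tokens):
    """Start positions of all suffixes, sorted lexicographically (naive)."""
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def common_prefix_len(a, b):
    """Number of leading tokens shared by sequences a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def longest_matching_prefix(query, tokens, sa):
    """Length of the longest prefix of `query` that occurs in `tokens`."""
    lo, hi = 0, len(sa)
    while lo < hi:  # insertion point of `query` among the sorted suffixes
        mid = (lo + hi) // 2
        if tokens[sa[mid]:] < query:
            lo = mid + 1
        else:
            hi = mid
    best = 0
    # In sorted order, the suffix sharing the longest prefix with the query
    # sits immediately before or after the insertion point.
    for rank in (lo - 1, lo):
        if 0 <= rank < len(sa):
            best = max(best, common_prefix_len(query, tokens[sa[rank]:]))
    return best


tokens = "a b c a b d".split()
sa = build_suffix_array(tokens)
response = "a b d x".split()
# One independent lookup per response suffix; these could run in parallel.
lengths = [longest_matching_prefix(response[i:], tokens, sa)
           for i in range(len(response))]
# lengths == [3, 2, 1, 0]
```

Since each response suffix is handled by its own lookup, the per-suffix work has no dependencies and parallelizes naturally.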

Modeling

Base Model: OLMo-2-32B-Instruct (and other OLMo variants)

Training Method: Not reported in the paper

Training Data:

Pre-training data (Dolma v1.7, 2.7T tokens)
Mid-training data (Dolma v1.7, 1.8T tokens)
Post-training data (Tulu 3 mix, SFT+DPO+RLVR)

Compute: Inference server: 64 vCPUs, 256GB RAM, 40TB SSD (Google Cloud Platform)

Comparison to Prior Work

vs. Influence Functions: OLMoTrace is purely lexical/verbatim (heuristic) but scales to trillions of tokens in real-time, whereas influence functions are computationally intractable at this scale
vs. RAG: OLMoTrace is post-hoc (analyzes output after generation) and strictly searches the *training distribution*, not an external knowledge base
vs. Search Engines: Indexes the specific version of data the model saw during training (including pre/mid/post-training splits) rather than the live web

Limitations

Only finds verbatim matches; cannot detect semantic influence or paraphrased content
Retrieval is limited to 10 documents per span, which may miss sources if a phrase is extremely common
Heuristic relevance ranking (BM25) may not always align perfectly with true causal influence
Requires access to the full tokenized training data, which is only possible for fully open models that release their data (open weights alone are not enough)

Reproducibility

Code: https://github.com/allenai/infinigram-api

Available: Code for the infini-gram engine (https://github.com/allenai/infinigram-api), live demo (https://playground.allenai.org). Training data for OLMo models is open. Missing: Exact scripts for the specific BM25 tuning and UI frontend code are not explicitly linked, though the core backend is open.

📊 Experiments & Results

Evaluation Setup

Latency benchmarking and human/LLM-based relevance evaluation of retrieved documents

Benchmarks:

Internal User Conversations (Real-world chatbot interactions) [New]

Metrics:

Inference Latency (seconds)
Relevance Score (0-3 scale, Human & GPT-4o)
Statistical methodology: Spearman correlation used to compare Human and GPT-4o rankings
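Spearman correlation is the Pearson correlation of the ranks of two score lists, with tied values sharing their average rank. The sketch below computes it from scratch; the example scores are made up for illustration and are not data from the paper.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined if either input is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def spearman(x, y):
    """Spearman rank correlation of two equal-length score lists."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical 0-3 relevance scores for five documents.
human_scores = [3, 2, 0, 1, 2]
gpt4o_scores = [3, 1, 0, 1, 2]
rho = spearman(human_scores, gpt4o_scores)
```

Ranking before correlating makes the measure sensitive only to the ordering of the scores, which suits a coarse 0-3 scale with many ties.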

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Internal User Conversations	Average Latency (seconds)	Not reported in the paper	4.46	Not reported in the paper
Internal User Conversations	Human Relevance Score, First Doc (0-3 scale)	Not reported in the paper	1.90	Not reported in the paper
Internal User Conversations	GPT-4o Relevance Score, First Doc (0-3 scale)	Not reported in the paper	1.82	Not reported in the paper

Experiment Figures

The logic for finding the Longest Common Prefix (LCP) using the Suffix Array in one step

Distributions of span lengths and relevance scores

Main Takeaways

Verbatim matching effectively uncovers training data sources for facts, creative writing, and math solutions
The majority of retrieved documents (96.7%) come from pre-training data, with very few from post-training (SFT/DPO)
Long unique spans (low unigram probability) are better search queries than simply the longest spans
Real-time tracing on trillion-scale data is feasible with SSD-based suffix arrays

📚 Prerequisite Knowledge

Prerequisites

Understanding of Suffix Arrays for string search
Basic knowledge of tokenization in LLMs
Familiarity with BM25 for information retrieval

Key Terms

suffix array (SA): A data structure that stores all suffixes of a text in lexicographical order, allowing fast substring search

infini-gram: A search engine built on suffix arrays that efficiently counts and locates query strings in massive corpora (trillions of tokens)

maximal matching span: A sequence of tokens in the output that appears in the training data and cannot be extended left or right while maintaining a match

span unigram probability: The product of the unigram probabilities of all tokens in a span; used to measure how 'surprising' or unique a span is

BM25: A ranking function used in information retrieval to estimate the relevance of documents to a given search query based on term frequency and document length

LCP: Longest Common Prefix—the length of the shared initial sequence between two strings

SFT: Supervised Fine-Tuning—a training phase using labeled instruction-following data

DPO: Direct Preference Optimization—a training method to align models with human preferences using paired data

RLVR: Reinforcement Learning with Verifiable Rewards, a post-training method that rewards the model only when its output passes an automatic correctness check (for example, a math answer that matches the reference)