QuCo-RAG: Quantifying Uncertainty from the Pre-training Corpus for Dynamic Retrieval-Augmented Generation

📝 Paper Summary

Modularized RAG pipeline, Hallucination suppression

QuCo-RAG determines when to retrieve by checking if entities in the question or generated answer are rare or non-co-occurring in the pre-training corpus, replacing unreliable model confidence scores with objective statistical evidence.

Core Problem

Existing dynamic RAG methods rely on internal model signals (logits, entropy) to detect hallucinations, but these are unreliable because LLMs are often ill-calibrated and confidently wrong.

Why it matters:

LLMs frequently exhibit 'confident hallucinations,' assigning high probability to factually incorrect statements, which fools entropy-based detectors
Theoretical work suggests even perfectly calibrated models must hallucinate on rare facts to maintain statistical consistency
Current methods like DRAGIN fail to trigger retrieval for obvious errors (e.g., wrong names) while flagging safe tokens, wasting compute on unnecessary searches

Concrete Example: When generating the director of a movie, the baseline DRAGIN assigns low uncertainty (high confidence) to the incorrect name 'Mario Camerini', failing to trigger retrieval. Conversely, it assigns high uncertainty to the common word 'Il' in the question, triggering useless retrieval. QuCo-RAG detects that 'Mario Camerini' never co-occurs with the movie title in the pre-training corpus, correctly flagging the error.

Key Novelty

Corpus-Grounded Uncertainty Quantification

Shifts uncertainty estimation from subjective internal model states (logits/entropy) to objective external statistics derived directly from the pre-training corpus
Uses two specific statistical signals: low-frequency entities in the question (indicating input ignorance) and zero co-occurrence between entities in the generated answer (indicating lack of evidential support)
Leverages efficient suffix-array infrastructure (Infini-gram) to query these statistics from trillion-token corpora in milliseconds during inference
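The two signals above can be illustrated on a toy token list. This is a minimal sketch, not the paper's implementation: the function names, the tiny corpus, and the choice to measure the window between phrase start positions are all assumptions made for illustration, and a real system would query a suffix-array index instead of scanning.

```python
from typing import List, Sequence


def find_occurrences(tokens: Sequence[str], phrase: Sequence[str]) -> List[int]:
    """Start positions at which `phrase` occurs contiguously in `tokens`."""
    n, m = len(tokens), len(phrase)
    if m == 0:
        return []
    return [i for i in range(n - m + 1) if list(tokens[i:i + m]) == list(phrase)]


def entity_frequency(tokens: Sequence[str], entity: str) -> int:
    """Signal 1: how often an entity string appears in the corpus."""
    return len(find_occurrences(tokens, entity.split()))


def cooccurrence_count(tokens: Sequence[str], head: str, tail: str,
                       window: int = 1000) -> int:
    """Signal 2: occurrences of `head` with some `tail` within `window` tokens.

    Distance is measured between start positions (an assumption of this sketch).
    """
    head_pos = find_occurrences(tokens, head.split())
    tail_pos = find_occurrences(tokens, tail.split())
    return sum(1 for h in head_pos if any(abs(h - t) <= window for t in tail_pos))


# Hypothetical corpus with made-up names, whitespace-tokenized for simplicity.
corpus = "the film silent river was directed by jane doe . john roe wrote a novel".split()

print(entity_frequency(corpus, "jane doe"))                        # 1
print(cooccurrence_count(corpus, "silent river", "jane doe"))      # 1
print(cooccurrence_count(corpus, "silent river", "mary major"))    # 0
```

A count of zero for the last pair is the kind of evidence that would flag a generated claim linking those two entities.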

Architecture

The QuCo-RAG inference pipeline illustrating the two-stage uncertainty quantification process.

Evaluation Highlights

+5.4 to +12.0 Exact Match (EM) improvement over state-of-the-art baselines (SeaKR, ETC, DRAGIN) on OLMo-2 models across HotpotQA and 2WikiMultihopQA
Achieves +14.1 EM gain on 2WikiMultihopQA with Qwen2.5-32B by using OLMo-2's corpus as a proxy, demonstrating strong cross-model transferability
Outperforms GPT-5-chat's built-in web search by +5.5 to +8.7 EM points, suggesting that targeted statistical verification can beat generic agentic search for specific fact retrieval

Breakthrough Assessment

9/10

Establishes a new paradigm for dynamic RAG by grounding decisions in objective corpus data rather than unreliable model logits. The cross-model transferability (using one model's corpus to check another) is a highly practical and significant finding.

⚙️ Technical Details

Problem Definition

Setting: Dynamic Retrieval-Augmented Generation where the system must decide at each step whether to retrieve external information

Inputs: Natural language question Q

Outputs: Generated response y = (s1, s2, ..., sN), a sequence of N sentences

Pipeline Flow

Pre-Generation Assessment (Check Question Entities → Retrieve if Rare)
Generation (Sentence-by-Sentence)
Runtime Verification (Extract Triplets → Check Co-occurrence → Retrieve & Regenerate if Zero Co-occurrence)
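The three steps above can be sketched as a control loop. This is a hedged outline under stated assumptions: every callable is a hypothetical stand-in injected by the caller, the regeneration query built from the head entity and relation is an assumption, and the demo stubs use made-up names rather than anything from the paper.

```python
from typing import Callable, List, Optional, Tuple

Triplet = Tuple[str, str, str]  # (head, relation, tail)


def dynamic_rag_generate(
    question: str,
    needs_pre_retrieval: Callable[[str], bool],
    generate_sentence: Callable[[str, List[str], List[str]], Optional[str]],
    unsupported_triplets: Callable[[str], List[Triplet]],
    retrieve: Callable[[str], List[str]],
    max_sentences: int = 8,
) -> List[str]:
    context: List[str] = []
    # Stage 1: pre-generation assessment of the question's entities.
    if needs_pre_retrieval(question):
        context.extend(retrieve(question))
    sentences: List[str] = []
    for _ in range(max_sentences):
        sentence = generate_sentence(question, context, sentences)
        if sentence is None:  # the generator signals it has finished
            break
        # Stage 2: runtime verification of the sentence just produced.
        flagged = unsupported_triplets(sentence)
        if flagged:
            head, relation, _tail = flagged[0]
            context.extend(retrieve(f"{head} {relation}"))  # assumed query form
            sentence = generate_sentence(question, context, sentences)
            if sentence is None:
                break
        sentences.append(sentence)
    return sentences


# Demo with hypothetical stubs: the model errs without context, then corrects.
retrieval_log: List[str] = []


def _generate(q: str, ctx: List[str], prev: List[str]) -> Optional[str]:
    if prev:
        return None
    return "Silent River was directed by Jane Doe." if ctx else "Silent River was directed by John Roe."


def _unsupported(s: str) -> List[Triplet]:
    return [("silent river", "directed by", "john roe")] if "John Roe" in s else []


def _retrieve(query: str) -> List[str]:
    retrieval_log.append(query)
    return ["(retrieved passage about silent river)"]


result = dynamic_rag_generate("Who directed Silent River?", lambda q: False,
                              _generate, _unsupported, _retrieve)
print(result)
print(retrieval_log)
```

Keeping the checks as injected callables separates the control flow from how rarity and co-occurrence are actually measured.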

System Modules

Entity Extractor

Identify key entities in the user question

Model or implementation: Lightweight extractor (fine-tuned Qwen2.5-0.5B)

Infini-gram Interface (Retrieval & Selection)

Query the pre-training corpus for entity frequency and co-occurrence counts

Model or implementation: Suffix array index over 4T token corpus (OLMo-2 pre-training data)

Triplet Extractor

Extract knowledge triplets (head, relation, tail) from generated sentences for verification

Model or implementation: Specialized 0.5B model distilled from GPT-4o-mini (fine-tuned Qwen2.5-0.5B-Instruct)

Retriever (Retrieval & Selection)

Retrieve documents from Wikipedia when uncertainty is detected

Model or implementation: BM25 (sparse retrieval)
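For readers unfamiliar with the retriever, BM25 scoring can be sketched in a few lines. This is a generic textbook variant (with the non-negative idf form popularized by Lucene), not the paper's retrieval stack: the whitespace tokenizer, the defaults k1=1.5 and b=0.75, and the example documents are assumptions.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: str, docs: List[str], k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score each document against the query with Okapi BM25."""
    if not docs:
        return []
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in tokenized for term in set(d))
    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: long documents are penalized via b.
            denom = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores


docs = [
    "silent river is a film directed by jane doe",
    "john roe wrote a novel about paris",
    "paris is the capital of france",
]
scores = bm25_scores("silent river director", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(best, scores)
```

Only the first document shares terms with the query, so it is the only one with a positive score.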

Novel Architectural Elements

Dual-stage statistical verification loop: Pre-generation check for input ignorance + Runtime check for output hallucination
Integration of an external corpus-indexing engine (Infini-gram) directly into the generation loop as a verifier
Proxy-corpus architecture: Using statistics from one model's open corpus (OLMo-2) to verify generations from other models, including closed ones (GPT-4.1, GPT-5) and open-weight ones (Llama-3, Qwen2.5)

Modeling

Base Model: OLMo-2 (7B, 13B, 32B), Llama-3-8B-Instruct, Qwen2.5-32B-Instruct, GPT-4.1/5

Training Method: Distillation / Supervised Fine-Tuning (for the auxiliary extractor model only)

Adaptation: Full fine-tuning of the small 0.5B extractor model

Training Data:

40K annotated examples for triplet extraction generated via GPT-4o-mini using in-context learning

Key Hyperparameters:

entity_frequency_threshold_tau_entity: 1000
co_occurrence_threshold_tau_cooc: 1
co_occurrence_window_size: 1000 tokens
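One way to read these hyperparameters is as thresholds in two trigger rules. The sketch below is an interpretation, not the paper's code: the strict less-than comparisons, the use of `any` over question entities (rather than a minimum or mean), and the `freq` and `cooc` lookup callables are all assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple

Triplet = Tuple[str, str, str]  # (head, relation, tail)


@dataclass(frozen=True)
class TriggerConfig:
    tau_entity: int = 1000     # entity frequency threshold
    tau_cooc: int = 1          # co-occurrence threshold
    window_tokens: int = 1000  # co-occurrence window size


def needs_pre_retrieval(entities: Iterable[str],
                        freq: Callable[[str], int],
                        cfg: TriggerConfig = TriggerConfig()) -> bool:
    """Retrieve before generating if any question entity is rare."""
    return any(freq(e) < cfg.tau_entity for e in entities)


def unsupported_triplets(triplets: Iterable[Triplet],
                         cooc: Callable[[str, str, int], int],
                         cfg: TriggerConfig = TriggerConfig()) -> List[Triplet]:
    """Keep triplets whose head and tail co-occur fewer than tau_cooc times."""
    return [t for t in triplets if cooc(t[0], t[2], cfg.window_tokens) < cfg.tau_cooc]


# Hypothetical lookup tables standing in for corpus queries.
freq_table = {"silent river": 420, "paris": 2_000_000}
print(needs_pre_retrieval(["silent river"], lambda e: freq_table.get(e, 0)))  # True
print(needs_pre_retrieval(["paris"], lambda e: freq_table.get(e, 0)))         # False
```

With tau_cooc set to 1, the second rule fires only on zero co-occurrence.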

Compute: Inference uses NVIDIA H200 GPUs. Infini-gram requires CPU and disk storage (index of 4T tokens).

Comparison to Prior Work

vs. FLARE/DRAGIN/SeaKR: QuCo-RAG uses external corpus statistics (frequency/co-occurrence) instead of internal model signals (logits/entropy) which are often miscalibrated
vs. Self-RAG: Does not require fine-tuning the main LLM; operates as a plug-and-play inference-time module
vs. FacTool [not cited in paper]: Similar in verifying facts, but QuCo-RAG focuses on dynamic retrieval triggering via corpus statistics rather than post-hoc fact-checking
vs. CRAG (Corrective RAG) [not cited in paper]: QuCo-RAG uses pre-training statistics for correction, whereas CRAG typically uses a trained evaluator

Limitations

Dependency on a proxy corpus (OLMo-2) for closed models; assumes knowledge distribution overlap
Requires an Infini-gram index, which is storage-intensive (though efficient to query)
The co-occurrence signal is asymmetric: zero co-occurrence reliably indicates risk, but high co-occurrence does not guarantee correctness, since two entities can appear together without the claimed relation holding
Effectiveness decreases if the pre-training corpus does not cover the domain (though transferability results suggest robustness)

Reproducibility

Code: https://github.com/ZhishanQ/QuCo-RAG

Code is publicly available. Infini-gram API is public and hosts the OLMo-2 corpus index. The auxiliary triplet extractor model is distilled from GPT-4o-mini; training data (40K examples) and the distilled model weights are required for full replication (paper implies code release covers this).

📊 Experiments & Results

Evaluation Setup

Open-domain multi-hop QA using Wikipedia as the retrieval source

Benchmarks:

2WikiMultihopQA (Multi-hop reasoning QA)
HotpotQA (Multi-hop reasoning QA)
PubMedQA (Biomedical QA, used to test domain generalization)

Metrics:

Exact Match (EM)
F1 score
Number of retrievals per question
Number of tokens generated
Statistical methodology: Not explicitly reported in the paper
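For reference, EM and token-level F1 are commonly computed as below. This follows the widely used SQuAD-style normalization (lowercasing, stripping punctuation and English articles); whether the paper's evaluation script normalizes in exactly this way is an assumption.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    return float(normalize(prediction) == normalize(gold))


def f1_score(prediction: str, gold: str) -> float:
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The Jane Doe.", "jane doe"))   # 1.0
print(exact_match("Jane Doe Smith", "jane doe"))  # 0.0
print(round(f1_score("Jane Doe Smith", "jane doe"), 2))  # 0.8
```

F1 gives partial credit for overlapping tokens, which is why it is usually reported alongside the stricter EM.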

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Main comparison on OLMo-2 models (Matched Corpus Setting) shows QuCo-RAG outperforming all baselines.
2WikiMultihopQA	EM	25.3	32.7	+7.4
HotpotQA	EM	29.7	35.3	+5.6
2WikiMultihopQA	EM	29.6	41.7	+12.1
HotpotQA	EM	29.7	35.0	+5.3
HotpotQA	EM	36.0	46.8	+10.8
Transferability experiments using OLMo-2 corpus as a proxy for other models (Qwen, Llama, GPT).
2WikiMultihopQA	EM	35.9	50.0	+14.1
2WikiMultihopQA	EM	60.0	64.6	+4.6
2WikiMultihopQA	EM	51.0	59.7	+8.7
Domain generalization on PubMedQA.
PubMedQA	Accuracy	63.4	66.4	+3.0

Experiment Figures

Efficiency-Performance trade-off plots (EM vs. Tokens, EM vs. Latency, EM vs. Retrieval Count) on HotpotQA.

EM scores broken down by entity frequency bins in the pre-training corpus.

Main Takeaways

Internal uncertainty signals (entropy, logits) are unreliable for hallucination detection; corpus-based statistics provide a much stronger signal.
The method is effectively model-agnostic: statistics from one open corpus (OLMo-2) transfer well to other model families (Qwen2.5, Llama-3, GPT-4.1/5), likely due to high overlap in web-scale pre-training data.
Ablation studies show that the 'Runtime Claim Verification' (co-occurrence check) is the most critical component, contributing more to performance than the pre-generation check.
QuCo-RAG is highly efficient, achieving the best performance with fewer retrievals (1.70 vs >2.79 for FLARE) and fewer tokens generated compared to baselines.

📚 Prerequisite Knowledge

Prerequisites

Understanding of RAG (Retrieval-Augmented Generation)
Familiarity with LLM calibration issues (confidence vs. accuracy)
Basic knowledge of n-gram statistics and suffix arrays

Key Terms

Infini-gram: A search engine using suffix arrays to compute n-gram frequencies over massive corpora (e.g., 4 trillion tokens) with millisecond latency

EM: Exact Match—a metric measuring if the generated answer string exactly matches the ground truth

Dynamic RAG: RAG systems that adaptively decide when to retrieve during generation, rather than always retrieving once at the start

Hallucination: Generated content that is factually incorrect or unfaithful to the source, often produced with high confidence

Co-occurrence: The frequency with which two entities appear together within a specific window (1000 tokens in this work's configuration) in the pre-training corpus

SFT: Supervised Fine-Tuning—training a model on labeled examples to follow instructions

BM25: Best Matching 25—a probabilistic information retrieval function that ranks documents based on the query terms appearing in each document

Zero co-occurrence: When two entities never appear together in the same context window in the entire training corpus, strongly suggesting the model has no evidence connecting them

OLMo-2: Open Language Model 2—a fully open-source LLM family where the pre-training data is publicly available, allowing direct statistical analysis
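To make the Infini-gram entry above concrete, here is a toy illustration of counting n-grams with a suffix array. It is a teaching sketch only: it sorts suffixes naively in memory over string tokens, which would not scale to a trillion-token corpus, and the function names are hypothetical rather than part of any real Infini-gram API.

```python
from typing import List


def build_suffix_array(tokens: List[str]) -> List[int]:
    """Suffix start positions, sorted lexicographically by the suffix they begin."""
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def count_ngram(tokens: List[str], sa: List[int], phrase: List[str]) -> int:
    """Count occurrences of `phrase` using two binary searches over `sa`."""
    m = len(phrase)

    def prefix(rank: int) -> List[str]:
        start = sa[rank]
        return tokens[start:start + m]

    # Lower bound: first suffix whose m-token prefix is >= phrase.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < phrase:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    # Upper bound: first suffix whose m-token prefix is > phrase.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= phrase:
            lo = mid + 1
        else:
            hi = mid
    return lo - first


tokens = "a b a b c a b".split()
sa = build_suffix_array(tokens)
print(count_ngram(tokens, sa, ["a", "b"]))  # 3
print(count_ngram(tokens, sa, ["b", "c"]))  # 1
print(count_ngram(tokens, sa, ["c", "b"]))  # 0
```

Because all suffixes sharing a prefix sit in one contiguous block of the sorted array, each count costs only a logarithmic number of comparisons once the index is built, which is what makes lookups at inference time practical.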