IRB: Automated Generation of Robust Factuality Benchmarks

📝 Paper Summary

Modularized RAG pipeline

IRB automates the creation of RAG benchmarks by using human-written Wikipedia citations as a factual scaffold and knowledge graphs as an algorithmic scaffold to generate controlled, verifiable, and complex question-answer pairs.

Core Problem

RAG benchmarks saturate quickly and become contaminated as newer models memorize web-scale data. Manual benchmark creation is expensive, while purely LLM-based generation lacks control and grounding.

Why it matters:

Frontier models often memorize existing static benchmarks, making it impossible to distinguish between parametric knowledge and actual retrieval capabilities
Fully automated generation approaches without explicit grounding often produce unfaithful samples or lack control over question complexity (e.g., multi-hop reasoning)
Maintaining robust benchmarks requires constant, labor-intensive updates to include fresh data that models haven't seen during training

Concrete Example: A purely neural generator might hallucinate a plausible-sounding but false fact. IRB avoids this by extracting a sentence like 'X won Y award in 2024' directly from Wikipedia, verifying the cited URL supports it, and then programmatically transforming it into a multi-hop question via a knowledge graph.

Key Novelty

Factual and Algorithmic Scaffolding for Benchmark Generation

Uses 'factual scaffold': Extracts facts from human-written Wikipedia sentences with citations, treating the cited URLs as ground-truth evidence to ensure grounding
Uses 'algorithmic scaffold': Converts facts into intermediate knowledge graphs to programmatically control question complexity (single-hop, multi-hop, false-premise) and prevent trivial generation

Architecture

Figure: The automated question generation pipeline of IRB.

Evaluation Highlights

Generated IRB1K benchmark containing 1,000 questions from 2024-2025 Wikipedia articles, challenging frontier models significantly in closed-book settings
Retrieval acts as an 'equalizer': The performance gap between top and bottom models shrinks by ~4x when retrieval is enabled compared to closed-book
Reasoning models (e.g., GPT-5, DeepSeek-R1) show superior reliability in adversarial settings, such as handling false-premise questions and incorrect retrieval contexts

Breakthrough Assessment

8/10

Offers a scalable, controlled solution to the critical problem of benchmark contamination. The dual-scaffold approach balances automation with rigorous factual grounding.

⚙️ Technical Details

Problem Definition

Setting: Automated generation of Question-Answer (QA) pairs for RAG evaluation, where each pair is associated with ground-truth evidence documents

Inputs: A collection of Wikipedia articles

Outputs: A benchmark dataset containing queries, attributes (e.g., hop count), ground-truth answers, and relevance judgments
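
To make the output description concrete, a single benchmark record could be represented as below. This is a minimal sketch: the class name `BenchmarkItem` and every field name are illustrative assumptions, not the schema of the released IRB1K files, and the example values reuse the placeholder fact 'X won Y award in 2024'.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict


@dataclass
class BenchmarkItem:
    """One generated question with its attributes and evidence (hypothetical schema)."""
    query: str
    answer: str
    question_type: str  # e.g. "single-hop", "multi-hop", "false-premise"
    hop_count: int
    # Relevance judgments: evidence document id or URL -> relevance grade.
    qrels: Dict[str, int] = field(default_factory=dict)


item = BenchmarkItem(
    query="Which award did X win in 2024?",
    answer="Y award",
    question_type="single-hop",
    hop_count=1,
    qrels={"https://example.org/cited-source": 1},  # placeholder URL
)

# Serialize as one JSON line, a common layout for benchmark files.
line = json.dumps(asdict(item))
```

Keeping the relevance judgments next to the question lets the same record serve both answer scoring and retriever scoring.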

Pipeline Flow

Fact Extraction: Citing sentences → Keypoints → Groundedness Check
KG Construction: Keypoints → Knowledge Graph → Coverage Check
KG Transformation: Masking/Paraphrasing/False-premise Injection
Question Generation: Step-by-step generation → Answerability Check → Refinement
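
Each stage in the flow above pairs a generation step with a check that discards bad outputs. The sketch below shows that produce-then-filter pattern only. The helper `stage` and the toy functions `split_into_keypoints` and `is_grounded` are hypothetical stand-ins; in IRB the corresponding steps are LLM prompts and a comparison against the cited page.

```python
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")
U = TypeVar("U")


def stage(
    items: Iterable[T],
    produce: Callable[[T], List[U]],
    check: Callable[[U], bool],
) -> List[U]:
    """Run a producer over every item and keep only outputs that pass the check."""
    kept: List[U] = []
    for item in items:
        for output in produce(item):
            if check(output):
                kept.append(output)
    return kept


def split_into_keypoints(sentence: str) -> List[str]:
    # Toy stand-in for LLM-based splitting and decontextualization.
    return [part.strip() for part in sentence.split(" and ") if part.strip()]


def is_grounded(keypoint: str) -> bool:
    # Toy stand-in for the groundedness check against the cited URL.
    return "rumored" not in keypoint


sentences = ["X won Y award in 2024 and X is rumored to retire"]
keypoints = stage(sentences, split_into_keypoints, is_grounded)
```

Here the second fragment is dropped by the check, so only the supported keypoint moves on to graph construction.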

System Modules

Fact Extractor

Extracts citing sentences from Wikipedia, splits them into atomic segments, and decontextualizes them into self-contained 'keypoints'

Model or implementation: LLM-based (Prompting)
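
Before any LLM is involved, citing sentences have to be located in the article source. The sketch below shows one simple way to pair a sentence with the URLs in the `<ref>` tag that follows it. It is an assumption about how this step could look, not the paper's implementation: it handles only plain `<ref>...</ref>` pairs in raw wikitext, and the function name `citing_sentences` is hypothetical.

```python
import re
from typing import List, Tuple

# Toy patterns: self-closing <ref name="a" /> tags and nested templates
# are out of scope for this sketch.
REF = re.compile(r"<ref\b[^>/]*>(.*?)</ref>", re.DOTALL)
URL = re.compile(r"https?://[^\s|\]}<]+")
SENTENCE_BREAK = re.compile(r"(?<=[.!?])\s+")


def citing_sentences(wikitext: str) -> List[Tuple[str, List[str]]]:
    """Return (sentence, cited_urls) pairs for sentences followed by a <ref>."""
    results: List[Tuple[str, List[str]]] = []
    cursor = 0
    for match in REF.finditer(wikitext):
        preceding = wikitext[cursor:match.start()].strip()
        urls = URL.findall(match.group(1))
        if preceding:
            # The cited sentence is the last one before the reference.
            sentence = SENTENCE_BREAK.split(preceding)[-1]
            results.append((sentence, urls))
        elif results:
            # Back-to-back references cite the same sentence.
            results[-1][1].extend(urls)
        cursor = match.end()
    return results


sample = (
    "X is a singer. X won Y award in 2024."
    "<ref>{{cite web |url=https://example.org/award |title=Award}}</ref>"
    " X lives in Z."
)
pairs = citing_sentences(sample)
```

Sentences without a following reference, such as the last one in `sample`, are skipped because there is no cited URL to check them against.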

Graph Constructor (Question Synthesis)

Converts text keypoints into a structured Knowledge Graph (triplets)

Model or implementation: LLM-based (Prompting)
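
A triplet-based knowledge graph can be held in a very small data structure. The sketch below hand-writes the triples for the placeholder keypoint 'X won Y award in 2024' (in IRB an LLM produces them) and builds a subject index for lookups. The names `Triple`, `build_index`, and the relation labels are illustrative assumptions.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple


class Triple(NamedTuple):
    subject: str
    relation: str
    obj: str


# Hand-written triples for the keypoint "X won Y award in 2024".
triples = [
    Triple("X", "won", "Y award"),
    Triple("X", "won_in_year", "2024"),
]


def build_index(ts: List[Triple]) -> Dict[str, List[Tuple[str, str]]]:
    """Map each subject to its outgoing (relation, object) edges."""
    index: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for t in ts:
        index[t.subject].append((t.relation, t.obj))
    return dict(index)


index = build_index(triples)
```

Storing facts as explicit edges is what allows later steps to edit the graph with ordinary code instead of another prompt.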

Graph Transformer (Question Synthesis)

Modifies the KG to define question difficulty

Model or implementation: Algorithmic / Rule-based
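
Because this module is rule-based, its operations can be illustrated with plain graph edits. The sketch below shows two of them under stated assumptions: masking nodes to define the answer and a bridge entity for a two-hop question, and swapping an object to inject a false premise. The function names, placeholder strings, and example triples are hypothetical; the paper's exact rules are not reproduced here.

```python
from typing import List, NamedTuple


class Triple(NamedTuple):
    subject: str
    relation: str
    obj: str


def mask_node(triples: List[Triple], node: str, placeholder: str) -> List[Triple]:
    """Replace every occurrence of `node` with a placeholder variable."""
    return [Triple(*(placeholder if v == node else v for v in t)) for t in triples]


def inject_false_premise(triples: List[Triple], index: int, wrong_obj: str) -> List[Triple]:
    """Swap the object of one triple so the resulting question presupposes a false fact."""
    out = list(triples)
    out[index] = out[index]._replace(obj=wrong_obj)
    return out


kg = [
    Triple("X", "won", "Y award"),
    Triple("Y award", "presented_by", "Z academy"),
]

# Two-hop question: hide the answer and the bridge entity that links both triples,
# e.g. "Which organization presents the award that X won?"
two_hop = mask_node(mask_node(kg, "Z academy", "?ans"), "Y award", "?bridge")

# False-premise variant: the question will assume X won a different award.
false_premise = inject_false_premise(kg, 0, "W award")
```

The number of triples that must be chained to reach `?ans` (two here) gives a hop count that is known by construction rather than estimated afterwards.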

Question Generator (Question Synthesis)

Generates natural language questions from the transformed KG

Model or implementation: LLM-based (Prompting)

Refiner & Validator

Checks if the question has a unique answer and refines phrasing

Model or implementation: LLM-based (Prompting)

Novel Architectural Elements

Two-stage scaffolding architecture: First locking down facts via citation analysis (Factual Scaffold), then controlling complexity via graph manipulation (Algorithmic Scaffold)
Programmatic injection of false premises and multi-hop constraints via graph operations rather than by prompting an LLM alone

Modeling

Base Model: Evaluated on GPT-4.1, GPT-5-mini, GPT-5, Llama-3.3-70B, Llama-4-Scout, DeepSeek-R1, Qwen3-Next-80B

Compute: Generation of 1,838 questions cost ~$18 and took 16 hours using GPT-4.1-mini

Comparison to Prior Work

vs. RAGEval: Uses structured knowledge graphs for control vs. purely persona-based prompting
vs. DataMorgana: Explicit grounding in human-selected citations vs. purely neural generation
vs. FreshStack: Uses Wikipedia citing sentences as evidence ground truth vs. StackOverflow Q&A
vs. Auto-RAG [not cited in paper]: Auto-RAG typically filters existing datasets, while IRB generates new data from scratch with explicit complexity controls

Limitations

Reliance on Wikipedia means domain coverage is limited to encyclopedic knowledge
Groundedness checks can fail due to technical issues like offline websites or dynamic content
Currently restricted to text-only retrieval; does not handle multimodal verification
Graph-based generation can occasionally produce malformed questions if node types are misidentified

Reproducibility

Code: https://github.com/Hozaifa-Bhutta/IRB

Code and data (IRB1K) are open-sourced at https://github.com/Hozaifa-Bhutta/IRB. The generation pipeline relies on the OpenAI API (GPT-4.1-mini) and uses the Wikipedia dump from September 29, 2025.

📊 Experiments & Results

Evaluation Setup

RAG evaluation using the generated IRB1K benchmark (1,000 questions)

Benchmarks:

IRB1K: Factual QA (single-hop, multi-hop, false-premise) [New]

Metrics:

Correctness (LLM-based evaluation)
nDCG@5 (Retriever performance)
Statistical methodology: Not explicitly reported in the paper
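
For readers unfamiliar with the retriever metric, nDCG@5 can be computed as sketched below. This uses the common linear-gain formulation with a log2 position discount. Whether the paper uses linear or exponential gain is not stated in this summary, so that choice is an assumption (the two coincide for binary relevance). The document ids are made up.

```python
import math
from typing import Dict, List


def dcg(gains: List[float]) -> float:
    """Discounted cumulative gain: the gain at rank r (1-based) is divided by log2(r + 1)."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))


def ndcg_at_k(ranked_doc_ids: List[str], qrels: Dict[str, int], k: int = 5) -> float:
    """nDCG@k of a ranking against relevance judgments (doc id -> grade)."""
    gains = [float(qrels.get(doc_id, 0)) for doc_id in ranked_doc_ids[:k]]
    ideal = sorted((float(g) for g in qrels.values()), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


qrels = {"d1": 1, "d2": 1}  # two relevant documents
ranking = ["d3", "d1", "d4", "d2", "d5"]  # relevant documents at ranks 2 and 4
score = ndcg_at_k(ranking, qrels)  # roughly 0.65
perfect = ndcg_at_k(["d1", "d2", "d3"], qrels)
```

A ranking that places every relevant document first scores 1.0, and pushing relevant documents down the list lowers the score even when they all remain in the top five.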

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Closed-book performance is generally low, confirming the difficulty of the benchmark.
IRB1K (Closed-book)	Correctness	13.00	51.10	+38.10
RAG performance shows retrieval acts as an equalizer, shrinking the gap between models.
IRB1K (RAG)	Correctness	13.00	69.10	+56.10
IRB1K (RAG)	Correctness	69.10	78.45	+9.35
System performance is highly sensitive to retrieval quality.
IRB1K	Correctness	39.81	95.53	+55.72
Retriever performance varies by topic and freshness.
IRB1K	nDCG@5	0.68	0.68	0.00
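
The 'equalizer' effect can be checked with a line of arithmetic. The sketch below assumes the closed-book row gives the weakest and strongest model scores (13.00 and 51.10) and the second RAG row gives the same pair with retrieval (69.10 and 78.45). That reading of the table is an interpretation, not something the rows state explicitly.

```python
# Assumed reading of the rows above: (weakest model, strongest model) correctness.
closed_book_low, closed_book_high = 13.00, 51.10
rag_low, rag_high = 69.10, 78.45

gap_closed_book = closed_book_high - closed_book_low  # 38.10 points
gap_rag = rag_high - rag_low  # 9.35 points

shrink_factor = gap_closed_book / gap_rag  # about 4.07
```

Under that reading, the gap between models narrows by a factor of about 4, which is consistent with the roughly 4x figure reported in the summary.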

Main Takeaways

Retrieval acts as an 'equalizer', significantly reducing the correctness gap between weaker and stronger LLMs
Reasoning models (like GPT-5, DeepSeek-R1) are more robust to incorrect retrieval and false premises than non-reasoning models
Retriever quality is the primary bottleneck; ensuring correct retrieval yields larger gains than scaling the generator
Retrievers struggle significantly with fresh information and cross-lingual queries compared to static, monolingual facts

📚 Prerequisite Knowledge

Prerequisites

Understanding of Retrieval-Augmented Generation (RAG)
Familiarity with Knowledge Graphs (triplets, nodes, relations)
Basic knowledge of LLM prompting and evaluation

Key Terms

RAG: Retrieval-Augmented Generation—systems that improve LLM responses by retrieving relevant external documents

Knowledge Graph (KG): A structured representation of data as a graph where nodes are entities and edges are relationships

Scaffold: A guiding structure or constraint used to control the generation process of an LLM

Factual Scaffold: Using human-written citing sentences as the immutable basis for fact generation

Algorithmic Scaffold: Using a knowledge graph structure to programmatically dictate question type and complexity

nDCG@k: Normalized Discounted Cumulative Gain—a measure of ranking quality that considers position of relevant items

Closed-book setting: Asking the model to answer questions using only its parametric knowledge (what it learned during training), without external retrieval

False-premise: A question based on an incorrect assumption (e.g., 'When did the US President visit Mars?')

qrels: Query relevance judgments—annotations indicating which documents are relevant to a specific query