Discovering Knowledge Deficiencies of Language Models on Massive Knowledge Base

📝 Paper Summary

Dynamic Benchmarking · Automated Red Teaming · Factuality Evaluation

SEA formulates knowledge deficiency discovery as a stochastic optimization problem, iteratively retrieving new error-inducing candidates similar to previous failures using a relation directed acyclic graph.

Core Problem

Exhaustively evaluating LLMs against full-scale knowledge bases to find factual errors is computationally prohibitive, especially for closed-weight models with strict query budgets.

Why it matters:

LLMs frequently hallucinate factual information (e.g., misattributing citations or getting capitals wrong), which is dangerous in high-stakes domains like healthcare and law.
Static benchmarks suffer from data leakage and cannot cover the vast, evolving nature of human knowledge.
Existing automated discovery methods often rely on internal model probabilities (inaccessible for closed models) or lack efficient exploration strategies.

Concrete Example: A model might correctly answer common questions about France but fail on specific, nuanced facts in a long-tail document. Random sampling misses these rare failures, while SEA uses the initial failure to find semantically similar documents (e.g., about obscure French history) that likely trigger more errors.

Key Novelty

Stochastic Error Ascent (SEA)

Frames error discovery as an optimization loop: instead of random probing, it uses current failure cases to retrieve semantically similar 'error-prone' candidates from a massive corpus.
Constructs a Relation DAG (Directed Acyclic Graph) to model error propagation, linking source errors to new candidates and pruning low-impact paths to save budget.
Uses a hierarchical retrieval strategy (document-level then paragraph-level) to efficiently navigate massive knowledge bases like Wikipedia without exhaustive scanning.

Architecture

The Stochastic Error Ascent (SEA) framework workflow.

Evaluation Highlights

Uncovers 40.7× more knowledge errors than Automated Capability Discovery (ACD) on DeepSeek-V3 under the same budget.
Identifies 26.7% more errors than AutoBencher on average across 8 models, with a 61.5% relative improvement on DeepSeek-V3.
Reduces the cost-per-error by 599× compared to ACD and 9× compared to AutoBencher.

Breakthrough Assessment

8/10

Significantly improves efficiency in red-teaming closed models for factual errors. The formulation as stochastic optimization offers a scalable alternative to static benchmarks.

⚙️ Technical Details

Problem Definition

Setting: Maximize the error rate of a closed-weight model f_close over a subset of paragraphs S selected from a knowledge base K, subject to a query budget C.

Inputs: A massive knowledge base K (Wikipedia), a closed-weight model f_close, and a budget C.

Outputs: An optimal subset of paragraphs S* that induces high error rates in f_close.
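One way to write the setting above as a constrained optimization is sketched below. The notation is an assumption made for illustration and is not taken from the paper: Q(p) is the set of questions generated from paragraph p, a(q) is the reference answer to question q, and each question costs one query to f_close. The paper's exact objective may differ.

```latex
\[
S^{*} \;=\; \arg\max_{S \subseteq K}\;
\frac{1}{\sum_{p \in S} |Q(p)|}
\sum_{p \in S} \sum_{q \in Q(p)}
\mathbb{1}\!\left[ f_{\text{close}}(q) \neq a(q) \right]
\quad \text{s.t.} \quad \sum_{p \in S} |Q(p)| \le C
\]
```

Because K is far too large to enumerate and f_close exposes no gradients, the objective can only be estimated on sampled batches, which is what motivates the stochastic, iterative search.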

Pipeline Flow

Initialization: Sample initial batch B from K
Evaluation: Test f_close on B to identify source errors
Loop: Hierarchical Retrieval -> Update Relation DAG -> Prune Sources -> Test New Batch
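The flow above can be illustrated with a deliberately small toy loop. This is a sketch under stated assumptions, not the authors' implementation: paragraphs are integer ids with a one-dimensional "embedding", the target model is replaced by a hypothetical `is_error` stub, each paragraph costs one query, and the Relation DAG and pruning steps are left out. All names (`sea_loop`, `toy_kb`, `toy_is_error`) are invented for this example.

```python
from typing import Callable, Dict, List, Tuple


def sea_loop(
    kb: Dict[int, float],
    is_error: Callable[[int], bool],
    initial_batch: List[int],
    budget: int,
    k: int = 4,
) -> Tuple[List[int], int]:
    """Toy error-driven search. Returns (error paragraph ids, queries used)."""
    seen = set()
    errors: List[int] = []
    queries = 0
    batch = list(initial_batch)
    while batch and queries < budget:
        batch = batch[: budget - queries]  # never exceed the query budget
        new_errors = []
        for pid in batch:
            seen.add(pid)
            queries += 1  # assumption: one query per paragraph
            if is_error(pid):
                new_errors.append(pid)
        errors.extend(new_errors)
        # Retrieval step: the next batch is the k nearest unseen paragraphs
        # to each paragraph that just produced an error.
        candidates: List[int] = []
        for src in new_errors:
            unseen = [p for p in kb if p not in seen and p not in candidates]
            unseen.sort(key=lambda p: abs(kb[p] - kb[src]))
            candidates.extend(unseen[:k])
        batch = candidates
    return errors, queries


# Hypothetical toy setup: 100 paragraphs on a line; the stub "model" fails
# on one contiguous cluster, mimicking semantically clustered errors.
toy_kb = {i: float(i) for i in range(100)}
toy_is_error = lambda pid: 40 <= pid < 60
found, used = sea_loop(toy_kb, toy_is_error, [5, 25, 45, 65, 85], budget=30)
print(f"{len(found)} errors found in {used} queries")
```

In this toy setting a single initial failure is enough to steer most of the remaining budget into the error cluster. Without pruning, the loop also keeps expanding from sources near the cluster boundary, where the neighbours no longer fail. That wasted budget is what the pruning step is meant to reduce.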

System Modules

Question Generator

Generates multiple-choice QA pairs from a given paragraph to test the target model

Model or implementation: gpt-4o

Hierarchical Retriever (Optimization / Search)

Finds new candidates semantically similar to current errors

Model or implementation: mGTE (Sentence Transformer)
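A minimal sketch of the two-stage idea (documents first, then paragraphs within the selected documents) follows. It assumes hand-written toy vectors in place of real mGTE embeddings, and the function names (`cosine`, `top_k`, `hierarchical_retrieve`) and data are hypothetical. A real system would likely use an approximate nearest-neighbour index rather than a full sort.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def top_k(query: Vector, items: Dict[str, Vector], k: int) -> List[str]:
    """Ids of the k items most similar to the query, best first."""
    ranked = sorted(items, key=lambda name: cosine(query, items[name]), reverse=True)
    return ranked[:k]


def hierarchical_retrieve(
    query: Vector,
    doc_vecs: Dict[str, Vector],
    para_vecs: Dict[str, Dict[str, Vector]],
    k_docs: int,
    k_paras: int,
) -> List[str]:
    """Stage 1: pick k_docs documents. Stage 2: rank only their paragraphs."""
    docs = top_k(query, doc_vecs, k_docs)
    pool = {pid: vec for d in docs for pid, vec in para_vecs[d].items()}
    return top_k(query, pool, k_paras)


# Toy data (hypothetical): the query stands for the embedding of a failed paragraph.
doc_vecs = {"french_history": (1.0, 0.1), "cooking": (0.0, 1.0), "french_geo": (0.9, 0.3)}
para_vecs = {
    "french_history": {"fh_1": (1.0, 0.0), "fh_2": (0.8, 0.6)},
    "cooking": {"ck_1": (0.1, 1.0)},
    "french_geo": {"fg_1": (0.95, 0.2)},
}
result = hierarchical_retrieve((1.0, 0.0), doc_vecs, para_vecs, k_docs=2, k_paras=2)
print(result)
```

The design point is that paragraph-level similarity is computed only inside the documents that survive stage 1, so the paragraph search never scans the whole corpus.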

Relation DAG Manager (Optimization / Search)

Maintains the graph of error propagation and prunes low-quality source nodes based on cumulative error

Model or implementation: Algorithmic Component
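The bookkeeping described here might look roughly like the sketch below. It makes several assumptions that the summary does not specify: a node's cumulative error is its own error count plus that of all its distinct descendants, pruning is a simple threshold, and the class and method names (`RelationDAG`, `add_node`, `cumulative_error`, `keep_sources`) are hypothetical.

```python
from typing import Dict, List, Sequence, Set


class RelationDAG:
    """Toy graph linking source paragraphs to the candidates retrieved from them."""

    def __init__(self) -> None:
        self.errors: Dict[str, int] = {}
        self.children: Dict[str, List[str]] = {}

    def add_node(self, node: str, n_errors: int, sources: Sequence[str] = ()) -> None:
        # Edges only run from existing nodes to a brand-new node,
        # so the graph stays acyclic by construction.
        if node in self.errors:
            raise ValueError(f"node {node!r} already exists")
        for s in sources:
            if s not in self.errors:
                raise KeyError(f"unknown source {s!r}")
        self.errors[node] = n_errors
        self.children[node] = []
        for s in sources:
            self.children[s].append(node)

    def cumulative_error(self, node: str) -> int:
        """Own errors plus errors of every distinct descendant."""
        visited: Set[str] = set()
        stack = [node]
        total = 0
        while stack:
            cur = stack.pop()
            if cur in visited:
                continue  # a node reachable via two paths is counted once
            visited.add(cur)
            total += self.errors[cur]
            stack.extend(self.children[cur])
        return total

    def keep_sources(self, min_cumulative: int) -> Set[str]:
        """Nodes whose cumulative error meets the threshold; the rest are pruned."""
        return {n for n in self.errors if self.cumulative_error(n) >= min_cumulative}


# Toy example with hypothetical error counts.
dag = RelationDAG()
dag.add_node("p1", 2)
dag.add_node("p2", 3, sources=["p1"])
dag.add_node("p3", 0, sources=["p1"])
dag.add_node("p4", 0, sources=["p3"])
dag.add_node("p5", 1, sources=["p2", "p3"])
print(dag.keep_sources(min_cumulative=2))
```

Here `p3` produced no errors itself and its descendants produced only one, so it falls below the threshold and would no longer be used as a retrieval source.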

Target Model (f_close)

The model being red-teamed/evaluated

Model or implementation: Various (e.g., GPT-4o, DeepSeek-V3)

Novel Architectural Elements

Integration of a Relation DAG (Directed Acyclic Graph) within the retrieval loop to track and prune error discovery paths based on cumulative success (error yield).
Feedback loop where the 'search query' for the next batch is explicitly defined by the *failures* of the previous batch.

Modeling

Base Model: Various target models: GPT-4o, GPT-4o-mini, o1-mini, DeepSeek-V3, DeepSeek-R1, Qwen2.5-72B-Instruct, Llama-3.3-70B-Instruct

Compute: Budget is defined by API calls/tokens (e.g., 20,000 API calls). No training is performed; this is an inference-only evaluation framework.

Comparison to Prior Work

vs. ACD: SEA uses an external knowledge base and error-driven retrieval, finding far more errors (a SEA/ACD ratio of up to 55.83 in the Key Results table) than ACD's internal-only search.
vs. AutoBencher: SEA dynamically updates the search direction based on *errors* rather than just retrieving topic-related pages, resulting in higher error rates and lower cost per error.
vs. EvalTree: SEA focuses on large-scale external knowledge retrieval rather than hierarchical capability decomposition [not cited in paper].

Limitations

Relies on the quality of the generator model (GPT-4o) for ground truth; if the generator hallucinates, evaluation may be noisy.
Computational cost is still non-trivial as it requires iterative querying of the target model.
The 'semantic similarity' assumption (that errors cluster semantically) might not hold for all types of knowledge failures.

Reproducibility

Code availability is not explicitly provided in the paper text. The knowledge base is derived from Wikipedia (7.1M documents). Evaluation uses GPT-4o as a fixed generator. Hyperparameters (temperature, thresholds) are specified.

📊 Experiments & Results

Evaluation Setup

Dynamic benchmarking where the system searches for errors in a target LLM using Wikipedia as a knowledge source.

Benchmarks:

Wikipedia Knowledge Base (Factuality / Question Answering) [New]

Metrics:

Number of discovered errors
Error Rate (proportion of questions answered incorrectly)
Cost per Error (API calls or budget unit per error found)
Statistical methodology: Not explicitly reported in the paper
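The metrics above reduce to simple ratios. The helper names and the numbers in the usage lines below are hypothetical and chosen only to show the arithmetic. Whether cost is measured in API calls, tokens, or currency is left open, as in the summary.

```python
def error_rate(n_errors: int, n_questions: int) -> float:
    """Proportion of questions answered incorrectly."""
    if n_questions <= 0:
        raise ValueError("n_questions must be positive")
    return n_errors / n_questions


def cost_per_error(total_cost: float, n_errors: int) -> float:
    """Budget units spent per discovered error; infinite if none were found."""
    return float("inf") if n_errors == 0 else total_cost / n_errors


# Hypothetical run: 100 questions asked, 25 answered incorrectly.
print(error_rate(25, 100))      # fraction of wrong answers
print(cost_per_error(100, 25))  # queries spent per error
```

A method that finds the same number of errors with fewer queries has a lower cost per error, which is the sense in which the cost reductions are reported.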

Key Results

Comparison with Automated Capability Discovery (ACD): total errors found under a fixed budget.
Benchmark	Metric	Baseline (ACD)	This Paper (SEA)	Δ
Wikipedia (Dynamic)	Number of Errors (ratio SEA/ACD)	1.0	55.83	+54.83
Wikipedia (Dynamic)	Cost-per-Error Reduction (×)	1.0	599.0	+598.0

Comparison with AutoBencher: error rate and cost efficiency.
Benchmark	Metric	Baseline (AutoBencher)	This Paper (SEA)	Δ
Wikipedia (Dynamic)	Error Rate (DeepSeek-V3)	0.26	0.42	+0.16
Wikipedia (Dynamic)	Average Error Rate	0.30	0.38	+0.08
Wikipedia (Dynamic)	Cost-per-Error Reduction (×)	1.0	9.0	+8.0
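As a consistency check, the two error-rate rows for the AutoBencher comparison reproduce the percentages quoted in Evaluation Highlights, assuming those percentages are relative improvements in error rate:

```latex
\[
\frac{0.42 - 0.26}{0.26} = \frac{0.16}{0.26} \approx 0.615 \;(61.5\%),
\qquad
\frac{0.38 - 0.30}{0.30} = \frac{0.08}{0.30} \approx 0.267 \;(26.7\%)
\]
```

The 0.26 → 0.42 row therefore matches the 61.5% figure quoted for DeepSeek-V3, and the 0.30 → 0.38 row matches the 26.7% average.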

Experiment Figures

Convergence analysis showing Cumulative Error and Per-step Error over 20 iterations.

Main Takeaways

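SEA consistently discovers more errors than the baselines by retrieving candidates that are semantically similar to past failures, which serves as a gradient-free proxy for ascending the error rate.
The method is cost-effective: it reduces cost per error by 599× relative to ACD and 9× relative to AutoBencher, so fewer queries are needed to find a given number of failures.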
Error analysis reveals strong intra-family correlations (e.g., GPT-4o models share failure patterns), but o1-mini behaves differently.
Models like DeepSeek-V3 struggle on subsets where GPT-4o performs well, highlighting model-specific knowledge gaps.
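One simple way to quantify how far two models share failure patterns is to correlate their per-question error indicators on a shared question set. The sketch below is an assumption about how such an analysis could be run, not the paper's procedure. The vectors are made-up data and the names `model_a`, `model_b`, `model_c` are placeholders.

```python
import math
from typing import Sequence


def pearson(x: Sequence[int], y: Sequence[int]) -> float:
    """Pearson correlation of two equal-length 0/1 error vectors."""
    if len(x) != len(y) or not x:
        raise ValueError("vectors must be non-empty and of equal length")
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation undefined for a constant vector")
    return cov / math.sqrt(vx * vy)


# Made-up indicators: 1 = the model answered that question incorrectly.
model_a = [1, 1, 0, 0, 1, 0]
model_b = [1, 1, 0, 0, 0, 0]  # mostly fails where model_a fails
model_c = [0, 0, 1, 1, 0, 1]  # fails exactly where model_a succeeds
print(round(pearson(model_a, model_b), 3))
print(round(pearson(model_a, model_c), 3))
```

A high positive value suggests shared failure patterns, as reported within a model family. A value near zero or negative suggests the models fail on different parts of the knowledge base.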

📚 Prerequisite Knowledge

Prerequisites

Basic understanding of Large Language Models (LLMs) and hallucination
Vector retrieval and semantic similarity (embeddings)
Stochastic optimization concepts

Key Terms

SEA: Stochastic Error Ascent—the proposed framework for iteratively discovering model errors by optimizing for failure-inducing inputs.

Relation DAG: Relation Directed Acyclic Graph—a graph structure tracking dependencies between source errors and newly retrieved candidates to model error propagation.

mGTE: A specific sentence transformer model used for generating embeddings to calculate semantic similarity.

closed-weight model: An LLM where internal parameters/gradients are inaccessible (e.g., GPT-4), allowing only API-based inference.

ACD: Automated Capability Discovery—a baseline method that uses an LLM's internal knowledge to generate tasks without external retrieval.

AutoBencher: A baseline method that iteratively retrieves pages to build benchmarks but relies on static input topics.