ERBench: An Entity-Relationship based Automatically Verifiable Hallucination Benchmark for Large Language Models

📝 Paper Summary

Hallucination Evaluation, Factual Knowledge Benchmarking

ERBench leverages relational database constraints (functional dependencies and foreign keys) to automatically generate complex, multi-hop questions and verify both the correctness of LLM answers and their underlying rationales.

Core Problem

Existing hallucination benchmarks are either manual (expensive, not scalable) or based on simple knowledge graph triples (simplistic questions), failing to evaluate complex reasoning chains or verify the specific rationale behind an answer.

Why it matters:

LLMs frequently generate correct answers for the wrong reasons (hallucinated rationales), which current benchmarks often miss
Evaluating multi-hop reasoning usually requires expensive human annotation or results in rigid, unmodifiable datasets
Continuous evaluation is difficult because static benchmarks become outdated as facts change, whereas databases are naturally updated

Concrete Example: If asked 'Are Firenze and Florence the same city?', an LLM might answer 'Yes' (correct answer) but justify it by saying 'Because both are in the US' (hallucinated rationale). Standard benchmarks checking only the 'Yes' token would fail to catch this hallucination.

Key Novelty

Database-Driven Benchmark Generation

Uses Functional Dependencies (FDs) to verify rationales: if attributes X determine Y, the LLM must mention the correct intermediate Y values in its reasoning
Uses Foreign Key Constraints (FKCs) to construct multi-hop questions: joining tables allows generating deep questions where intermediate steps are strictly defined by the database schema
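
The two constraint types can be pictured with a small sketch. The schema, table names, and question wording below are hypothetical illustrations, not ERBench's actual schema or templates: a foreign key from `movie.director` to `director.name` supplies the join for a 2-hop question, and the joined-through director name is the intermediate value a correct rationale would have to mention.

```python
import sqlite3

# Hypothetical two-table schema; all names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE director (name TEXT PRIMARY KEY, birth_year INTEGER);
    CREATE TABLE movie (
        title TEXT,
        year INTEGER,
        director TEXT REFERENCES director(name),  -- foreign key used for the join
        PRIMARY KEY (title, year)                  -- implies the FD (title, year) -> director
    );
    INSERT INTO director VALUES ('Christopher Nolan', 1970);
    INSERT INTO movie VALUES ('Inception', 2010, 'Christopher Nolan');
""")

def two_hop_items(conn):
    """Join movie -> director and emit question, answer, and required intermediate values."""
    rows = conn.execute("""
        SELECT m.title, m.year, d.name, d.birth_year
        FROM movie AS m JOIN director AS d ON m.director = d.name
    """).fetchall()
    items = []
    for title, year, name, birth_year in rows:
        question = (f"In which year was the director of the movie "
                    f"'{title}' ({year}) born?")
        items.append({
            "question": question,
            "answer": str(birth_year),
            "intermediate": [name],  # value the rationale must mention
        })
    return items

items = two_hop_items(conn)
```

Because the intermediate value comes straight from the join, no human annotation is needed to know what a sound reasoning chain should contain.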

Architecture

Conceptual workflow of ERBench using a Movie database example. It illustrates how an Entity-Relationship diagram translates to schema/records, then to Functional Dependencies (FDs) and Foreign Keys (FKs), which are finally used to construct verifiable questions.

Evaluation Highlights

Benchmarked 8 major LLMs (including GPT-4, Llama2-70B, Claude-3) across 5 database domains (Movie, Soccer, Airport, Music, Book)
GPT-4 achieves the highest Answer-Rationale Accuracy (AR) but still exhibits significant hallucination in negated questions
ERBench's automated rationale verification matches human inspection with 95.5% accuracy, validating the FD-based approach

Breakthrough Assessment

8/10

A clever, highly scalable approach to a major bottleneck in LLM evaluation (verifying reasoning chains). By piggybacking on existing database structures, it solves the ground-truth generation problem elegantly.

⚙️ Technical Details

Problem Definition

Setting: Evaluation of LLM factual hallucination on two levels: answer correctness and rationale correctness

Inputs: Natural language questions generated from database schemas (binary or multiple-choice)

Outputs: LLM response containing a final answer and a generated explanation (rationale)

Pipeline Flow

Schema Analysis (Extract FDs and FKCs)
Question Generation (Convert records to Binary/Multiple-Choice prompts)
LLM Inference (Get answer + rationale)
Automated Verification (Check answer token + key-phrase matching in rationale)
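
The four stages can be traced end to end in a minimal sketch. Everything here is an assumption made for illustration: the function name `run_item`, the FD (year, star, director) -> title, the prompt wording, and the stubbed model. The real pipeline may parse answers and match rationales differently.

```python
from typing import Callable

def run_item(record: dict, ask_llm: Callable[[str], str]) -> dict:
    """Run one record through the four stages; all names here are hypothetical."""
    # 1. Schema analysis: assume the FD (year, star, director) -> title was extracted.
    determinant, dependent = ("year", "star", "director"), "title"

    # 2. Question generation: the determinant values go into the prompt, so
    #    "Yes" is the expected answer for a record taken from the table.
    year, star, director = (record[attr] for attr in determinant)
    question = (f"Is there a movie, released in {year}, starring {star}, "
                f"where {director} is the director? "
                "Answer Yes, No, or Unsure, then explain your answer.")

    # 3. LLM inference: ask_llm is any callable mapping a prompt to a response.
    response = ask_llm(question)

    # 4. Automated verification: first token for the answer, substring match
    #    on the FD-determined value for the rationale.
    tokens = response.strip().split()
    answer = tokens[0].strip(".,:!").lower() if tokens else ""
    return {
        "question": question,
        "answer_ok": answer == "yes",
        "rationale_ok": record[dependent].lower() in response.lower(),
    }

record = {"title": "Avatar", "year": 2009,
          "star": "Sam Worthington", "director": "James Cameron"}
good = run_item(record, lambda q: "Yes. The movie is Avatar.")
lucky = run_item(record, lambda q: "Yes. The movie is Titanic.")
```

Here `good` passes both checks, while `lucky` has a correct answer token but fails the rationale check, which is the case an answer-only benchmark would miss.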

System Modules

Question Generator

Converts database records into natural language questions using templates based on FDs

Model or implementation: Template-based / LLM-assisted rephrasing
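
A template-based generator might look like the following sketch. The template strings, function names, and distractor titles are assumptions chosen for illustration; they are not taken from the ERBench code base.

```python
import random

# Hypothetical templates built around the FD (year, star, director) -> title.
BINARY = ("Is there a movie, released in {year}, starring {star}, "
          "where {director} is the director?")
MULTI = "Which movie, released in {year}, stars {star} and was directed by {director}?"

def binary_question(record: dict) -> str:
    """Fill the binary template; unused record keys are ignored by str.format."""
    return BINARY.format(**record)

def multiple_choice_question(record: dict, distractors: list, rng: random.Random):
    """Return the prompt text and the label of the correct option."""
    options = [record["title"], *distractors]
    rng.shuffle(options)  # avoid always placing the correct title first
    labels = "ABCDEFGH"[:len(options)]
    lines = [MULTI.format(**record)]
    lines += [f"{label}) {option}" for label, option in zip(labels, options)]
    correct = labels[options.index(record["title"])]
    return "\n".join(lines), correct

record = {"title": "Avatar", "year": 2009,
          "star": "Sam Worthington", "director": "James Cameron"}
question = binary_question(record)
prompt, correct = multiple_choice_question(
    record, ["Titanic", "Aliens", "The Abyss"], random.Random(0))
```

Passing an explicit `random.Random` instance keeps option order reproducible across benchmark runs.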

Constraint Verifier

Checks if the LLM's rationale contains the necessary attribute values defined by the FDs

Model or implementation: String matching / Entity Resolution heuristics
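
As a rough sketch of string-matching verification, the check below normalizes case, punctuation, and diacritics before looking for each FD-determined value in the rationale. The function names and the `aliases` table are hypothetical; the heuristics ERBench actually uses for entity resolution may differ.

```python
import re
import unicodedata

def normalize(text: str) -> str:
    """Casefold, strip diacritics and punctuation, and collapse whitespace."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^\w\s]", " ", text.casefold())
    return " ".join(text.split())

def rationale_mentions(rationale: str, required_values, aliases=None) -> bool:
    """True if every required value (or one of its aliases) appears as whole words."""
    aliases = aliases or {}
    haystack = f" {normalize(rationale)} "
    for value in required_values:
        candidates = [value, *aliases.get(value, [])]
        if not any(f" {normalize(c)} " in haystack for c in candidates):
            return False
    return True

ok = rationale_mentions("Yes, that movie is AVATAR.", ["Avatar"])
wrong = rationale_mentions("Yes, that movie is Titanic.", ["Avatar"])
alias_ok = rationale_mentions("The city is Firenze.", ["Florence"],
                              aliases={"Florence": ["Firenze"]})
```

Padding both strings with spaces gives whole-word matching, so a value such as "Up" does not match inside "Supper". Mentions not covered by the alias table are still missed, which is the limitation noted for heuristic entity resolution.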

Novel Architectural Elements

Utilization of database integrity constraints (FDs and FKCs) as a ground-truth mechanism for verifying natural language reasoning chains

Modeling

Base Model: No single base model; the evaluated models are GPT-3.5, GPT-4, Llama2-70B-Chat, Gemini-Pro, Claude-3-Sonnet, and Mistral-7B-Instruct

Comparison to Prior Work

vs. Head-to-Tail: ERBench evaluates the *rationale* (reasoning process), not just the final answer correctness
vs. Knowledge Graph Benchmarks: ERBench supports complex multi-hop questions via joins and ensures data integrity via database constraints, whereas KG questions are often simplistic triples

Limitations

Relies on the underlying database integrity constraints being correct; if the DB has errors, the benchmark has errors
Entity resolution in rationale verification is heuristic-based and may miss valid but distinct entity mentions
Requires access to structured databases with well-defined schemas, which may not exist for all domains
The 'Unsure' option relies on model instruction following, which varies by model capability

Reproducibility

Code: https://github.com/microsoft/ERBench

The code is publicly available at https://github.com/microsoft/ERBench. The benchmark uses 5 public datasets (Movie, Soccer, Airport, Music, Book). The specific FDs used for verification are listed in the paper. LLM API versions (e.g., the specific GPT-4 snapshot) are not explicitly detailed; only standard model names are provided.

📊 Experiments & Results

Evaluation Setup

Zero-shot QA with requirement to explain reasoning

Benchmarks:

ERBench (custom, new): factual QA with binary and multiple-choice questions

Metrics:

Answer Accuracy (A)
Rationale Accuracy (R)
Answer-Rationale Accuracy (AR)
Hallucination Rate (H)
Statistical methodology: Not explicitly reported in the paper
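
The four metrics can be computed from per-response labels as in the sketch below. It assumes every metric is a fraction of all responses and that each response is labelled `correct`, `incorrect`, or `unsure`. The field names are hypothetical, and the paper's exact denominators should be checked before relying on this.

```python
def score(results: list) -> dict:
    """Compute A, R, AR, and H as fractions of all responses (assumed denominator)."""
    n = len(results)
    if n == 0:
        raise ValueError("no results to score")
    answer_correct = sum(r["answer"] == "correct" for r in results)
    rationale_correct = sum(r["rationale_ok"] for r in results)
    both_correct = sum(r["answer"] == "correct" and r["rationale_ok"] for r in results)
    # 'unsure' responses are neither correct nor counted as hallucinations.
    hallucinated = sum(r["answer"] == "incorrect" for r in results)
    return {
        "A": answer_correct / n,
        "R": rationale_correct / n,
        "AR": both_correct / n,
        "H": hallucinated / n,
    }

results = [
    {"answer": "correct", "rationale_ok": True},     # true knowledge
    {"answer": "correct", "rationale_ok": False},    # right answer, wrong reason
    {"answer": "incorrect", "rationale_ok": False},  # hallucination
    {"answer": "unsure", "rationale_ok": False},     # admitted uncertainty
]
metrics = score(results)
```

On these four toy responses, A is 0.5 while AR is 0.25: the gap between the two is the share of correct answers that lack a verified rationale.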

Key Results

Benchmark	Metric	Baseline	This Paper	Δ

Verification of ERBench's automated rationale checking against human inspection (values in %):
ERBench (Internal Audit)	Verification Correctness	100	95.5	-4.5

Performance of major LLMs on single-hop binary questions across 5 datasets (Movie, Soccer, Airport, Music, Book):
ERBench (Binary Questions)	Answer Accuracy (A)	0.582	0.771	+0.189
ERBench (Binary Questions)	Rationale Accuracy (R)	0.222	0.729	+0.507
ERBench (Binary Questions)	Hallucination Rate (H)	0.456	0.198	-0.258

Performance on multi-hop questions (Movie and Soccer datasets), testing reasoning chains:
ERBench (2-hop Movie)	Answer-Rationale Accuracy (AR)	0.339	0.627	+0.288

Main Takeaways

Rationale accuracy (R) is consistently lower than answer accuracy (A), indicating models frequently get the right answer for the wrong reasons (hallucinated rationales).
Models like Llama2 and Mistral often default to 'No' answers, artificially inflating their performance on negated questions while failing basic factual recall.
Multi-hop questions reveal a significant performance drop compared to single-hop, as errors snowball through the reasoning chain.
ERBench effectively distinguishes between 'lucky guesses' and true knowledge by enforcing rationale verification via functional dependencies.

📚 Prerequisite Knowledge

Prerequisites

Relational Database concepts (Schema, Entity-Relationship model)
Basic LLM prompting strategies (Chain-of-Thought, Few-Shot)

Key Terms

Functional Dependency (FD): A constraint in a database where the value of one set of attributes (X) uniquely determines the value of another set of attributes (Y)
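
To make the definition concrete, the sketch below checks whether an FD X -> Y holds in a set of rows by looking for any X value paired with more than one Y value. The function name and the sample rows are illustrative assumptions.

```python
from collections import defaultdict

def fd_violations(rows: list, lhs: tuple, rhs: tuple) -> dict:
    """Return the X values that map to more than one Y value (empty if the FD holds)."""
    seen = defaultdict(set)
    for row in rows:
        key = tuple(row[attr] for attr in lhs)
        seen[key].add(tuple(row[attr] for attr in rhs))
    return {key: values for key, values in seen.items() if len(values) > 1}

rows = [
    {"title": "Inception", "year": 2010, "director": "Christopher Nolan"},
    {"title": "Avatar", "year": 2009, "director": "James Cameron"},
]
holds = fd_violations(rows, ("title", "year"), ("director",))

# A conflicting row makes (title, year) -> director fail.
bad_rows = rows + [{"title": "Avatar", "year": 2009, "director": "Someone Else"}]
broken = fd_violations(bad_rows, ("title", "year"), ("director",))
```

An empty result means the FD holds on the given rows, which is the precondition for using the Y value as the expected rationale content.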

Foreign Key Constraint (FKC): A constraint requiring that a field (or set of fields) in one table reference the primary key of another table, ensuring referential integrity and allowing the tables to be joined

Entity-Relationship (ER) Model: A data model that describes a database in terms of entities (objects) and relationships between them

Multi-hop question: A question that requires multiple steps of reasoning or retrieving information from connected pieces of data to answer

Hallucination Rate (H): The proportion of responses that are incorrect, excluding those where the LLM explicitly admits uncertainty

Rationale Accuracy (R): A metric measuring whether the LLM's explanation contains the correct intermediate values required to derive the answer

Answer-Rationale Accuracy (AR): A strict metric requiring both the final answer to be correct AND the rationale to contain the correct inferred values

Snowball effect: The phenomenon where an error in an early step of a multi-step reasoning process leads to compounding errors in subsequent steps