Drowzee: Metamorphic Testing for Fact-Conflicting Hallucination Detection in Large Language Models

📝 Paper Summary

Hallucination detection | Model testing/evaluation | Fact-conflicting hallucination (FCH)

Drowzee automatically generates diverse test cases by mutating factual knowledge using logic programming rules and detects hallucinations by verifying the logical consistency of LLM reasoning against ground truth.

Core Problem

Detecting Fact-Conflicting Hallucination (FCH) is difficult because manually maintaining up-to-date benchmarks is labor-intensive, and validating the reasoning behind LLM answers (not just the final output) is inherently complex.

Why it matters:

Static manual benchmarks rapidly become obsolete as knowledge evolves, limiting detection adaptability and scalability
LLMs may produce correct final answers via faulty reasoning (false understanding), masking underlying hallucination tendencies that pose security risks
Existing naive detection methods (like string matching) struggle to verify complex logical relations in generated text

Concrete Example: When asked whether Haruki Murakami has won a Nobel Prize, a model might correctly answer 'No' yet justify it with a hallucinated claim (e.g., 'He won in 2016') that contradicts both the facts and its own answer. Simple answer matching would miss this; Drowzee's logic-based check on the reasoning detects the inconsistency.

Key Novelty

Logic-Programming-Aided Metamorphic Testing for FCH

Uses logic programming (Prolog-style rules) to transform seed facts (e.g., 'A is bigger than B') into new, complex test cases (e.g., 'Is B smaller than A?') with automatically derived ground truth.
Validates LLM outputs not by exact string matching, but by extracting semantic structures and comparing their logical relationships to the ground truth using specialized 'oracles'.
Automates the entire pipeline from knowledge crawling to test case generation and result verification, removing the need for human annotation.
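The fact-mutation idea in the first point can be illustrated with a small sketch. It derives new facts from seed triples with two hand-written rules (inverse and transitive) and renders each derived fact as a yes/no question whose ground truth is known by construction. The relation names, the `INVERSE` table, and the question template are assumptions made for illustration; Drowzee itself expresses such rules in Prolog, and this Python version only mimics the idea.

```python
from itertools import product

# Seed facts as (subject, relation, object) triples. Illustrative data only.
SEED_FACTS = {
    ("A", "bigger_than", "B"),
    ("B", "bigger_than", "C"),
}

# Hypothetical inverse-relation table (not taken from the paper).
INVERSE = {"bigger_than": "smaller_than"}
# Relations assumed to be transitive for this sketch.
TRANSITIVE = {"bigger_than"}


def apply_inverse(facts):
    """bigger_than(X, Y) implies smaller_than(Y, X)."""
    return {(o, INVERSE[r], s) for s, r, o in facts if r in INVERSE}


def apply_transitive(facts):
    """r(X, Y) and r(Y, Z) imply r(X, Z) for transitive r (one step only)."""
    derived = set()
    for (s1, r1, o1), (s2, r2, o2) in product(facts, repeat=2):
        if r1 == r2 and r1 in TRANSITIVE and o1 == s2 and s1 != o2:
            derived.add((s1, r1, o2))
    return derived - facts


def to_question(fact):
    """Render a derived fact as a yes/no question; the expected answer is 'Yes'."""
    s, r, o = fact
    return f"Is {s} {r.replace('_', ' ')} {o}?", "Yes"


derived_facts = apply_inverse(SEED_FACTS) | apply_transitive(SEED_FACTS)
test_cases = sorted(to_question(f) for f in derived_facts)

for question, expected in test_cases:
    print(question, "->", expected)
```

Because every derived fact follows from the seeds by a sound rule, the expected answer needs no human annotation, which is the property the pipeline relies on.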

Architecture

Conceptual workflow of Drowzee compared to manual benchmarking. Shows the transformation of seed facts into complex questions via logic programming.

Evaluation Highlights

Detected hallucination rates ranging from 24.7% to 59.8% across six major LLMs (including GPT-4 and Llama-2)
Identified that lack of logical reasoning capability is the primary contributor to Fact-Conflicting Hallucination (FCH) issues in LLMs
Demonstrated that model editing on identified hallucinations (fewer than 1000 edits) effectively mitigates specific FCH instances on a small scale

Breakthrough Assessment

8/10

A significant methodological step toward automating hallucination benchmarks. By moving from static Q&A to logic-based dynamic generation, it addresses the stale-benchmark problem, though it still depends on an external knowledge base (Wikipedia).

⚙️ Technical Details

Problem Definition

Setting: Automated testing of Large Language Models for Fact-Conflicting Hallucination (FCH)

Inputs: Seed factual knowledge (entity-relation triples) crawled from Wikipedia

Outputs: Hallucination detection results (Boolean) and generated test cases with ground truth

Pipeline Flow

Knowledge Crawler (fetches facts)
Test Case Generator (applies logic rules to mutate facts)
LLM Querying (prompts models for answers + reasoning)
Result Verifier (Semantic-aware oracles check consistency)
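The four stages above can be wired together roughly as follows. This is a structural sketch only: the stage functions are passed in as callables, and the `demo_*` stubs (including the canned LLM reply and the naive prefix-matching verifier) are hypothetical placeholders, not Drowzee's actual components.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

Triple = tuple[str, str, str]


@dataclass
class GeneratedCase:
    question: str
    expected_answer: str


@dataclass
class Verdict:
    case: GeneratedCase
    response: str
    hallucinated: bool


def run_pipeline(
    crawl: Callable[[], Iterable[Triple]],
    generate: Callable[[Iterable[Triple]], Iterable[GeneratedCase]],
    query_llm: Callable[[str], str],
    verify: Callable[[GeneratedCase, str], bool],
) -> list[Verdict]:
    """Crawl facts, mutate them into test cases, query the model, verify each reply."""
    verdicts = []
    for case in generate(crawl()):
        response = query_llm(case.question)
        verdicts.append(Verdict(case, response, not verify(case, response)))
    return verdicts


# --- Hypothetical stand-ins so the sketch runs end to end ---
def demo_crawl():
    return [("A", "bigger_than", "B")]


def demo_generate(facts):
    for s, _relation, o in facts:
        yield GeneratedCase(f"Is {o} smaller than {s}?", "Yes")


def demo_llm(question):
    return "No. B is bigger than A."  # canned, deliberately wrong reply


def demo_verify(case, response):
    # Placeholder for the semantic-aware oracle: plain prefix matching.
    return response.strip().lower().startswith(case.expected_answer.lower())


verdicts = run_pipeline(demo_crawl, demo_generate, demo_llm, demo_verify)
print([(v.case.question, v.hallucinated) for v in verdicts])
```

Keeping the stages as injected callables makes it easy to swap the stub verifier for a reasoning-aware one without touching the loop.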

System Modules

Knowledge Crawler (Data Preparation)

Harvests seed facts from Wikipedia to build the initial knowledge base

Model or implementation: Not applicable

Test Case Generator (Data Preparation)

Transforms seed facts into new facts and question-answer pairs using logic reasoning rules

Model or implementation: Prolog-based Logic Engine

Result Verifier

Validates LLM reasoning by comparing semantic structures of the output against the ground truth

Model or implementation: Semantic-aware Metamorphic Oracles

Novel Architectural Elements

Integration of logic programming rules (such as transitive, inverse, and composite rules) directly into the test case generation pipeline to create 'mutant' inputs with known ground truth
Dual semantic-aware oracle design that validates the reasoning process (via justification analysis) rather than just the final answer
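To show why checking the justification matters, the sketch below flags a response whose final answer is right but whose reasoning asserts a fact known to be false, as in the Murakami example earlier. It assumes the response has already been parsed into an answer plus a set of asserted triples (the extraction step is not shown), and the split into an answer check and a reasoning check is this sketch's own simplification, not necessarily how the paper's two oracles are divided.

```python
from dataclasses import dataclass

Triple = tuple[str, str, str]


@dataclass(frozen=True)
class GroundTruth:
    answer: str                      # expected yes/no answer
    false_facts: frozenset           # triples known to be false


@dataclass(frozen=True)
class ParsedResponse:
    answer: str                      # the model's yes/no answer
    justification: frozenset         # triples asserted in its reasoning


def answer_oracle(truth: GroundTruth, response: ParsedResponse) -> bool:
    """Pass if the final answer matches the expected one."""
    return truth.answer.strip().lower() == response.answer.strip().lower()


def reasoning_oracle(truth: GroundTruth, response: ParsedResponse) -> bool:
    """Pass if the justification asserts no fact known to be false."""
    return not (response.justification & truth.false_facts)


def is_hallucination(truth: GroundTruth, response: ParsedResponse) -> bool:
    """Flag the response if either check fails."""
    return not (answer_oracle(truth, response) and reasoning_oracle(truth, response))


murakami_won = ("Haruki Murakami", "won", "Nobel Prize")
truth = GroundTruth(answer="No", false_facts=frozenset({murakami_won}))

# Correct final answer, but the reasoning claims he won.
response = ParsedResponse(answer="No", justification=frozenset({murakami_won}))

print(answer_oracle(truth, response))     # answer alone looks fine
print(is_hallucination(truth, response))  # reasoning check exposes the conflict
```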

Modeling

Base Model: Evaluated on 6 LLMs: GPT-4, GPT-3.5-turbo, Llama-2-7b-chat, Llama-2-13b-chat, Vicuna-7b-v1.5, Vicuna-13b-v1.5

Limitations

Reliance on the accuracy of the source knowledge base (Wikipedia); if the source is wrong, the ground truth is wrong
Logic rules implemented (5 types) may not cover all possible types of logical reasoning or linguistic complexity
The semantic-aware oracles themselves might have edge cases where they misinterpret highly ambiguous LLM outputs

Reproducibility

Code: https://github.com/ningke-li/Drowzee

📊 Experiments & Results

Evaluation Setup

Zero-shot question answering, with each prompt also requesting the model's reasoning/justification

Benchmarks:

Drowzee Generated Benchmark (Fact verification and reasoning) [New]

Metrics:

Hallucination Rate (HR)
Pass Rate (non-hallucinated)
Statistical methodology: Not explicitly reported in the paper
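For reference, the two metrics listed above can be computed from per-test-case verdicts as sketched here. The summary does not spell out the formulas, so treating Pass Rate as the complement of Hallucination Rate is an assumption, and the example verdicts are made-up numbers, not results from the paper.

```python
def hallucination_rate(verdicts):
    """Fraction of test cases flagged as hallucinations (True = hallucinated)."""
    verdicts = list(verdicts)
    if not verdicts:
        raise ValueError("need at least one verdict")
    return sum(verdicts) / len(verdicts)


def pass_rate(verdicts):
    """Assumed complement of the hallucination rate."""
    return 1.0 - hallucination_rate(verdicts)


# Illustrative verdicts for eight test cases (three flagged).
example_verdicts = [True, False, False, True, False, False, True, False]

hr = hallucination_rate(example_verdicts)
print(f"HR = {hr:.1%}, pass rate = {pass_rate(example_verdicts):.1%}")
```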

Key Results

Evaluation across six LLMs reveals significant hallucination rates when tested against Drowzee's logic-based test cases.
Benchmark	Metric	Baseline	This Paper	Δ
Drowzee Benchmark	Hallucination Rate (range, %)	0.0	24.7 to 59.8	+24.7 to +59.8

Experiment Figures

Example of Prolog-based logic programming structure used for data mutation

Main Takeaways

LLMs struggle significantly with temporal concepts and out-of-distribution knowledge, which are frequent triggers for hallucinations
Lack of logical reasoning capability is identified as the primary cause of FCH, rather than just simple knowledge retrieval failures
Logic-based test case generation is highly effective at triggering hallucinations that might remain dormant in simple fact-retrieval benchmarks
Model editing showed promise for fixing specific facts on a small scale (<1000 edits), suggesting a path for mitigation

📚 Prerequisite Knowledge

Prerequisites

Basic logic programming concepts (facts, rules, predicates)
Understanding of Metamorphic Testing
Familiarity with LLM prompting and generation

Key Terms

FCH: Fact-Conflicting Hallucination, where an LLM generates content that contradicts established real-world facts

Metamorphic Testing: A testing technique that verifies systems by checking if changes to inputs produce expected changes in outputs based on defined relations (metamorphic relations)

Oracle: A mechanism in software testing that determines whether the output of a system is correct (pass/fail)

Logic Programming: A programming paradigm based on formal logic (like Prolog), where programs consist of facts and rules to infer conclusions

Predicate: A logic statement expressing a relation (e.g., bigger(X, Y)) used to define facts and rules

Horn clause: A logical formula widely used in logic programming: a disjunction of literals with at most one positive literal, usually written as a rule in which a head predicate is implied by a conjunction of body predicates

Model Editing: Techniques to directly modify specific knowledge within a trained model's weights without full retraining
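To make the Predicate, Horn clause, and Logic Programming entries above concrete, here is a minimal forward-chaining sketch over Horn clauses. It follows the Prolog convention that arguments starting with an uppercase letter are variables; the `HornClause` class and helper names are invented for this illustration, and a real Prolog engine works differently (backward chaining with full unification).

```python
from dataclasses import dataclass

Atom = tuple  # (predicate, arg1, arg2, ...); uppercase-initial args are variables


@dataclass(frozen=True)
class HornClause:
    head: Atom   # the single conclusion
    body: tuple  # atoms that must all hold


def is_var(term):
    return term[:1].isupper()


def match(pattern, fact, bindings):
    """Extend bindings so pattern equals the ground fact, or return None."""
    if len(pattern) != len(fact) or pattern[0] != fact[0]:
        return None
    new = dict(bindings)
    for p, f in zip(pattern[1:], fact[1:]):
        if is_var(p):
            if new.setdefault(p, f) != f:
                return None
        elif p != f:
            return None
    return new


def solve(body, facts, bindings):
    """Yield every variable binding that satisfies all atoms in body."""
    if not body:
        yield bindings
        return
    for fact in facts:
        extended = match(body[0], fact, bindings)
        if extended is not None:
            yield from solve(body[1:], facts, extended)


def forward_chain(facts, clauses):
    """Apply the clauses repeatedly until no new fact can be derived."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for clause in clauses:
            # list(...) finishes the search before `known` is modified.
            for b in list(solve(clause.body, known, {})):
                new_fact = (clause.head[0],) + tuple(b.get(t, t) for t in clause.head[1:])
                if new_fact not in known:
                    known.add(new_fact)
                    changed = True
    return known


facts = {("bigger", "a", "b"), ("bigger", "b", "c")}
rules = [
    # smaller(Y, X) :- bigger(X, Y).
    HornClause(("smaller", "Y", "X"), (("bigger", "X", "Y"),)),
    # bigger(X, Z) :- bigger(X, Y), bigger(Y, Z).
    HornClause(("bigger", "X", "Z"), (("bigger", "X", "Y"), ("bigger", "Y", "Z"))),
]

closure = forward_chain(facts, rules)
print(sorted(closure - facts))
```

Each rule has exactly one head atom, which is what makes it a (definite) Horn clause; the derived facts are the kind of mutated knowledge from which test questions can then be generated.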