WildHallucinations: Evaluating Long-form Factuality in LLMs with Real-World Entity Queries

📝 Paper Summary

Hallucination evaluation, Factuality benchmarking

WildHallucinations evaluates LLM factuality using 7,919 real-world entities mined from user chat logs, revealing that models hallucinate significantly more on entities without Wikipedia pages.

Core Problem

Existing factuality benchmarks rely heavily on Wikipedia entities, failing to cover the diverse, niche, and non-Wikipedia topics that real users actually ask chatbots about.

Why it matters:

Benchmarks based solely on Wikipedia overestimate model performance because training data often over-represents Wikipedia content
Real-world users seek information on niche entities in domains such as people and finance, where hallucination rates are higher and errors are more dangerous
Current evaluations do not account for the long tail of rare entities where models are most prone to failure

Concrete Example: A user might ask about a specific financial entity or a person without a Wikipedia page. While an LLM might recite a Wikipedia biography perfectly, it may hallucinate details about this lesser-known entity because it lacks the strong parametric knowledge present for Wikipedia topics.

Key Novelty

Evaluation on 'Wild' Entities via Automated Fact-Checking

Extracts entities directly from real-world user-chatbot conversations (WildChat) rather than curating from Wikipedia
Constructs ad-hoc knowledge sources by scraping top-10 web search results for each entity, enabling verification of non-Wikipedia subjects
Evaluates factuality using atomic claim decomposition (FActScore) against these retrieved web documents

Architecture

The data construction and evaluation pipeline for WildHallucinations.

Evaluation Highlights

GPT-4o and GPT-3.5 achieve the highest WildFActScore-Strict, outperforming other models by ~6 percentage points
Retrieval-augmented models (Sonar-Large) can perform worse than their base models (Llama-3-70B) despite having access to web search
Factuality drops significantly for all models on entities without Wikipedia pages; GPT-4o shows a major performance gap between Wiki and non-Wiki entities

Breakthrough Assessment

8/10

Strong contribution by shifting evaluation to real-world usage distributions. Highlighting the Wiki vs. non-Wiki gap is critical for understanding true model reliability.

⚙️ Technical Details

Problem Definition

Setting: Open-ended entity description generation and automated fact-checking

Inputs: Prompt: 'In a paragraph, could you tell me what you know about [entity]?'

Outputs: A generated paragraph describing the entity, which is then decomposed into atomic claims for verification

Pipeline Flow

Entity Extraction (GPT-3.5/GPT-4o filtering)
Knowledge Source Construction (Google Search)
Model Generation (Inference)
Automated Fact-Checking (FActScore adaptation)
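
The last three stages above can be sketched as a thin orchestration function. This is a minimal illustration, not the authors' code: the callables `search`, `generate`, `decompose`, and `verify` are hypothetical stand-ins for the web search, the evaluated LLM, and the FActScore-style decomposition and verification steps, and entity extraction is assumed to have happened upstream. Only the prompt template is taken from the summary.

```python
from dataclasses import dataclass
from typing import Callable, List

# Prompt template quoted from the problem definition.
PROMPT_TEMPLATE = "In a paragraph, could you tell me what you know about {entity}?"


@dataclass
class EntityResult:
    entity: str
    response: str
    claims: List[str]
    supported: List[bool]  # one flag per claim, aligned with `claims`


def evaluate_entity(
    entity: str,
    search: Callable[[str], List[str]],          # hypothetical: entity -> web documents
    generate: Callable[[str], str],              # hypothetical: prompt -> model response
    decompose: Callable[[str], List[str]],       # hypothetical: response -> atomic claims
    verify: Callable[[str, List[str]], bool],    # hypothetical: claim + documents -> supported?
) -> EntityResult:
    """Run knowledge construction, generation, and fact-checking for one entity."""
    documents = search(entity)
    response = generate(PROMPT_TEMPLATE.format(entity=entity))
    claims = decompose(response)
    supported = [verify(claim, documents) for claim in claims]
    return EntityResult(entity, response, claims, supported)
```

Passing the stages in as callables keeps the sketch independent of any particular search API or model client, so each stage can be replaced by a stub when testing.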

System Modules

Entity Extractor (Data Construction)

Identify proper nouns from WildChat user turns and filter out common nouns or ambiguous terms

Model or implementation: GPT-3.5 and GPT-4o

Knowledge Builder (Data Construction)

Create a reference knowledge base for each entity since many lack Wikipedia pages

Model or implementation: Google Custom Search JSON API
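
As a rough sketch of what a request to the Google Custom Search JSON API looks like, the helper below builds the query URL for one entity. The function name `build_search_url` is hypothetical, and using the bare entity name as the query is an assumption; the summary does not state the exact query string or any result filtering the authors applied. No network call is made here.

```python
from urllib.parse import urlencode

# Public endpoint of the Custom Search JSON API.
CSE_ENDPOINT = "https://www.googleapis.com/customsearch/v1"


def build_search_url(entity: str, api_key: str, engine_id: str, num_results: int = 10) -> str:
    """Build a request URL asking for up to `num_results` hits for `entity`."""
    # The API returns at most 10 results per request.
    if not 1 <= num_results <= 10:
        raise ValueError("num_results must be between 1 and 10")
    params = {
        "key": api_key,    # API key (placeholder supplied by the caller)
        "cx": engine_id,   # Programmable Search Engine ID (placeholder)
        "q": entity,       # assumption: query is the bare entity name
        "num": num_results,
    }
    return f"{CSE_ENDPOINT}?{urlencode(params)}"
```

The JSON response lists hits under `items`, each with a `link` field; those links would then be fetched and scraped to form the per-entity knowledge source.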

Generator

Generate a descriptive paragraph about the entity

Model or implementation: Various LLMs (e.g., Llama-3, GPT-4o, Command R)

Fact Checker

Verify the generated text against the retrieved web pages

Model or implementation: FActScore pipeline

Novel Architectural Elements

Dynamic knowledge source construction via live web search for evaluation ground truth (as opposed to static Wikipedia dumps)
WildFActScore-Strict metric designed to penalize both hallucinations and excessive abstention simultaneously

Modeling

Base Model: Evaluated 15 models including Llama-3 (8B/70B), Mistral, Mixtral, Command R/R+, Gemini 1.5, Claude 3, GPT-3.5/4o, Perplexity Sonar

Comparison to Prior Work

vs. FActScore: WildHallucinations uses real web search results as ground truth, which lets it cover entities without Wikipedia pages (52% of its entities)
vs. Standard Benchmarks: Mined from real user logs (WildChat) rather than academic curation, reflecting actual usage distributions

Limitations

Dependency on commercial search engines (Google API) for ground truth construction
Evaluation costs are high due to using LLMs for atomic fact decomposition and verification
Knowledge source reliability assumes top-10 search results are factual and sufficient
Strict metric (WildFActScore-Strict) may penalize justifiable abstention on truly unknowable entities

Reproducibility

Code: https://huggingface.co/datasets/wentingzhao/WildHallucinations

The data and benchmark are publicly available on Hugging Face. The link above points to the dataset; the main text does not give a separate repository URL for the evaluation pipeline code. Evaluation relies on proprietary models (GPT-4o) and search APIs (Google).

📊 Experiments & Results

Evaluation Setup

Single-turn entity description generation

Benchmarks:

WildHallucinations (Long-form factuality generation) [New]

Metrics:

WildFActScore (percentage of supported atomic facts in non-abstained responses)
WildFActScore-Strict (1 if all facts correct, 0 if any wrong or abstained)
Statistical methodology: Not explicitly reported in the paper
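
The two metric definitions above can be made concrete with a short sketch. It assumes each response has already been reduced to a list of per-claim support flags, with `None` marking an abstention; treating a response with no extracted claims like an abstention is also an assumption, as are the function names. Scores are returned as fractions in [0, 1] rather than percentages.

```python
from typing import List, Optional, Sequence

# One entry per response: per-claim support flags, or None if the model abstained.
Judgments = Sequence[Optional[List[bool]]]


def wild_factscore(judgments: Judgments) -> float:
    """Mean fraction of supported claims, averaged over non-abstained responses only."""
    answered = [flags for flags in judgments if flags]  # drops None and empty lists
    if not answered:
        return float("nan")
    return sum(sum(flags) / len(flags) for flags in answered) / len(answered)


def wild_factscore_strict(judgments: Judgments) -> float:
    """Fraction of all responses that are answered and have every claim supported."""
    if not judgments:
        return float("nan")
    perfect = sum(1 for flags in judgments if flags and all(flags))
    return perfect / len(judgments)
```

For example, a model that answers one of four entities perfectly and abstains on the rest has `[[True], None, None, None]`, which gives a WildFActScore of 1.0 but a WildFActScore-Strict of 0.25. This is the mechanism by which frequent abstention can raise the first metric while lowering the second.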

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Overall performance comparison showing closed-source models generally outperforming open weights, with unexpected results for RAG models.
WildHallucinations	WildFActScore-Strict	0.45	0.55	+0.10
WildHallucinations	WildFActScore-Strict	0.45	0.38	-0.07
Domain-specific analysis reveals significant variance in difficulty.
WildHallucinations	WildFActScore (Computing)	78.4	95.5	+17.1
WildHallucinations	WildFActScore-Strict	0.68	0.50	-0.18

Experiment Figures

Distribution of entities across domains (left) and perplexity/frequency (right).

Performance comparison (WildFActScore-Strict) on entities with vs. without Wikipedia pages.

Main Takeaways

Models consistently hallucinate more on entities without Wikipedia pages, suggesting over-reliance on Wikipedia in training data
Retrieval (RAG) helps reduce hallucinations compared to no retrieval, but is not a silver bullet; some RAG models (Sonar) underperform their base models
Performance varies heavily by domain: Computing/Geography are 'easier' (high scores), while People/Finance are 'harder' (high hallucination rates)
Claude 3 Haiku achieves high standard WildFActScore by abstaining frequently, but performs poorly on WildFActScore-Strict
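
The Wikipedia and domain breakdowns in the takeaways above amount to averaging a per-entity score within groups. The helper below is an illustrative sketch only: `mean_by_group` is a hypothetical name, and the group labels are whatever the caller chooses (for instance "wiki" vs. "non_wiki", or a domain name).

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, Tuple


def mean_by_group(rows: Iterable[Tuple[Hashable, float]]) -> Dict[Hashable, float]:
    """Average a per-entity score (e.g. a 0/1 strict score) within each group label."""
    totals: Dict[Hashable, float] = defaultdict(float)
    counts: Dict[Hashable, int] = defaultdict(int)
    for group, score in rows:
        totals[group] += score
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}


# Made-up toy scores for illustration, not results from the paper.
toy_rows = [("wiki", 1), ("wiki", 0), ("wiki", 1),
            ("non_wiki", 0), ("non_wiki", 0), ("non_wiki", 1)]
toy_means = mean_by_group(toy_rows)
```

The gap reported for a model is then the difference between the two group means, for example `toy_means["wiki"] - toy_means["non_wiki"]` on the toy data.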

📚 Prerequisite Knowledge

Prerequisites

Understanding of LLM hallucination and factuality
Familiarity with RAG (Retrieval-Augmented Generation)
Knowledge of automated evaluation metrics for text generation

Key Terms

FActScore: A metric that breaks long text into atomic claims and verifies each claim against a knowledge source to calculate a factuality percentage

WildChat: A dataset of 1 million real-world user-chatbot interactions used as the source for mining entities

atomic claim: A single, indivisible piece of information extracted from a longer text (e.g., 'Obama was born in Hawaii') used for precise verification

perplexity: A measurement of how surprised a model is by a sequence of text; used here to approximate how rare an entity is (higher perplexity = rarer entity)

RAG: Retrieval-Augmented Generation; systems that answer a query by first retrieving relevant documents and then generating a response conditioned on them

WildFActScore-Strict: A stricter metric defined in this paper that assigns a score of 1 only if ALL atomic facts are correct, and 0 if ANY fact is wrong or the model abstains

WildFActScore: The percentage of atomic facts in a response supported by the knowledge source, averaged only over non-abstaining responses

abstention: When a model refuses to answer a query (e.g., 'I don't know about this person') rather than generating potentially false information

Google Custom Search JSON API: A tool used to programmatically retrieve search results from Google, used here to build knowledge sources for entities
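
To make the perplexity definition in the key terms concrete, the sketch below computes it from per-token log-probabilities as the exponential of the negative mean log-probability. It assumes natural-log probabilities from some language model; which model and tokenization the authors used to score entities is not stated in this summary.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """exp of the negative mean token log-probability (natural log assumed)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))
```

Under this definition, a sequence whose tokens each receive probability 0.5 has perplexity 2, while one whose tokens each receive probability 0.1 has perplexity 10, so less predictable (rarer) entity names score higher.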