CoVe: Chain-of-verification reduces hallucination in LLMs

📝 Paper Summary

Hallucination suppression, Self-Correction/Reasoning

Chain-of-Verification (CoVe) reduces hallucinations by prompting LLMs to plan verification questions, answer them independently to avoid bias, and produce a revised response based on those checks.

Core Problem

Large Language Models (LLMs) often generate plausible but incorrect factual information (hallucinations), particularly for rare facts or in longform generation where exposure bias exacerbates errors.

Why it matters:

Scaling model size or data does not fully resolve hallucination, especially for tail distribution facts.
When asked to verify their own output, models often repeat the same hallucinations if the verification step attends to the original incorrect response.

Concrete Example: When asked to list politicians born in NY, ChatGPT generates a list including Hillary Clinton (incorrect). CoVe generates the question 'Where was Hillary Clinton born?', answers 'Chicago' independently, and removes her from the final list.
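The example above can be traced with a toy script. This is an illustration only: the "model" is a hard-coded lookup rather than an LLM, the names other than Hillary Clinton are added here for contrast, and `revise` is a hypothetical helper, not something defined in the paper.

```python
# Toy illustration: no real LLM is called anywhere in this snippet.
# Draft answer to "list politicians born in NY" (only Hillary Clinton
# comes from the example above; the other names are added for contrast).
draft = ["Hillary Clinton", "Donald Trump", "Theodore Roosevelt"]

# Hypothetical answers to "Where was <name> born?", each assumed to be
# obtained in a fresh context that does not contain the draft list.
verified_birthplace = {
    "Hillary Clinton": "Chicago",
    "Donald Trump": "New York City",
    "Theodore Roosevelt": "New York City",
}

def revise(draft, birthplaces, target="New York"):
    """Keep only draft entries whose verified birthplace mentions the target."""
    return [name for name in draft if target in birthplaces.get(name, "")]

final = revise(draft, verified_birthplace)
print(final)  # ['Donald Trump', 'Theodore Roosevelt']
```

The draft entry that fails its own verification question is dropped, which is the behavior the example describes.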

Key Novelty

Chain-of-Verification (CoVe)

Splits verification into four steps: (1) Draft response, (2) Plan verification questions, (3) Execute verifications independently, (4) Generate final verified response.
Crucially uses a 'factored' approach where verification questions are answered without attending to the original draft to prevent repeating hallucinations.

Architecture

Overview of the Chain-of-Verification (CoVe) method illustrating the four-step process on a specific example (politicians born in NY).

Evaluation Highlights

Increases FACTSCORE on longform biography generation to 71.4 (CoVe factor+revise) from 55.9 (Few-shot baseline), outperforming ChatGPT (58.7).
More than doubles precision on Wikidata list-based questions, from 0.17 (Llama 65B Few-shot) to 0.36 (CoVe two-step).
Improves F1 on closed-book MultiSpanQA by 23% relative to the few-shot baseline (0.39 -> 0.48) by increasing both precision and recall.

Breakthrough Assessment

8/10

Substantial gains across multiple tasks from changing only the prompting structure; on longform biographies CoVe outperforms ChatGPT (71.4 vs. 58.7 FACTSCORE) without external retrieval tools.

⚙️ Technical Details

Problem Definition

Setting: Closed-book Question Answering and Longform Text Generation

Inputs: User query q (e.g., 'Tell me a bio of <entity>')

Outputs: Verified text response free of factual hallucinations

Pipeline Flow

Baseline Response Generation (Draft)
Verification Planning (Generate Questions)
Verification Execution (Answer Questions)
Final Verified Response Generation
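The four stages above can be sketched as a single function. This is a minimal sketch and not the authors' code: `llm` is a hypothetical callable that maps a prompt string to a completion string, and the prompt wording is invented for illustration (the paper's own templates are in its Tables 5-9).

```python
from typing import Callable, List

LLM = Callable[[str], str]  # hypothetical interface: prompt in, completion out

def cove(query: str, llm: LLM) -> str:
    """Run draft -> plan -> execute -> revise with factored verification."""
    # 1. Baseline response (draft).
    draft = llm(f"Question: {query}\nAnswer:")

    # 2. Plan verification questions, conditioned on the query and the draft.
    plan = llm(
        f"Question: {query}\nDraft: {draft}\n"
        "List verification questions, one per line:"
    )
    questions: List[str] = [q.strip() for q in plan.splitlines() if q.strip()]

    # 3. Execute each question in its own prompt. The draft is deliberately
    #    left out so the model cannot copy its earlier mistakes.
    answers = [llm(f"Question: {q}\nAnswer:") for q in questions]

    # 4. Produce the final response from the draft plus the Q&A evidence.
    checks = "\n".join(f"Q: {q}\nA: {a}" for q, a in zip(questions, answers))
    return llm(
        f"Question: {query}\nDraft: {draft}\n"
        f"Verification:\n{checks}\nRevised answer:"
    )
```

Any text-completion client can be wrapped to fit the `LLM` signature; with n planned questions this sketch makes n + 3 model calls.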

System Modules

Baseline Generator

Generates an initial response to the query using standard few-shot prompting

Model or implementation: Llama 65B (base model)

Verification Planner (Verification)

Generates specific validation questions to check facts in the baseline response

Model or implementation: Llama 65B (base model)

Verification Executor (Verification)

Answers the generated verification questions. In 'Factored' mode, this is done independently without seeing the Baseline Response.

Model or implementation: Llama 65B (base model)

Response Reviser

Generates the final response by incorporating the verification Q&A pairs to correct inconsistencies

Model or implementation: Llama 65B (base model)

Novel Architectural Elements

Factored verification pipeline: Explicitly removing the original draft from the context during the verification answering step to prevent error propagation
Factor+Revise extension: An explicit intermediate step that cross-checks consistency between the draft and verification answers before final generation
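The Factor+Revise idea can be sketched as an explicit cross-check pass over draft facts. This is a simplified sketch under stated assumptions: `judge` is a hypothetical prompt-to-text callable, the prompt wording is invented, and pairing each draft fact with exactly one verification Q&A is a simplification made here for brevity.

```python
from typing import Callable, List, Tuple

def factor_revise(
    draft_facts: List[str],
    qa_pairs: List[Tuple[str, str]],
    judge: Callable[[str], str],
) -> List[str]:
    """Keep draft facts the judge labels consistent with their verification answer."""
    kept = []
    for fact, (question, answer) in zip(draft_facts, qa_pairs):
        # The cross-check is its own model call, separate from answering.
        verdict = judge(
            f"Claim: {fact}\nQ: {question}\nA: {answer}\n"
            "Is the claim consistent with the answer? Reply yes or no:"
        )
        if verdict.strip().lower().startswith("yes"):
            kept.append(fact)
    return kept
```

The facts that survive would then be passed to the final generation step; how inconsistencies are resolved in practice depends on the prompts used.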

Modeling

Base Model: Llama 65B (pretrained base model, not instruction tuned)

Reproducibility

Prompt templates are provided in the paper (Tables 5-9). Llama 65B and Llama 2 70B models are publicly available. The evaluation draws on public resources (Wikidata, QUEST, MultiSpanQA, FACTSCORE), although the Wikidata list questions are newly constructed. No code repository is explicitly linked.

📊 Experiments & Results

Evaluation Setup

Closed-book generation across three task types: list-based generation, QA, and longform biography generation.

Benchmarks:

Wikidata List Questions (List-based entity generation) [New]
Wiki-Category List (QUEST) (Set generation from categories)
MultiSpanQA (Reading-comprehension benchmark, used here closed-book)
Longform Biographies (Longform text generation)

Metrics:

Precision (micro-averaged)
F1 Score
FACTSCORE (atomic fact verification)
Statistical methodology: Not explicitly reported in the paper
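For readers unfamiliar with these metrics, the sketch below shows how each is computed once the correctness judgments exist. It is a simplification: the function names are illustrative, and the real FACTSCORE pipeline also handles atomic-fact decomposition and retrieval-based checking, which are assumed here to have already produced the `supported` flags.

```python
from typing import List, Sequence

def micro_precision(predictions: Sequence[Sequence[str]],
                    gold: Sequence[Sequence[str]]) -> float:
    """Pool all predicted items across questions, then divide correct by predicted."""
    correct = sum(len(set(p) & set(g)) for p, g in zip(predictions, gold))
    total = sum(len(set(p)) for p in predictions)
    return correct / total if total else 0.0

def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def fact_score(supported: List[bool]) -> float:
    """Percentage of atomic facts judged as supported."""
    return 100.0 * sum(supported) / len(supported) if supported else 0.0

# Two list questions: 1 of 3 predicted entities is correct overall.
print(micro_precision([["a", "b"], ["c"]], [["a"], ["d"]]))
print(fact_score([True, False, True, True]))  # 75.0
```

Micro-averaging pools items before dividing, so questions with long answer lists weigh more than they would under a per-question (macro) average.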

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
List-based tasks: CoVe substantially reduces hallucinated answers (negatives) compared to the few-shot baseline.
Wikidata List Questions	Precision	0.17	0.36	+0.19
Wiki-Category List	Precision	0.12	0.22	+0.10
Closed-book QA and longform generation: CoVe also improves correctness in more complex formats.
MultiSpanQA	F1	0.39	0.48	+0.09
Longform Biographies (Few-shot vs. CoVe factor+revise)	FACTSCORE	55.9	71.4	+15.5
Longform Biographies (CoVe joint vs. CoVe factored)	FACTSCORE	60.8	63.7	+2.9
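The Δ column and the relative improvements quoted elsewhere in this summary can be re-derived from the table with a few lines of arithmetic. The numbers below are copied from the table above; the `gains` helper is illustrative.

```python
# (label, baseline, CoVe) copied from the Key Results table.
rows = [
    ("Wikidata List Questions / Precision", 0.17, 0.36),
    ("Wiki-Category List / Precision", 0.12, 0.22),
    ("MultiSpanQA / F1", 0.39, 0.48),
    ("Longform Biographies / FACTSCORE (first row)", 55.9, 71.4),
    ("Longform Biographies / FACTSCORE (second row)", 60.8, 63.7),
]

def gains(baseline: float, improved: float):
    """Return (absolute delta, relative improvement in percent)."""
    delta = improved - baseline
    return delta, 100.0 * delta / baseline

for label, base, ours in rows:
    delta, rel = gains(base, ours)
    print(f"{label}: {delta:+.2f} absolute, {rel:+.1f}% relative")
```

The MultiSpanQA row works out to roughly a 23% relative gain, and the Wikidata precision gain exceeds 100% relative, that is, more than double the baseline.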

Experiment Figures

FACTSCORE performance distribution across head, torso, and tail facts (sorted by rarity) for different models.

Main Takeaways

Factored CoVe consistently outperforms Joint CoVe, confirming that preventing the model from attending to its original hallucinated draft during verification is crucial.
Shortform verification questions are answered far more accurately (approx. 70% correct) than the same facts in the original list-based response (approx. 17% correct on the Wikidata task), validating the core premise.
Explicitly reasoning about consistency (Factor+Revise) yields the largest gains in longform generation (+7.7 FACTSCORE over standard Factored).
Standard instruction tuning (Llama 2 Chat) and Chain-of-Thought prompting were less effective at reducing hallucinations than the CoVe approach on these tasks.

📚 Prerequisite Knowledge

Prerequisites

Large Language Models (LLMs) and Prompting
Hallucination in text generation
Chain-of-Thought (CoT) reasoning

Key Terms

CoVe: Chain-of-Verification, a four-step method where a model drafts, plans verification questions, answers them, and revises its output

Factored Verification: An execution strategy where verification questions are answered in separate contexts that do not contain the original potentially hallucinated draft

Joint Verification: An execution strategy where planning and answering happen in a single prompt/context (prone to repeating errors)

FACTSCORE: An automated metric that decomposes longform generations into atomic facts and verifies them using a retrieval-augmented model

Exposure Bias: The phenomenon where a model's generation errors accumulate because it conditions on its own previous (potentially incorrect) tokens

Few-shot prompting: Providing a model with a small number of example input-output pairs in the context window to guide its behavior
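As a concrete picture of few-shot prompting, the sketch below assembles a prompt from example pairs. The `Q:`/`A:` layout and the `few_shot_prompt` name are assumptions chosen for illustration, not the format used in the paper.

```python
from typing import List, Tuple

def few_shot_prompt(examples: List[Tuple[str, str]], query: str) -> str:
    """Concatenate solved examples, then leave the final answer slot open."""
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
    return f"{shots}\n\nQ: {query}\nA:"

prompt = few_shot_prompt(
    [("Where was Hillary Clinton born?", "Chicago"),
     ("Where was Donald Trump born?", "New York City")],
    "Where was Theodore Roosevelt born?",
)
print(prompt)
```

A base (non-instruction-tuned) model tends to continue the pattern, so the text it generates after the final `A:` is read as its answer.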