LLMs as Factual Reasoners: Insights from Existing Benchmarks and Beyond

📝 Paper Summary

Factual Consistency Evaluation, Benchmark Creation

Most LLMs struggle with factual reasoning and detecting inconsistencies in summaries when tested on a new, high-quality 10-domain benchmark (SUMM EDITS) created via a reproducible protocol.

Core Problem

Existing factual consistency benchmarks often suffer from unreliable labels and overly simple examples, and LLMs that appear accurate on them often fail to explain their reasoning correctly or to detect subtle errors.

Why it matters:

Unreliable benchmarks prevent accurate measurement of model progress: manual analysis found more than 6% of samples mislabeled in popular datasets such as AggreFact
Trusting LLMs for critical tasks (like summarizing medical records) requires them to not just classify correctly but to reason correctly about facts, which current metrics often fail to capture
Crowd-sourced annotations often lack reproducibility (low inter-annotator agreement), making it hard to distinguish model failure from label noise

Concrete Example: In the AggreFact benchmark, GPT-4 correctly identified inconsistencies in 101 summaries that the dataset creators had labeled 'consistent', indicating that the benchmark labels themselves contain errors. Conversely, models such as LLaMA-13B may guess the right binary label ('inconsistent') while giving an unrelated justification (e.g., complaining about the summary's format rather than its facts).

Key Novelty

SUMM EDITS Protocol & Benchmark

A 3-step protocol: Verify a seed summary, generate many atomic edits (using LLMs) that introduce specific errors, and have a human verify only the edits, ensuring high efficiency and reproducibility
Creation of SUMM EDITS: A 10-domain benchmark (Sales, Legal, News, etc.) where models must detect inconsistencies in minimal edits; estimated human performance is ~91%, and the evaluated models fall short of it

Architecture

The 3-step SUMM EDITS protocol for creating consistent benchmarks

Evaluation Highlights

GPT-4 achieves 82.4% balanced accuracy on SUMM EDITS, outperforming the best specialized non-LLM method (QAFactEval at 65.7%) but still trailing human performance (90.9%)
Most open-source LLMs perform near random chance (e.g., LLaMA-13B at 50-52%) on standard benchmarks once their reasoning, and not only their binary label, is taken into account
Inter-annotator agreement on SUMM EDITS is ~0.92 (Cohen's Kappa), significantly higher than prior benchmarks like DialSummEval (0.67)

Breakthrough Assessment

9/10

Establishes a new standard for factual consistency benchmarks with extremely high reproducibility. Exposes the 'reasoning gap' in LLMs that previous metrics masked.

⚙️ Technical Details

Problem Definition

Setting: Binary classification of factual consistency between a document and a generated summary

Inputs: Document D and Summary S

Outputs: Label: Consistent or Inconsistent (and optionally a natural language explanation)

Pipeline Flow

Seed Verification (Human checks D+S pair)
Edit Generation (LLM generates variations)
Edit Annotation (Human labels variations)
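
The three steps above can be sketched as a small data pipeline. This is only an illustrative sketch, not the authors' implementation: the names `Label`, `Sample`, and `build_samples` are hypothetical, the human and LLM steps are passed in as plain callables, and dropping borderline edits from the final set is an assumption made here for simplicity.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List


class Label(Enum):
    CONSISTENT = "consistent"
    INCONSISTENT = "inconsistent"
    BORDERLINE = "borderline"


@dataclass
class Sample:
    document: str
    summary: str
    label: Label


def build_samples(
    document: str,
    seed_summary: str,
    seed_is_consistent: Callable[[str, str], bool],    # step 1: human check of (D, S)
    generate_edits: Callable[[str], List[str]],         # step 2: LLM proposes edited summaries
    annotate_edit: Callable[[str, str, str], Label],    # step 3: human labels each edit
) -> List[Sample]:
    """Run the three steps for one document and return the labeled edits."""
    # Step 1: a seed that fails verification produces no samples.
    if not seed_is_consistent(document, seed_summary):
        return []

    samples: List[Sample] = []
    # Step 2: generate variations of the verified seed.
    for edited in generate_edits(seed_summary):
        # Step 3: the annotator sees the document, the seed, and the edit.
        label = annotate_edit(document, seed_summary, edited)
        # Assumption: borderline edits are excluded from the final set.
        if label is not Label.BORDERLINE:
            samples.append(Sample(document, edited, label))
    return samples
```

Passing the human and LLM steps as callables keeps the sketch runnable with stub functions; in practice each would be replaced by an annotation interface or a model call.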

System Modules

Seed Verifier (Data Creation)

Ensure base summary is flawless and consistent before editing

Model or implementation: Human Annotator

Edit Generator (Data Creation)

Create diverse, atomic modifications to the seed summary

Model or implementation: GPT-3.5-Turbo

Edit Annotator (Data Creation)

Label each edited summary as Consistent, Inconsistent, or Borderline

Model or implementation: Human Annotator
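
To make the idea of an atomic modification concrete, the sketch below uses Python's standard `difflib` to list the word-level spans that differ between a seed summary and an edited one. The helper names `changed_spans` and `is_atomic` are hypothetical, and treating "exactly one contiguous changed span" as atomic is a heuristic assumed here, not a criterion taken from the paper.

```python
import difflib
from typing import List, Tuple


def changed_spans(seed: str, edited: str) -> List[Tuple[str, str, str]]:
    """Return (operation, seed_text, edited_text) for each word-level change."""
    a, b = seed.split(), edited.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (tag, " ".join(a[i1:i2]), " ".join(b[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"  # keep only replaced, inserted, or deleted spans
    ]


def is_atomic(seed: str, edited: str) -> bool:
    """Heuristic (assumed): an edit is atomic if it changes one contiguous span."""
    return len(changed_spans(seed, edited)) == 1


seed = "The company reported revenue of 5 million dollars in 2021"
edited = "The company reported revenue of 5 million dollars in 2022"
print(changed_spans(seed, edited))  # [('replace', '2021', '2022')]
```

Showing the annotator only this changed span next to the verified seed is what lets the third step be a quick, local judgment rather than a re-read of the full summary.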

Novel Architectural Elements

A protocol that decouples seed verification from edit verification, which increases annotation efficiency (reported as 20x more cost-effective) and agreement (IAA above 0.9)

Modeling

Base Model: Various (GPT-4, GPT-3.5, Claude, PaLM-2, LLaMA, etc.) evaluated as judges

Reproducibility

Code: https://github.com/salesforce/factualNLG

📊 Experiments & Results

Evaluation Setup

Zero-shot binary classification of summary consistency across 10 domains
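
A zero-shot setup of this kind might look like the sketch below. The prompt wording, `build_prompt`, and `parse_label` are all hypothetical: the summary does not give the prompts used in the paper, and the model call itself is left out so the example stays self-contained.

```python
import re
from typing import Optional

# Hypothetical prompt wording; not the prompt used in the paper.
PROMPT_TEMPLATE = (
    "Document:\n{document}\n\n"
    "Summary:\n{summary}\n\n"
    "Is every fact in the summary supported by the document? "
    "Answer with a single word: consistent or inconsistent."
)


def build_prompt(document: str, summary: str) -> str:
    """Fill the zero-shot template with one document-summary pair."""
    return PROMPT_TEMPLATE.format(document=document, summary=summary)


def parse_label(response: str) -> Optional[bool]:
    """Map a model response to True (consistent), False (inconsistent), or None."""
    # Whole-word match, so "consistent" is not found inside "inconsistent".
    match = re.search(r"\b(inconsistent|consistent)\b", response.lower())
    if match is None:
        return None  # unparseable responses are left for the caller to handle
    return match.group(1) == "consistent"
```

The parser takes the first label word it finds, so a response such as "not consistent" would be misread; a stricter output format or a more careful parser would be needed in a real evaluation.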

Benchmarks:

SUMM EDITS (Factual Inconsistency Detection) [New]
FactCC (Factual Inconsistency Detection (Synthetic))
AggreFact (Factual Inconsistency Detection (SOTA models))

Metrics:

Balanced Accuracy
Explanation Quality (Manual Analysis)
Statistical methodology: Not explicitly reported in the paper

Key Results

On the new SUMM EDITS benchmark, most LLMs perform poorly, with only the largest models beating specialized baselines, and all trailing human performance.

Benchmark	Metric	Baseline	This Paper	Δ
SUMM EDITS (Overall)	Balanced Accuracy	65.7	82.4	+16.7
SUMM EDITS (Overall)	Balanced Accuracy	90.9	82.4	-8.5
SUMM EDITS (Overall)	Balanced Accuracy	65.7	59.8	-5.9
SUMM EDITS (Overall)	Balanced Accuracy	65.7	71.3	+5.6

Analysis of FactCC reveals that high accuracy does not imply correct reasoning; manual evaluation of explanations shows high failure rates.

Benchmark	Metric	Baseline	This Paper	Δ
FactCC	% Correct Explanations	0	32	+32
FactCC	% Correct Explanations	0	58	+58

Experiment Figures

Distribution of explanation types (Correct, Partially Correct, Unrelated, Incorrect, No Explanation) for various LLMs on FactCC

Leaderboard on SUMM EDITS benchmark comparing LLMs and non-LLMs to Human Performance

Main Takeaways

Accuracy metrics can be deceptive: Models like GPT-3.5 perform well on binary classification but often for the wrong reasons (unrelated or incorrect explanations)
Specialized, smaller models (QAFactEval, SummaC) remain competitive with many LLMs (like Claude v1.3, Cohere) on challenging benchmarks
The SUMM EDITS protocol substantially improves annotator agreement (0.92) compared to previous efforts, suggesting that checking atomic edits is more reliable than judging full summaries
Most LLMs fail to generalize factual reasoning across domains, performing near random chance on complex domains like Shakespeare or Sales in SUMM EDITS

📚 Prerequisite Knowledge

Prerequisites

Understanding of abstractive summarization
Familiarity with hallucination/consistency issues in NLG
Basic knowledge of LLM prompting strategies (Zero-shot, CoT)

Key Terms

Inconsistency Detection (ID): The task of determining whether a summary contains claims that are unsupported by, or that contradict, the source document

QAFactEval: A specialized non-LLM metric that checks consistency by generating questions from the summary and checking whether the answers found in the document match those in the summary

SUMM EDITS: The new benchmark proposed in this paper, consisting of document-summary pairs with atomic edits labeled for consistency

Inter-Annotator Agreement (IAA): A statistical measure (like Cohen's Kappa) of how much multiple human annotators agree on labels, used here to validate benchmark quality

Atomic Edits: Small, localized changes to a text (like swapping a date or entity) rather than rewriting the whole text, used to create controlled test cases

Chain-of-Thought (CoT): A prompting strategy asking the model to generate step-by-step reasoning before the final answer

Balanced Accuracy: The arithmetic mean of sensitivity (true positive rate) and specificity (true negative rate), used to evaluate performance on imbalanced datasets
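
The two statistics defined above, balanced accuracy and Cohen's Kappa, can be computed in a few lines. The sketch below is a plain-Python illustration with hypothetical function names; it assumes binary labels encoded as 1 and 0 for balanced accuracy and two annotators labeling the same items for Kappa. It is not the evaluation code from the paper's repository.

```python
from collections import Counter
from typing import Hashable, Sequence


def balanced_accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Mean of true positive rate and true negative rate (labels are 1 or 0)."""
    pos = sum(1 for t in y_true if t == 1)
    neg = len(y_true) - pos
    if pos == 0 or neg == 0:
        raise ValueError("both classes must be present in y_true")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return 0.5 * (tp / pos + tn / neg)


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Agreement between two annotators, corrected for chance agreement."""
    n = len(a)
    if n == 0 or n != len(b):
        raise ValueError("annotations must be non-empty and of equal length")
    observed = sum(1 for x, y in zip(a, b) if x == y) / n
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: probability both annotators pick the same label at random.
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1.0 - expected)


# A classifier that always predicts 1 on an 80/20 split has 80% raw accuracy
# but only 50% balanced accuracy.
y_true = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
y_pred = [1] * 10
print(balanced_accuracy(y_true, y_pred))  # 0.5
```

Because balanced accuracy averages the per-class rates, it does not matter which of the two classes is encoded as 1; a constant-prediction baseline scores 50% either way, which is the "random chance" reference point used in the results.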