ChatGPT as a Factual Inconsistency Evaluator for Text Summarization

📝 Paper Summary

LLM-as-a-Judge, Factual Inconsistency Detection, Text Summarization Evaluation

ChatGPT outperforms existing metrics in evaluating factual consistency of text summaries in a zero-shot setting, though it exhibits biases toward lexical overlap and struggles with reasoning.

Core Problem

Existing methods for evaluating factual consistency in summarization (like NLI or QA-based metrics) are computationally expensive, rely on annotated data, or correlate poorly with human judgments.

Why it matters:

Pre-trained language models often generate summaries with 'hallucinations'—content not supported by the source document—undermining trust in automated summarization.
Current state-of-the-art evaluation metrics (like FactCC, SummaC) require training on large datasets or complex pipelines, yet still show limited agreement with human annotators.

Concrete Example: In a CoGenSumm example, a summary claims 'half of' a group was affected, while the article implies the whole group. Traditional metrics and ChatGPT (in simple mode) fail to catch this due to high lexical overlap, but ChatGPT correctly identifies the error when forced to rank it against a correct summary.

Key Novelty

Zero-Shot Factual Inconsistency Evaluator using ChatGPT

Reframe factual evaluation as three distinct tasks for an LLM: binary entailment (yes/no), summary ranking (A vs B), and scalar rating (1-10), without any model fine-tuning.
Employ Zero-Shot Chain-of-Thought (CoT) prompting to trigger reasoning capabilities, asking the model to 'think step by step' before delivering a consistency verdict.
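The three task framings above can be sketched as simple prompt builders. This is a minimal illustration only: the prompt wording is an assumption and not the paper's exact templates, and the names `COT_SUFFIX`, `entailment_prompt`, `ranking_prompt`, and `rating_prompt` are hypothetical.

```python
# Illustrative prompt builders. The wording is an assumption made for this
# sketch; it is NOT the paper's exact prompt templates.
COT_SUFFIX = "Explain your reasoning step by step, then answer (yes or no):"


def entailment_prompt(article: str, summary: str, cot: bool = False) -> str:
    """Binary entailment: ask for a yes/no consistency verdict."""
    if cot:
        instruction = COT_SUFFIX  # zero-shot chain-of-thought variant
    else:
        instruction = "Answer (yes or no):"
    return (
        "Decide if the following summary is consistent with the article.\n"
        f"Article: {article}\nSummary: {summary}\n{instruction}"
    )


def ranking_prompt(article: str, summary_a: str, summary_b: str) -> str:
    """Summary ranking: ask which of two candidates is more consistent."""
    return (
        "Decide which of the two summaries is more consistent with the article.\n"
        f"Article: {article}\nSummary A: {summary_a}\nSummary B: {summary_b}\n"
        "Answer (A or B):"
    )


def rating_prompt(article: str, summary: str) -> str:
    """Scalar rating: ask for a 1-10 consistency score."""
    return (
        "Score the following summary for consistency with the article "
        "on a scale from 1 to 10.\n"
        f"Article: {article}\nSummary: {summary}\nScore:"
    )
```

No fine-tuning is involved in any of the three framings; only the question posed to the model changes.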

Architecture

A bar chart breaking down the Balanced Accuracy of ChatGPT ZS-COT into Sensitivity (true positive rate) and Specificity (true negative rate) across six datasets.

Evaluation Highlights

Outperforms SummaC ZS by +3.9 balanced accuracy points on CoGenSumm and +1.6 points on SummEval in the binary entailment task when using Chain-of-Thought prompting.
Achieves 85.2% accuracy in Summary Ranking, surpassing both supervised baselines like FactCC (70.0%) and human performance (83.9%) on the tested dataset.
Achieves the highest Pearson correlation with human judgments on the FRANK dataset for consistency rating (0.70 vs. 0.20 for the next-best metric).

Breakthrough Assessment

7/10

Strong empirical evidence that off-the-shelf LLMs outperform specialized fine-tuned metrics for factuality. However, the study identifies critical flaws (reasoning errors, lexical bias) preventing full reliance.

⚙️ Technical Details

Problem Definition

Setting: Given a source document D and a generated summary S, determine if S is factually consistent with (entailed by) D.

Inputs: A source article and a candidate summary.

Outputs: Binary label (Consistent/Inconsistent), a ranking preference, or a scalar score (1-10).

Pipeline Flow

Input Construction (Prompt + Article + Summary)
ChatGPT Inference (Zero-shot or Zero-shot CoT)
Output Parsing (Extract Yes/No, A/B, or Score)
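The output parsing step can be sketched as follows. The paper's summary does not describe how verdicts are extracted, so the regular expressions and the function names `parse_yes_no`, `parse_choice`, and `parse_score` are assumptions; free-text responses vary, and a real parser would likely need more cases.

```python
import re
from typing import Optional


def parse_yes_no(response: str) -> Optional[bool]:
    """Return True for 'yes', False for 'no', None if neither appears.

    Assumption of this sketch: with chain-of-thought the verdict follows
    the reasoning, so the LAST yes/no token in the response is used.
    """
    matches = re.findall(r"\b(yes|no)\b", response.lower())
    if not matches:
        return None
    return matches[-1] == "yes"


def parse_choice(response: str) -> Optional[str]:
    """Return 'A' or 'B' for a ranking response, or None if unclear."""
    text = response.strip()
    bare = text.rstrip(".")
    if bare in ("A", "B"):  # the model answered with the letter alone
        return bare
    # Otherwise require an explicit "Summary A" / "Answer: B" style phrase,
    # so that the English article "A" is not mistaken for a choice.
    match = re.search(r"\b(?:Summary|Answer:?)\s+([AB])\b", text)
    return match.group(1) if match else None


def parse_score(response: str, low: int = 1, high: int = 10) -> Optional[int]:
    """Return the first integer within [low, high], or None if absent.

    Naive by design: a number quoted from the article before the score
    would be picked up instead.
    """
    for token in re.findall(r"\d+", response):
        value = int(token)
        if low <= value <= high:
            return value
    return None
```

Returning `None` rather than guessing lets unparseable responses be counted separately instead of silently becoming a label.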

System Modules

Prompt Constructor

Formats the task as a natural language question

Model or implementation: N/A (Prompt Template)

Evaluator

Generates the evaluation verdict

Model or implementation: ChatGPT (gpt-3.5-turbo-0301)

Novel Architectural Elements

Application of Zero-shot Chain-of-Thought specifically for factual consistency detection in summarization [not an architectural change, but a prompting strategy application]

Modeling

Base Model: ChatGPT (gpt-3.5-turbo-0301)

Compute: Not reported in the paper (Inference only via API)

Comparison to Prior Work

vs. FactCC/SummaC: ChatGPT requires no task-specific training or fine-tuning (Zero-shot).
vs. QuestEval/DAE: ChatGPT operates as a generative model providing reasoning (with CoT) rather than a numeric score derived from sub-components.
vs. BARTScore [not cited in paper]: ChatGPT uses prompt-based reasoning rather than evaluating the likelihood of generation.

Limitations

Lexical Bias: ChatGPT tends to label summaries with high lexical overlap as consistent, even when semantically incorrect.
False Reasoning: In CoT mode, the model sometimes generates reasoning that contradicts the source text to support a false 'Consistent' label.
Prompt Sensitivity: The model occasionally fails to follow constraints, such as rating a factually correct summary '1/10' because the summary omits other information from the article, even though the prompt asks only about consistency.
Cost/Access: Reliance on closed-source API (OpenAI).
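To make the lexical bias concrete, here is a minimal sketch of one way to measure overlap. The measure (fraction of summary word types found in the article) and the function name `unigram_overlap` are choices made for this illustration, and the sentence pair is made up rather than taken from any dataset.

```python
import re
from typing import Set


def _word_types(text: str) -> Set[str]:
    """Lowercased set of word tokens in the text."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def unigram_overlap(summary: str, article: str) -> float:
    """Fraction of summary word types that also appear in the article."""
    summary_words = _word_types(summary)
    if not summary_words:
        return 0.0
    return len(summary_words & _word_types(article)) / len(summary_words)


# Made-up example (not from any dataset): one quantifier flips the meaning.
article = "All of the passengers on the ferry were rescued by the coast guard."
wrong = "Half of the passengers on the ferry were rescued by the coast guard."

# 10 of the 11 word types in the wrong summary occur in the article.
print(round(unigram_overlap(wrong, article), 2))
```

The inconsistent summary shares roughly 91% of its vocabulary with the article, which is the kind of surface similarity that can mislead an evaluator relying on overlap.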

Reproducibility

Prompt templates are provided in the paper. The exact model version (gpt-3.5-turbo-0301) is specified. Code is not provided. Dataset splits follow the SummaC benchmark standard.

📊 Experiments & Results

Evaluation Setup

Evaluation on three tasks: Entailment Inference (Binary), Summary Ranking (Pairwise), and Consistency Rating (1-10 Scale).

Benchmarks:

SummaC Benchmark (Binary Consistency Classification)
Falke et al. (2019) Dataset (Summary Ranking)
SummEval & FRANK (Consistency Rating; correlation with human judgment)

Metrics:

Balanced Accuracy (bACC)
Ranking Accuracy
Pearson/Spearman/Kendall Correlation
Statistical methodology: Not explicitly reported in the paper
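Balanced accuracy is the mean of the recall on each class, which keeps a majority-class predictor from scoring well on imbalanced data. A minimal sketch follows; encoding "consistent" as 1 (the positive class) is an assumption of this example, and the function names are hypothetical.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(
    y_true: Sequence[int], y_pred: Sequence[int]
) -> Tuple[float, float]:
    """Recall on the positive class (1) and on the negative class (0).

    Assumption of this sketch: 1 = consistent, 0 = inconsistent.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    pos = sum(1 for t in y_true if t == 1)
    neg = sum(1 for t in y_true if t == 0)
    if pos == 0 or neg == 0:
        raise ValueError("both classes must be present in y_true")
    return tp / pos, tn / neg


def balanced_accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Mean of sensitivity and specificity."""
    sens, spec = sensitivity_specificity(y_true, y_pred)
    return (sens + spec) / 2


# Toy labels: four consistent summaries, two inconsistent ones.
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 1, 1, 0]  # one inconsistent summary is missed
print(sensitivity_specificity(y_true, y_pred))  # (1.0, 0.5)
print(balanced_accuracy(y_true, y_pred))        # 0.75
```

On these toy labels plain accuracy would be 5/6, while balanced accuracy drops to 0.75 because half of the inconsistent summaries are missed.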

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Entailment Inference (Binary Classification): ChatGPT with CoT generally outperforms baselines, especially on datasets with extractive summaries (CNN/DM based).
CoGenSumm	Balanced Accuracy	70.4	74.3	+3.9
SummEval	Balanced Accuracy	81.7	83.3	+1.6
FactCC	Balanced Accuracy	89.5	79.5	-10.0
Summary Ranking: ChatGPT demonstrates superior ability to distinguish consistent from inconsistent summaries in pairwise comparisons.
Falke et al. (2019)	Ranking Accuracy	83.9	85.2	+1.3
Consistency Rating: ChatGPT's correlations with human judgments are substantially higher than those of traditional metrics.
FRANK	Pearson Correlation	0.20	0.70	+0.50
SummEval	Pearson Correlation	0.32	0.49	+0.17
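The correlation figures above compare model ratings against human ratings over a set of summaries. The following pure-Python sketch shows how Pearson and Spearman correlations can be computed; it is an illustration under the standard definitions, not the paper's evaluation code, and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length sequences with >= 2 items")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))
```

Tie handling matters here because 1-10 ratings produce many repeated values; averaging tied ranks is the usual convention.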

Main Takeaways

Zero-shot Chain-of-Thought (CoT) prompting substantially boosts performance over standard zero-shot prompting (e.g., +11 balanced accuracy points on CoGenSumm).
Taking 'consistent' as the positive class, ChatGPT has high sensitivity (it recognizes consistent summaries well) but lower specificity (it misses some inconsistent summaries), often due to reliance on lexical overlap.
Performance drops on highly abstractive summaries (e.g., XSum data) where lexical overlap is low, causing the model to predict inconsistency more often.
Despite failures in binary classification for subtle errors, ChatGPT can often identify the correct summary when presented with a pairwise ranking task, suggesting the signal exists but requires the right elicitation method (prompt).

📚 Prerequisite Knowledge

Prerequisites

Natural Language Inference (NLI) concepts (premise/hypothesis entailment)
Text Summarization evaluation metrics
Zero-shot prompting and Chain-of-Thought (CoT)

Key Terms

NLI: Natural Language Inference, the task of determining whether a hypothesis is entailed by, contradicted by, or neutral with respect to a premise.

Zero-shot CoT: Zero-shot Chain-of-Thought, prompting a model to 'think step by step' without providing any in-context examples, to improve reasoning.

FactCC: A BERT-based metric fine-tuned on synthetic data to classify summary consistency.

SummaC: A benchmark and NLI-based method for consistency detection that aggregates sentence-pair scores.

QAGS: Question Answering and Generation for Summarization—an evaluation metric that checks if questions generated from the summary can be answered by the source.

Lexical overlap: The degree to which two texts share the exact same words.

Hallucination: When a model generates information not present in or contradicted by the source text.