Evaluation Setup
The evaluation covers three tasks: Entailment Inference (binary), Summary Ranking (pairwise), and Consistency Rating (Likert scale).
Benchmarks:
- SUMMAC Benchmark (Binary Consistency Classification)
- Falke et al. (2019) Dataset (Summary Ranking)
- SummEval & FRANK (Consistency Rating, measured as correlation with human judgment)
Metrics:
- Balanced Accuracy (bACC)
- Ranking Accuracy
- Pearson/Spearman/Kendall Correlation
- Statistical methodology: Not explicitly reported in the paper
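
As a rough illustration of the metrics listed above, the following sketch computes balanced accuracy, pairwise ranking accuracy, and Pearson correlation in plain Python. The function names, the label convention (1 = consistent, 0 = inconsistent), and the toy inputs are assumptions made for this example; they are not taken from the paper, which may compute these metrics with different tooling.

```python
from statistics import mean


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall for binary labels.

    Assumed convention: 1 = consistent, 0 = inconsistent.
    """
    recalls = []
    for cls in (0, 1):
        idx = [i for i, y in enumerate(y_true) if y == cls]
        if not idx:
            raise ValueError(f"no examples of class {cls}")
        recalls.append(sum(y_pred[i] == cls for i in idx) / len(idx))
    return mean(recalls)


def ranking_accuracy(pairs):
    """Fraction of pairs where the consistent summary scores higher.

    Each pair is (score_of_consistent, score_of_inconsistent).
    """
    return sum(good > bad for good, bad in pairs) / len(pairs)


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


# Toy data for illustration only (not results from the paper).
print(balanced_accuracy([1, 1, 1, 1, 0, 0], [1, 1, 1, 0, 0, 1]))  # 0.625
print(ranking_accuracy([(0.9, 0.2), (0.4, 0.6), (0.7, 0.1), (0.8, 0.3)]))  # 0.75
print(round(pearson([1, 2, 3, 4, 5], [2, 4, 6, 8, 10]), 3))  # 1.0
```

Balanced accuracy averages the recall of each class, so a classifier that always predicts the majority label scores 0.5 rather than the majority-class rate. That makes it a natural fit for benchmarks with skewed label distributions.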
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Entailment Inference (Binary Classification): ChatGPT with CoT generally outperforms baselines, especially on datasets with extractive summaries (CNN/DM based). | | | | |
| CoGenSumm | Balanced Accuracy | 70.4 | 74.3 | +3.9 |
| SummEval | Balanced Accuracy | 81.7 | 83.3 | +1.6 |
| FactCC | Balanced Accuracy | 89.5 | 79.5 | -10.0 |
| Summary Ranking: ChatGPT demonstrates superior ability to distinguish consistent from inconsistent summaries in pairwise comparisons. | | | | |
| Falke et al. (2019) | Ranking Accuracy | 83.9 | 85.2 | +1.3 |
| Consistency Rating: ChatGPT correlations with human judgments are significantly higher than traditional metrics. | | | | |
| FRANK | Pearson Correlation | 0.20 | 0.70 | +0.50 |
| SummEval | Pearson Correlation | 0.32 | 0.49 | +0.17 |
Main Takeaways
- Zero-shot Chain-of-Thought (CoT) prompting substantially improves performance over standard zero-shot prompting (e.g., +11% on CoGenSumm).
- ChatGPT has high specificity (it rejects inconsistent summaries well) but lower sensitivity (it wrongly flags some consistent summaries as inconsistent), often because it relies on lexical overlap.
- Performance drops on highly abstractive summaries (e.g., XSum data) where lexical overlap is low, causing the model to predict inconsistency more often.
- Despite failing on subtle errors in binary classification, ChatGPT can often identify the correct summary in a pairwise ranking task, suggesting that the signal exists but needs the right elicitation method (prompt) to surface.
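
To make the sensitivity/specificity distinction in the takeaways concrete, here is a minimal sketch that assumes "consistent" is the positive class (label 1), so that specificity measures how well inconsistent summaries are rejected. The helper name and the toy predictions are hypothetical and only illustrate a classifier that leans toward predicting "inconsistent"; they are not data from the paper.

```python
def sensitivity_specificity(y_true, y_pred, positive=1):
    """Return (sensitivity, specificity) for binary labels.

    Assumed convention: positive = consistent (1), negative = inconsistent (0).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    return tp / (tp + fn), tn / (tn + fp)


# Toy example: the classifier over-predicts "inconsistent" (0).
gold = [1, 1, 1, 1, 0, 0, 0, 0]
pred = [1, 1, 0, 0, 0, 0, 0, 0]

sens, spec = sensitivity_specificity(gold, pred)
print(sens, spec, (sens + spec) / 2)  # 0.5 1.0 0.75
```

In this toy case every inconsistent summary is rejected (specificity 1.0), but half of the consistent ones are also rejected (sensitivity 0.5). Balanced accuracy, the mean of the two, lands at 0.75.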