Evaluation Setup
Closed-book generation across three task types: list-based generation, QA, and longform biography generation.
Benchmarks:
- Wikidata List Questions (List-based entity generation) [New]
- Wiki-Category List (QUEST) (Set generation from categories)
- MultiSpanQA (Closed-book Reading Comprehension)
- Longform Biographies (Longform text generation)
Metrics:
- Precision (micro-averaged)
- F1 Score
- FACTSCORE (atomic fact verification)
- Statistical methodology: Not explicitly reported in the paper
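To make the "micro-averaged" qualifier concrete, here is a minimal sketch of micro-averaged precision for list-style answers. It assumes each example is a pair of sets (predicted entities, gold entities) and that matching is exact string equality. The function name and the toy data are illustrative, not taken from the paper.

```python
from typing import Iterable, Set, Tuple


def micro_precision(examples: Iterable[Tuple[Set[str], Set[str]]]) -> float:
    """Pool correct and predicted counts over all questions, then divide once."""
    correct = 0
    predicted = 0
    for predicted_entities, gold_entities in examples:
        correct += len(predicted_entities & gold_entities)
        predicted += len(predicted_entities)
    # Returning 0.0 for an empty prediction pool is a convention chosen for this sketch.
    return correct / predicted if predicted else 0.0


# Toy data (illustrative only): 1 of 4 predictions correct, then 1 of 1 correct.
examples = [
    ({"A", "B", "C", "D"}, {"A", "X"}),
    ({"E"}, {"E"}),
]
print(micro_precision(examples))  # 2 correct out of 5 predicted -> 0.4
```

Because counts are pooled before dividing, questions with long predicted lists weigh more than they would under a per-question (macro) average, which would give (0.25 + 1.0) / 2 = 0.625 on the same toy data.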
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Performance on list-based tasks shows CoVe significantly reduces hallucinations (negatives) compared to baselines. | | | | |
| Wikidata List Questions | Precision | 0.17 | 0.36 | +0.19 |
| Wiki-Category List | Precision | 0.12 | 0.22 | +0.10 |
| Results on Closed-book QA and Longform generation demonstrate CoVe's ability to improve correctness in more complex formats. | | | | |
| MultiSpanQA | F1 | 0.39 | 0.48 | +0.09 |
| Longform Biographies | FACTSCORE | 55.9 | 71.4 | +15.5 |
| Longform Biographies | FACTSCORE | 60.8 | 63.7 | +2.9 |
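The Δ column is the score reported for this paper minus the baseline score. The short sketch below recomputes it from the figures in the table and also derives a relative gain, which the summary does not report. The `gains` helper is a hypothetical name introduced for illustration.

```python
from typing import Tuple


def gains(baseline: float, improved: float) -> Tuple[float, float]:
    """Return (absolute gain, relative gain) of `improved` over `baseline`."""
    absolute = improved - baseline
    relative = absolute / baseline
    return absolute, relative


# (benchmark, metric, baseline, this paper), copied from the table above.
rows = [
    ("Wikidata List Questions", "Precision", 0.17, 0.36),
    ("Wiki-Category List", "Precision", 0.12, 0.22),
    ("MultiSpanQA", "F1", 0.39, 0.48),
    ("Longform Biographies", "FACTSCORE", 55.9, 71.4),
    ("Longform Biographies", "FACTSCORE", 60.8, 63.7),
]

for name, metric, baseline, improved in rows:
    absolute, relative = gains(baseline, improved)
    print(f"{name} ({metric}): {absolute:+.2f} absolute, {relative:+.1%} relative")
```

Note that precision and F1 are on a 0 to 1 scale while FACTSCORE is on a 0 to 100 scale, so absolute gains are only comparable within a metric.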
Main Takeaways
- Factored CoVe consistently outperforms Joint CoVe, confirming that preventing the model from attending to its original hallucinated draft during verification is crucial.
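- Shortform verification questions are answered more accurately (approx. 70% accuracy) than the entities in the original list-based generation (approx. 17% accuracy), validating the core premise that the model is more reliable on targeted fact-checking questions than on the initial response.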
- Explicitly reasoning about consistency (Factor+Revise) yields the largest gains in longform generation (+7.7 FACTSCORE over standard Factored).
- Standard instruction tuning (Llama 2 Chat) and Chain-of-Thought prompting were less effective at reducing hallucinations than the CoVe approach on these tasks.
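The difference between the Joint and Factored variants in the first takeaway can be sketched as follows. This is a simplified illustration, not the paper's prompts: `LLM` is a hypothetical callable that maps a prompt string to a completion, and the prompt wording and function names are assumptions made for the example.

```python
from typing import Callable, List

# Hypothetical interface: any function mapping a prompt string to a completion string.
LLM = Callable[[str], str]


def joint_verify(llm: LLM, query: str, draft: str, questions: List[str]) -> str:
    """Joint variant (simplified): the draft stays in the verification context,
    so the model can attend to, and repeat, its own earlier errors."""
    prompt = (
        f"Query: {query}\n"
        f"Draft answer: {draft}\n"
        "Answer each verification question:\n"
        + "\n".join(f"- {q}" for q in questions)
    )
    return llm(prompt)


def factored_verify(llm: LLM, questions: List[str]) -> List[str]:
    """Factored variant (simplified): one independent call per question,
    with neither the query nor the draft included in the prompt."""
    return [llm(q) for q in questions]


# Demo with a stand-in model that records the prompts it receives.
prompts_seen: List[str] = []


def recording_llm(prompt: str) -> str:
    prompts_seen.append(prompt)
    return "stub answer"


draft = "A draft that may contain a hallucinated claim."
questions = ["Where was person X born?", "When was person X born?"]

factored_verify(recording_llm, questions)
# No factored prompt contains the draft text.
print(all(draft not in p for p in prompts_seen))
```

In this sketch the isolation comes entirely from what is left out of the prompt: the factored calls see only the verification question, so a hallucination in the draft cannot be copied into the verification answer.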