Evaluation Setup
Detection and editing of hallucinations in LM-generated text using Wikipedia as a knowledge source
Benchmarks:
- FavaBench (Fine-grained hallucination detection and editing) [New]
Metrics:
- Macro F1 (Fine-grained detection)
- FActScore (Factuality of edited text)
- Binary F1 (Binary detection)
- Statistical methodology: Inter-annotator agreement calculated using Cohen's kappa
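The detection metrics and the agreement statistic listed above can be sketched in plain Python. This is a minimal illustration, not the paper's evaluation script: it assumes each item (e.g. a sentence) carries a single label, and the label names (`"entity"`, `"invented"`, `"none"`) and the function names are hypothetical. Macro F1 is the unweighted mean of per-label F1, binary F1 collapses all error labels into one positive class, and Cohen's kappa corrects observed agreement between two annotators for chance agreement.

```python
from collections import Counter


def f1_for_label(gold, pred, label):
    """F1 for one label, treating it as the positive class."""
    tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero/undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(gold, pred, labels):
    """Unweighted mean of per-label F1 over the given error labels."""
    return sum(f1_for_label(gold, pred, lab) for lab in labels) / len(labels)


def binary_f1(gold, pred, negative="none"):
    """F1 for 'contains any error' vs. 'no error' (assumed negative label)."""
    gold_bin = [g != negative for g in gold]
    pred_bin = [p != negative for p in pred]
    return f1_for_label(gold_bin, pred_bin, True)


def cohen_kappa(ann_a, ann_b):
    """Cohen's kappa for two annotators labeling the same items."""
    n = len(ann_a)
    observed = sum(1 for a, b in zip(ann_a, ann_b) if a == b) / n
    count_a, count_b = Counter(ann_a), Counter(ann_b)
    # Chance agreement: product of each annotator's marginal label rates.
    expected = sum(count_a[lab] * count_b[lab]
                   for lab in set(ann_a) | set(ann_b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Toy data with hypothetical labels, for illustration only.
gold = ["entity", "none", "invented", "none"]
pred = ["entity", "invented", "invented", "none"]
print(macro_f1(gold, pred, ["entity", "invented"]))
print(binary_f1(gold, pred))
print(cohen_kappa(["y", "y", "n", "n"], ["y", "n", "n", "n"]))
```

Macro averaging weights every error type equally, so rare types count as much as frequent ones; that is why the fine-grained and binary scores can diverge for the same system.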
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Fava significantly outperforms baselines on the fine-grained detection task. | | | | |
| FavaBench | Macro F1 | 30.8 | 54.5 | +23.7 |
| FavaBench | Macro F1 | 39.8 | 54.5 | +14.7 |
| In binary detection settings, Fava remains superior to specialized systems and strong LLMs. | | | | |
| FavaBench | Binary F1 | 51.1 | 62.4 | +11.3 |
| Editing capabilities show Fava improves the factuality of various model outputs. | | | | |
| FavaBench (Alpaca 13B outputs) | FActScore Improvement | 0.0 | 9.3 | +9.3 |
| FavaBench (ChatGPT outputs) | FActScore Improvement | 0.0 | 3.3 | +3.3 |
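As a quick consistency check, the Δ column in the table above is the reported score minus the baseline score. The sketch below recomputes it from the table's own numbers; the `delta` helper and the tuple layout are illustrative assumptions, and rounding to one decimal is used only to avoid floating-point noise.

```python
# (benchmark, metric, baseline, this_paper) copied from the results table.
rows = [
    ("FavaBench", "Macro F1", 30.8, 54.5),
    ("FavaBench", "Macro F1", 39.8, 54.5),
    ("FavaBench", "Binary F1", 51.1, 62.4),
    ("FavaBench (Alpaca 13B outputs)", "FActScore Improvement", 0.0, 9.3),
    ("FavaBench (ChatGPT outputs)", "FActScore Improvement", 0.0, 3.3),
]


def delta(baseline, this_paper):
    """Absolute improvement in points, rounded to one decimal place."""
    return round(this_paper - baseline, 1)


for benchmark, metric, baseline, this_paper in rows:
    print(f"{benchmark} | {metric} | {delta(baseline, this_paper):+.1f}")
```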
Main Takeaways
- Fine-grained detection is necessary: over 60% of hallucinations are not simple entity errors (e.g., unverifiable or subjective statements).
- Synthetic data is effective: Training on data where errors are artificially injected by strong models allows a 7B model to outperform GPT-4 on this task.
- Retrieval is crucial: Adding retrieval context significantly aids in detection, but the fine-grained taxonomy allows the model to handle cases where retrieval fails (e.g., marking as 'Unverifiable' or 'Invented').