Evaluation Setup
Zero-shot hallucination detection on multiple benchmarks
Benchmarks:
- ANAH (in-domain; fine-grained hallucination annotation)
- HaluEval (Hallucination detection in QA)
- HalluQA (Chinese hallucination benchmark)
Metrics:
- Accuracy
- F1 Score
- Natural Language Inference (NLI) metric (for mitigation)
- Statistical methodology: Not explicitly reported in the paper
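As a reminder of how the two detection metrics are computed, here is a minimal sketch for binary hallucination labels. It assumes that label `1` means "hallucinated" and that F1 is computed for that positive class; the summary does not say which class is treated as positive or how F1 is averaged, so both choices are assumptions, as are the function names and the toy data.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of examples whose predicted label matches the gold label."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def f1_score(y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1) -> float:
    """F1 for one class (assumed here: 1 = hallucinated)."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels for illustration only; these are not from any benchmark.
gold = [1, 0, 1, 1, 0, 0]
pred = [1, 0, 0, 1, 1, 0]
print(f"accuracy = {accuracy(gold, pred):.4f}")
print(f"F1       = {f1_score(gold, pred):.4f}")
```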
Key Results
Zero-shot performance comparisons on standard hallucination benchmarks show ANAH-v2 surpassing larger models.

| Benchmark | Metric | Baseline | This Paper | Δ |
| --- | --- | --- | --- | --- |
| HaluEval | Accuracy | 73.34 | 81.54 | +8.20 |
| HalluQA | Accuracy | 82.47 | 94.44 | +11.97 |
| ANAH (In-domain) | Accuracy | 73.40 | 89.24 | +15.84 |

Hallucination mitigation results using the annotator as a re-ranker.

| Benchmark | Metric | Baseline | This Paper | Δ |
| --- | --- | --- | --- | --- |
| HaluEval | NLI (Natural Language Inference) | 25 | 37 | +12 |
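The Δ column is the absolute difference, in points, between the "This Paper" and "Baseline" columns (not a relative improvement). The short check below recomputes it from the values reported above; the variable and function names are illustrative only.

```python
import math

# (benchmark, metric, baseline, this_paper, reported_delta), copied from the table above.
rows = [
    ("HaluEval", "Accuracy", 73.34, 81.54, 8.20),
    ("HalluQA", "Accuracy", 82.47, 94.44, 11.97),
    ("ANAH (In-domain)", "Accuracy", 73.40, 89.24, 15.84),
    ("HaluEval", "NLI", 25.0, 37.0, 12.0),
]


def delta(baseline: float, ours: float) -> float:
    """Absolute gain in points, rounded to two decimals as in the table."""
    return round(ours - baseline, 2)


for name, metric, baseline, ours, reported in rows:
    computed = delta(baseline, ours)
    # Compare with a tolerance to sidestep floating-point representation noise.
    assert math.isclose(computed, reported, abs_tol=1e-9), (name, metric)
    print(f"{name:18s} {metric:8s} {computed:+.2f}")
```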
Main Takeaways
- Iterative self-training effectively scales dataset size and quality simultaneously without human intervention beyond the seed set
- A specialized 7B model can outperform GPT-4 on specific fine-grained annotation tasks when trained on high-quality, self-generated data
- Decomposing the annotation task into analytical steps (Fact -> Reference -> Type) improves accuracy compared to direct classification
- The resulting annotator serves as an effective reward model for mitigation strategies like re-ranking
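The re-ranking idea in the last takeaway can be sketched as a best-of-n selection: generate several candidate responses, score each with the annotator, and keep the one judged least hallucinated. This is a minimal sketch under stated assumptions, not the paper's implementation. The `hallucination_score` callable is a hypothetical stand-in for the annotator, assumed to return a number where lower means fewer hallucinations, and the toy scores are invented for illustration.

```python
from typing import Callable, Sequence


def rerank_best_of_n(
    candidates: Sequence[str],
    hallucination_score: Callable[[str], float],
) -> str:
    """Return the candidate with the lowest hallucination score.

    `hallucination_score` is a hypothetical interface to the annotator:
    it maps a response to a float, lower meaning fewer hallucinations.
    """
    if not candidates:
        raise ValueError("need at least one candidate response")
    # min() keeps the first candidate when scores tie.
    return min(candidates, key=hallucination_score)


# Toy stand-in for the annotator: a fixed lookup table of made-up scores.
toy_scores = {
    "The Eiffel Tower is in Berlin.": 0.9,
    "The Eiffel Tower is in Paris.": 0.1,
    "The Eiffel Tower is in Paris, built in 1700.": 0.5,
}

best = rerank_best_of_n(list(toy_scores), toy_scores.__getitem__)
print(best)
```

In practice the scorer would wrap a call to the trained annotator model; how its sentence-level judgments are aggregated into one score per response is a design choice this summary does not specify.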