| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Zero-shot performance comparisons on standard hallucination benchmarks show ANAH-v2 surpassing larger models. | | | | |
| HaluEval | Accuracy | 73.34 | 81.54 | +8.20 |
| HalluQA | Accuracy | 82.47 | 94.44 | +11.97 |
| ANAH (In-domain) | Accuracy | 73.40 | 89.24 | +15.84 |
| Hallucination mitigation results using the annotator as a re-ranker. | | | | |
| HaluEval | NLI (Natural Language Inference) | 25 | 37 | +12 |
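
The mitigation row describes using the annotator as a re-ranker. The table does not specify the procedure, so the following is only a minimal sketch of one common reading: best-of-N selection, where several candidate answers are scored and the one judged least hallucinated is kept. The names `rerank_best_of_n`, `toy_hallucination_score`, and `SUPPORTED` are hypothetical, and the toy scorer is a stand-in for an annotator model, not the paper's implementation.

```python
from typing import Callable, Sequence

# Hypothetical reference facts used only by the toy scorer below.
SUPPORTED = {"Paris is the capital of France", "France is in Europe"}


def toy_hallucination_score(answer: str) -> float:
    """Toy stand-in for an annotator: fraction of sentences not in SUPPORTED."""
    sentences = [s.strip() for s in answer.split(".") if s.strip()]
    if not sentences:
        return 1.0  # treat an empty answer as fully unsupported
    unsupported = sum(1 for s in sentences if s not in SUPPORTED)
    return unsupported / len(sentences)


def rerank_best_of_n(
    candidates: Sequence[str],
    hallucination_score: Callable[[str], float],
) -> str:
    """Return the candidate with the lowest hallucination score."""
    if not candidates:
        raise ValueError("candidates must be non-empty")
    return min(candidates, key=hallucination_score)


candidates = [
    "Paris is the capital of France. France is in Asia.",
    "Paris is the capital of France. France is in Europe.",
    "Lyon is the capital of France.",
]
best = rerank_best_of_n(candidates, toy_hallucination_score)
print(best)
```

In a real setup the scoring callable would wrap a call to the annotator model, and the candidates would be sampled from the generator being evaluated.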