| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Classification QA results (Balanced Dataset): SAC3-Q significantly outperforms Self-consistency (SC2) on synthetic tasks where models are confidently wrong. | ||||
| Prime Number | AUROC | 65.9 | 99.4 | +33.5 |
| Senator Search | AUROC | 56.1 | 99.7 | +43.6 |
| Classification QA results (100% Hallucinated Dataset): Accuracy at a fixed threshold shows that SAC3 stays robust when every model answer is wrong. | ||||
| Prime Number | Accuracy | 48.2 | 99.4 | +51.2 |
| Senator Search | Accuracy | 29.6 | 97.0 | +67.4 |
| Open-domain Generation QA: SAC3 improves detection on realistic QA datasets, though the margins are smaller than on the synthetic tasks. | ||||
| HotpotQA-halu | AUROC | 74.2 | 88.0 | +13.8 |
| NQ-open-halu | AUROC | 70.5 | 77.2 | +6.7 |
| Model Generalization: SAC3-Q maintains superiority over Self-consistency across GPT-4 and PaLM 2. | ||||
| Senator Search | Accuracy | 18.4 | 61.6 | +43.2 |
| HotpotQA-halu | Accuracy | 75.8 | 82.8 | +7.0 |
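
The table reports AUROC for the balanced datasets but accuracy at a fixed threshold for the 100% hallucinated dataset. A minimal sketch of why the metric changes: AUROC compares scores across two classes, so it is undefined when every sample is hallucinated, while thresholded accuracy is still computable. The scores, labels, and the 0.5 threshold below are illustrative assumptions, not values or settings from the paper.

```python
def auroc(scores, labels):
    """Pairwise AUROC: the probability that a hallucinated sample (label 1)
    gets a higher score than a factual one (label 0); ties count as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        return None  # undefined when only one class is present
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def accuracy_at_threshold(scores, labels, threshold=0.5):
    """Fraction of samples classified correctly when score >= threshold
    is read as 'hallucinated'. The default threshold is an assumption."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


# Hypothetical detector scores (higher = more likely hallucinated).
balanced_scores = [0.9, 0.8, 0.3, 0.2]
balanced_labels = [1, 1, 0, 0]   # balanced: both classes present

all_halu_scores = [0.9, 0.7, 0.6, 0.4]
all_halu_labels = [1, 1, 1, 1]   # 100% hallucinated: one class only

print(auroc(balanced_scores, balanced_labels))                  # 1.0
print(auroc(all_halu_scores, all_halu_labels))                  # None
print(accuracy_at_threshold(all_halu_scores, all_halu_labels))  # 0.75
```

On the single-class data, accuracy reduces to the share of hallucinated samples whose score clears the threshold, which is why a fixed threshold has to be chosen in advance for that setting.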