Evaluation Setup
The method is evaluated on three generative tasks: knowledge-based QA, text summarization, and dialogue generation.
Benchmarks:
- HaluEval (Hallucination Evaluation; QA, Dialogue, and Summarization subsets)
Metrics:
- Accuracy
- Precision
- Recall
- F1 Score
Statistical methodology: Not explicitly reported in the paper.
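The four metrics listed above can all be derived from the same confusion counts. The following is a minimal sketch, not the paper's evaluation code: it assumes binary labels where `True` means "hallucinated" (the positive class), and the function name `detection_metrics` is hypothetical.

```python
def detection_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for binary detection.

    Assumption: True marks a hallucinated sample (the positive class).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)          # hallucination correctly flagged
    fp = sum(1 for t, p in pairs if not t and p)      # faithful output wrongly flagged
    fn = sum(1 for t, p in pairs if t and not p)      # hallucination missed
    tn = sum(1 for t, p in pairs if not t and not p)  # faithful output correctly passed

    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Toy labels for illustration only; these are not HaluEval data.
    gold = [True, True, False, False, True]
    pred = [True, False, True, False, True]
    print(detection_metrics(gold, pred))
```

Under this convention, precision falls as false positives rise, so a precision gain means fewer faithful outputs are wrongly flagged as hallucinated.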
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Main comparison on HaluEval-QA task showing improvement over baselines. | | | | |
| HaluEval-QA | Accuracy | 77.0 | 85.6 | +8.6 |
| HaluEval-QA | Precision | 82.9 | 95.6 | +12.7 |
| Main comparison on HaluEval-Dialogue task. | | | | |
| HaluEval-Dialogue | Accuracy | 84.4 | 89.2 | +4.8 |
| Main comparison on HaluEval-Summarization task. | | | | |
| HaluEval-Summarization | Accuracy | 86.6 | 89.8 | +3.2 |
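The Δ column is the "This Paper" score minus the "Baseline" score. As a quick sanity check, the sketch below recomputes each difference from the values in the table; the names `ROWS`, `recompute_deltas`, and `mismatches` are hypothetical, and rounding to one decimal place is an assumption that matches the table's precision.

```python
# (benchmark, metric, baseline, this_paper, reported_delta), copied from the table.
ROWS = [
    ("HaluEval-QA", "Accuracy", 77.0, 85.6, 8.6),
    ("HaluEval-QA", "Precision", 82.9, 95.6, 12.7),
    ("HaluEval-Dialogue", "Accuracy", 84.4, 89.2, 4.8),
    ("HaluEval-Summarization", "Accuracy", 86.6, 89.8, 3.2),
]


def recompute_deltas(rows):
    """Return (benchmark, metric, delta), with delta rounded to one decimal."""
    return [(bench, metric, round(ours - base, 1))
            for bench, metric, base, ours, _ in rows]


def mismatches(rows):
    """Return rows whose reported delta differs from the recomputed one."""
    return [(bench, metric)
            for bench, metric, base, ours, reported in rows
            if round(ours - base, 1) != reported]


if __name__ == "__main__":
    for bench, metric, delta in recompute_deltas(ROWS):
        print(f"{bench:24s} {metric:10s} {delta:+.1f}")
    print("mismatches:", mismatches(ROWS))
```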
Main Takeaways
- The Markov Chain-based debate framework consistently outperforms single-agent baselines (ChatGPT) and previous frameworks (Factool) across diverse tasks.
- The method is particularly effective at improving Precision: on HaluEval-QA it rises from 82.9 to 95.6 (+12.7), which corresponds to substantially fewer false positives in hallucination detection.
- Ablation results, whose specific numbers appear in the paper's appendix/analysis rather than in this summary, are said to confirm the contribution of the multi-agent structure.