Evaluation Setup
The evaluation covers generative tasks in three areas: fact-checking, reading comprehension, and commonsense reasoning.
Benchmarks:
- FEVER (Fact Verification)
- Hover (Multi-hop Fact Verification)
- QuAC (Conversational Question Answering)
- CommonsenseQA (Commonsense Reasoning)
Metrics:
- Accuracy
- Statistical methodology: Not explicitly reported in the paper
Key Results
Main results comparing CFMAD against baselines on fact verification and reasoning datasets.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Hover | Accuracy | 69.1 | 82.2 | +13.1 |
| FEVER | Accuracy | 68.6 | 82.4 | +13.8 |
| CommonsenseQA | Accuracy | 71.0 | 79.2 | +8.2 |
| QuAC | Accuracy | 51.4 | 63.8 | +12.4 |

Ablation studies validating the necessity of counterfactual reasoning and debate.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| CommonsenseQA | Accuracy | 69.5 | 79.2 | +9.7 |
| CommonsenseQA | Accuracy | 73.4 | 79.2 | +5.8 |
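
The Δ column is the absolute gain in accuracy points (This Paper minus Baseline). As a quick sanity check, the sketch below recomputes each Δ from the main-results values and flags any row where the recomputed value disagrees with the reported one. The names `MAIN_RESULTS`, `accuracy_delta`, and `check_deltas` are illustrative and do not come from the paper.

```python
# Rows copied from the main-results table:
# (benchmark, baseline accuracy, CFMAD accuracy, reported delta)
MAIN_RESULTS = [
    ("Hover", 69.1, 82.2, 13.1),
    ("FEVER", 68.6, 82.4, 13.8),
    ("CommonsenseQA", 71.0, 79.2, 8.2),
    ("QuAC", 51.4, 63.8, 12.4),
]


def accuracy_delta(baseline, ours):
    """Absolute gain in accuracy points, rounded to one decimal place."""
    return round(ours - baseline, 1)


def check_deltas(rows):
    """Return the benchmarks whose recomputed delta differs from the reported one."""
    return [
        name
        for name, baseline, ours, reported in rows
        if accuracy_delta(baseline, ours) != reported
    ]


if __name__ == "__main__":
    mismatches = check_deltas(MAIN_RESULTS)
    print("mismatched rows:", mismatches)
```

The same helper applies to the two ablation rows, where the Baseline column holds the score of the ablated variant.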
Main Takeaways
- CFMAD effectively mitigates the overconfidence issue by forcing the exploration of counterfactuals.
- The debate mechanism is crucial; generated abductions for incorrect answers can be plausible, and the critic helps expose their flaws.
- Performance improves with the number of debate rounds, generally saturating around 2-3 rounds.
- The method is robust across different backbone models (validated on Llama-2-70b-chat as well).
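
The takeaways above imply a control flow along these lines: for each candidate answer, generate a justification that assumes the answer is correct, let a critic challenge it for a fixed number of rounds, and have a judge pick the final answer. The following is a minimal sketch of that loop under stated assumptions, not the paper's implementation. The function `counterfactual_debate` and the callables `abduce`, `critique`, and `judge` are hypothetical placeholders for model calls, and the default of two rounds is only illustrative.

```python
from typing import Callable, Dict, List


def counterfactual_debate(
    question: str,
    candidates: List[str],
    abduce: Callable[[str, str], str],
    critique: Callable[[str, str, List[str]], str],
    judge: Callable[[str, Dict[str, List[str]]], str],
    rounds: int = 2,
) -> str:
    """Hypothetical control flow: build one debate transcript per candidate answer."""
    transcripts: Dict[str, List[str]] = {}
    for answer in candidates:
        # Counterfactual step: argue as if this candidate were the correct answer.
        transcript = [abduce(question, answer)]
        for _ in range(rounds):
            # Critic step: challenge the justification given the transcript so far.
            transcript.append(critique(question, answer, transcript))
        transcripts[answer] = transcript
    # Judge step: choose a final answer after seeing every transcript.
    return judge(question, transcripts)
```

Passing the model calls in as plain callables keeps the loop independent of any particular backbone, so the number of rounds can be varied without touching the rest of the code.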