Evaluation Setup
The system is validated on 78 complex cases from India's Civil Services examination, each requiring factual accuracy and causal consistency.
Benchmarks:
- Civil Services Exam Dataset (Complex factual and causal reasoning QA) [New]
Metrics:
- Precision
- Inter-model agreement (Cohen's Kappa)
- Statistical methodology: 95% confidence intervals and p-values for improvement over the baseline.
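The agreement and interval metrics listed above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names `cohens_kappa` and `wilson_interval` are hypothetical, and the use of the Wilson score interval is an assumption, since the summary does not say which method produced the 95% confidence intervals.

```python
import math
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items on which both raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%).

    Assumption: the source does not name its interval method.
    """
    p = successes / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half
```

With only 78 cases, an interval such as Wilson's tends to behave better than the plain normal approximation when the proportion is close to 1, which is why it is used in this sketch.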
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Civil Services Exam Dataset | Precision | 73.1 | 93.9 | +20.8 |
| Civil Services Exam Dataset | Precision | 73.1 | 95.6 | +22.5 |
| Civil Services Exam Dataset | Precision | 73.1 | 86.9 | +13.8 |
Main Takeaways
- Requiring unanimous consensus dramatically improves precision (from 73.1% to 95.6%) by filtering out hallucinations where models disagree.
- There is a trade-off between precision and recall; the system is conservative, prioritizing error avoidance (2 false positives vs 19 false negatives in the 3-model setup).
- High inter-model agreement (Kappa > 0.76) suggests models generally converge on truth but maintain enough independence to catch errors.
- Diminishing returns are observed between the 2-model and 3-model configurations (a 1.7 percentage-point improvement, p=0.265, which is not statistically significant at the 0.05 level).
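The consensus rule and the precision/recall trade-off in the takeaways above can be illustrated with a short sketch. The helper names `unanimous_accept` and `precision_recall` are hypothetical, and the accept-only-if-all-agree rule is one plausible reading of "unanimous consensus". The false-positive and false-negative counts come from the summary; the true-positive count of 43 is an assumption, chosen because 43 / (43 + 2) rounds to the reported 95.6% precision.

```python
def unanimous_accept(votes):
    """Accept a case only when every model votes to accept it.

    Assumed reading of the consensus rule; any disagreement means rejection.
    """
    return len(votes) > 0 and all(votes)


def precision_recall(tp, fp, fn):
    """Precision and recall from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# fp=2 and fn=19 are reported for the 3-model setup.
# tp=43 is an assumption inferred from the reported 95.6% precision.
precision, recall = precision_recall(tp=43, fp=2, fn=19)
print(f"precision={precision:.1%}, recall={recall:.1%}")
# prints: precision=95.6%, recall=69.4%
```

Under these assumed counts, recall would be roughly 69%, which makes the conservative behavior concrete: the unanimous rule rarely accepts a wrong case, at the cost of rejecting a sizable share of correct ones.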