| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Factuality benchmarks (TruthfulQA, TriviaQA, Natural Questions): gains of 1.5 to 5.0 points over the baseline on all three. | | | | |
| TruthfulQA | %True*Info | 44.0 | 48.9 | +4.9 |
| TriviaQA | Accuracy | 39.1 | 44.1 | +5.0 |
| Natural Questions | Accuracy | 11.5 | 13.0 | +1.5 |
| Reasoning benchmarks (StrategyQA, GSM8K): gains of 7.2 to 8.1 points over the baseline, including where prior methods struggled. | | | | |
| GSM8K | Accuracy | 42.8 | 50.1 | +7.3 |
| StrategyQA | Accuracy | 57.8 | 65.9 | +8.1 |
| GSM8K | Accuracy | 31.0 | 38.2 | +7.2 |