| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Average across 4 tasks (QuaRel, StrategyQA, OpenBookQA, QASC) | Accuracy improvement | Not reported in the paper | Not reported in the paper | +2% to +3% |
| Average across tasks | Robustness improvement (faithfulness) | Not reported in the paper | Not reported in the paper | +4.5% |
| Out-of-distribution test sets | Accuracy improvement | Not reported in the paper | Not reported in the paper | +2.6% |