| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Main comparison results showing RAFT's superior performance against baselines across multiple datasets. | | | | |
| HotpotQA | Accuracy | 28.18 | 63.43 | +35.25 |
| HotpotQA | Accuracy | 32.56 | 63.43 | +30.87 |
| Torch Hub | Accuracy | 20.59 | 96.94 | +76.35 |
| HuggingFace | Accuracy | 43.18 | 74.59 | +31.41 |
| HotpotQA | Accuracy | 57.78 | 63.43 | +5.65 |
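The Δ column appears to be the "This Paper" score minus the "Baseline" score. Assuming that definition, the table's arithmetic can be re-derived with a short sketch. The `rows` list and the `delta` helper are illustrative names, not from the source; the numbers are copied from the table as-is, and the baselines behind the three HotpotQA rows are not identified here.

```python
from decimal import Decimal

# (benchmark, baseline, this_paper, reported_delta), copied from the table above.
rows = [
    ("HotpotQA", "28.18", "63.43", "+35.25"),
    ("HotpotQA", "32.56", "63.43", "+30.87"),
    ("Torch Hub", "20.59", "96.94", "+76.35"),
    ("HuggingFace", "43.18", "74.59", "+31.41"),
    ("HotpotQA", "57.78", "63.43", "+5.65"),
]


def delta(baseline: str, this_paper: str) -> str:
    """Return this_paper - baseline as a signed two-decimal string.

    Assumes the table's delta is a plain difference of the two scores.
    Decimal avoids binary floating-point rounding noise.
    """
    diff = Decimal(this_paper) - Decimal(baseline)
    return f"{diff:+.2f}"


# Collect any row whose reported delta disagrees with the recomputed one.
mismatches = [
    (name, reported, delta(base, ours))
    for name, base, ours, reported in rows
    if delta(base, ours) != reported
]

if __name__ == "__main__":
    for name, base, ours, reported in rows:
        print(f"{name:12s} {base:>6s} -> {ours:>6s}  delta {delta(base, ours)}")
    print("mismatches:", mismatches)
```

Under that assumption, an empty `mismatches` list means every reported Δ agrees with the difference of the two score columns.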