Evaluation Setup
Tasks: Multi-step question answering and mathematical reasoning with tool access (search, calculator)
Benchmarks:
- HotPotQA (Multi-hop Question Answering)
- GSM8K (Mathematical Reasoning)
- BeerQA (Multi-hop QA)
- MuSiQue (Multi-hop QA)
- CofCA (Counterfactual Multi-hop QA)
Metrics:
- Accuracy (Exact Match or equivalent)
- Statistical methodology: Not explicitly reported in the paper
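To make the "Exact Match or equivalent" metric concrete, here is a minimal sketch of an exact-match accuracy scorer. The normalization steps (lowercasing, stripping punctuation and English articles) and the function names `normalize_answer`, `exact_match`, and `accuracy` are assumptions for illustration; the scoring procedure actually used in the paper may differ.

```python
import re
import string
from typing import Sequence


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace.

    This normalization is an assumption for illustration, not the paper's.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> bool:
    """True when both strings are identical after normalization."""
    return normalize_answer(prediction) == normalize_answer(reference)


def accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Percentage of predictions that exactly match their reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    hits = sum(exact_match(p, r) for p, r in zip(predictions, references))
    return 100.0 * hits / len(predictions)
```

For example, `accuracy(["42", "Paris"], ["42", "London"])` scores one of two answers as correct.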
Key Results
SWiRL outperforms standard Supervised Finetuning (SFT) and Base models across multiple datasets when trained on in-domain data.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| HotPotQA | Accuracy | 57.3 | 64.4 | +7.1 |
| GSM8K | Accuracy | 73.3 | 76.4 | +3.1 |

Cross-task generalization experiments show that training on one domain (e.g., QA) improves performance on distinct domains (e.g., Math).

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| GSM8K | Accuracy | 62.9 | 73.5 | +10.6 |
| HotPotQA | Accuracy | 51.1 | 55.8 | +4.7 |

Ablation on filtering strategies reveals that Process Filtering (step-wise soundness) is superior to Outcome Filtering (final answer correctness) for SWiRL.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| HotPotQA | Accuracy | 57.3 | 64.4 | +7.1 |
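The Δ values in the results above are absolute differences in accuracy points (This Paper minus Baseline), which is a different quantity from a relative gain expressed as a percentage of the baseline. The sketch below recomputes both from the reported baseline and result values; the helper names `absolute_delta` and `relative_gain` and the setting labels are illustrative assumptions, and the relative figures it prints are derived here rather than reported in the source.

```python
# Rows copied from the reported results: (setting, benchmark, baseline, this_paper).
# The setting labels are shorthand for this sketch.
ROWS = [
    ("in-domain", "HotPotQA", 57.3, 64.4),
    ("in-domain", "GSM8K", 73.3, 76.4),
    ("cross-task", "GSM8K", 62.9, 73.5),
    ("cross-task", "HotPotQA", 51.1, 55.8),
]


def absolute_delta(baseline: float, result: float) -> float:
    """Difference in accuracy points, rounded to one decimal place."""
    return round(result - baseline, 1)


def relative_gain(baseline: float, result: float) -> float:
    """Improvement as a percentage of the baseline, rounded to one decimal place."""
    return round(100.0 * (result - baseline) / baseline, 1)


for setting, benchmark, baseline, result in ROWS:
    print(
        f"{setting:<10} {benchmark:<9} "
        f"delta = {absolute_delta(baseline, result):+.1f} points, "
        f"relative = {relative_gain(baseline, result):+.1f}%"
    )
```

Rounding to one decimal place avoids floating-point noise such as `7.100000000000001` when subtracting the reported values.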
Main Takeaways
- Process filtering is critical for RL: Models learn best from trajectories with sound reasoning steps, even if the final outcome is incorrect. This contrasts with SFT, which requires correct outcomes.
- Strong cross-task generalization: Learning granular reasoning steps transfers between disparate tasks (e.g., Math to QA), suggesting the model learns a general 'how to reason' capability rather than just task-specific patterns.
- SWiRL generalizes to out-of-distribution datasets: Training on HotPotQA yields double-digit relative gains on BeerQA (+15.3%), MuSiQue (+11.1%), and CofCA (+14.8%).
- Model scale matters for generalization: While smaller models (2B/9B) improve on in-domain tasks, only the larger 27B model shows strong cross-domain transfer capabilities.
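The difference between process filtering and outcome filtering in the first takeaway can be sketched as two selection rules over the same set of trajectories. This is a minimal illustration under stated assumptions: the `Step` and `Trajectory` types, the `sound` flag, and both function names are hypothetical, and the per-step soundness judgments are taken as given booleans rather than produced by any particular judge.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    text: str
    sound: bool  # hypothetical per-step soundness judgment, supplied externally


@dataclass
class Trajectory:
    steps: List[Step]
    final_correct: bool  # whether the final answer matches the reference


def process_filter(trajectories: List[Trajectory]) -> List[Trajectory]:
    """Keep trajectories whose every step is judged sound, regardless of outcome."""
    return [t for t in trajectories if t.steps and all(s.sound for s in t.steps)]


def outcome_filter(trajectories: List[Trajectory]) -> List[Trajectory]:
    """Keep trajectories whose final answer is correct, regardless of step quality."""
    return [t for t in trajectories if t.final_correct]


examples = [
    # Sound reasoning, wrong final answer: kept by process filtering only.
    Trajectory([Step("search for the author", True), Step("read the result", True)], False),
    # Flawed step, right final answer: kept by outcome filtering only.
    Trajectory([Step("guess the year", False), Step("state the answer", True)], True),
    # Sound reasoning and right answer: kept by both.
    Trajectory([Step("compute 12 * 7", True)], True),
]

print(len(process_filter(examples)), len(outcome_filter(examples)))
```

The first example trajectory is the case the takeaway highlights: its steps are sound but its outcome is wrong, so only the process filter retains it as training data.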