Evaluation Setup
Mathematical reasoning tasks with exact match reward
Benchmarks:
- SVAMP (Verbal arithmetic/math word problems)
- Multi-digit Multiplication (3-digit arithmetic calculation) [New]
Metrics:
- Accuracy (Exact Match)
- Statistical methodology: Not explicitly reported in the paper
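An exact-match reward of the kind listed above can be sketched as follows. This is a minimal illustration, not the paper's scoring code: the summary does not say how the final answer is extracted from a completion, so the rule used here (take the last number in the text) and the helper names `extract_final_number`, `exact_match_reward`, and `accuracy` are assumptions.

```python
import re

def extract_final_number(text):
    """Return the last number appearing in `text` as a string, or None.

    Assumption: the model's final answer is the last number it writes.
    """
    matches = re.findall(r"-?\d+(?:\.\d+)?", text)
    return matches[-1] if matches else None

def exact_match_reward(completion, reference):
    """Binary reward: 1.0 if the extracted answer equals the reference, else 0.0."""
    predicted = extract_final_number(completion)
    if predicted is None:
        return 0.0
    return 1.0 if float(predicted) == float(reference) else 0.0

def accuracy(completions, references):
    """Mean exact-match reward over a set of problems."""
    rewards = [exact_match_reward(c, r) for c, r in zip(completions, references)]
    return sum(rewards) / len(rewards)

# Hypothetical completions, for illustration only.
demo_completions = ["12 apples minus 5 leaves 7", "123 * 456 = 56089"]
demo_references = ["7", "56088"]
demo_accuracy = accuracy(demo_completions, demo_references)
```

Because the reward is binary, accuracy on a benchmark is simply the mean reward over its problems.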
Key Results
Token usage efficiency statistics showing how much of the trajectory contributes to gradient updates.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Reasoning Tasks | Tokens contributing to loss (%) | 100 | 30-50 | -50 to -70 |
| Reasoning Tasks | Tokens contributing to loss (%) | 100 | <5 | > -95 |
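To make the "tokens contributing to loss" metric concrete, the sketch below computes it from per-token boolean masks and derives Δ against the full-token baseline of 100%. The masking rule used here (independent random sampling with a hypothetical `keep_prob`) is an assumption chosen for illustration; the summary does not specify how either method selects tokens.

```python
import random

def sample_token_mask(num_tokens, keep_prob, rng):
    """Mark each token as contributing to the loss with probability keep_prob.

    Assumption: independent per-token sampling, used only to produce example masks.
    """
    return [rng.random() < keep_prob for _ in range(num_tokens)]

def percent_contributing(masks):
    """Percentage of all tokens, across trajectories, that are marked True."""
    total = sum(len(m) for m in masks)
    kept = sum(sum(m) for m in masks)
    return 100.0 * kept / total

rng = random.Random(0)
# 50 hypothetical trajectories of 200 tokens each, keeping about 40% of tokens.
masks = [sample_token_mask(200, 0.4, rng) for _ in range(50)]
pct = percent_contributing(masks)
delta = pct - 100.0  # change relative to the full-token baseline (100%)
```

With every token included, `percent_contributing` returns 100 and Δ is 0, which corresponds to the Baseline column.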
Main Takeaways
- Both S-GRPO and T-SPMO raise SVAMP accuracy from 46% (base model) to over 70%, while full-token GRPO fails to improve performance under LoRA settings.
- Sparse token optimization acts as an implicit regularizer, preventing overfitting or collapse when using low-capacity adapters (LoRA).
- T-SPMO enables extremely sparse updates (<5% of tokens) by focusing on pivotal transitions identified via prefix tries.
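One way to picture how a prefix trie can surface a small set of pivotal transitions is sketched below. This is an illustrative sketch under stated assumptions, not the paper's algorithm: it assumes a transition is "pivotal" when sampled completions sharing a prefix diverge at the next token, and it scores each such transition by the mean reward of completions that take it minus the mean reward of all completions sharing the prefix. `TrieNode`, `build_trie`, and `branching_transitions` are hypothetical names.

```python
class TrieNode:
    def __init__(self):
        self.children = {}     # next token -> TrieNode
        self.count = 0         # number of completions passing through this node
        self.reward_sum = 0.0  # total reward of those completions

def build_trie(completions, rewards):
    """Insert every tokenized completion into a shared prefix trie."""
    root = TrieNode()
    for tokens, reward in zip(completions, rewards):
        node = root
        node.count += 1
        node.reward_sum += reward
        for tok in tokens:
            node = node.children.setdefault(tok, TrieNode())
            node.count += 1
            node.reward_sum += reward
    return root

def branching_transitions(root):
    """Return (prefix, token, score) for each token chosen at a branching node.

    Assumption: score = mean reward after taking the token minus mean reward
    of all completions that share the prefix.
    """
    results = []
    stack = [((), root)]
    while stack:
        prefix, node = stack.pop()
        if len(node.children) > 1:
            prefix_mean = node.reward_sum / node.count
            for tok, child in node.children.items():
                child_mean = child.reward_sum / child.count
                results.append((prefix, tok, child_mean - prefix_mean))
        for tok, child in node.children.items():
            stack.append((prefix + (tok,), child))
    return results

# Three hypothetical sampled completions for one prompt; only the first is correct.
completions = [
    ["3", "+", "4", "=", "7"],
    ["3", "+", "4", "=", "8"],
    ["3", "*", "4", "=", "12"],
]
rewards = [1.0, 0.0, 0.0]
transitions = branching_transitions(build_trie(completions, rewards))
scores = {(prefix, tok): score for prefix, tok, score in transitions}
```

In this toy example only 4 of the 15 generated tokens sit at a branching point, which shows how restricting updates to such transitions can leave most of the trajectory out of the loss.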