Evaluation Setup
Post-training of reasoning LLMs on math and code generation tasks
Benchmarks:
- AIME 2025 (Competition Math)
- HMMT 2025 (Feb/Nov) (Competition Math)
- LiveCodeBench v5 (Code Generation)
Metrics:
- Pass@k (k=1 to 256)
- Statistical methodology: unbiased Pass@k estimator (Chen et al., 2021), computed from 10 (math) or 20 (code) independent rollouts per problem
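The unbiased Pass@k estimator from Chen et al. (2021) can be sketched as follows; function and variable names here are illustrative, not from the paper:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimator (Chen et al., 2021).

    n: total independent rollouts sampled for a problem
    c: number of rollouts that passed
    k: evaluation budget (requires k <= n)

    Pass@k = 1 - C(n - c, k) / C(n, k), i.e. one minus the
    probability that a random size-k subset contains no pass.
    """
    if n - c < k:
        return 1.0  # every size-k subset must contain a passing rollout
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 2 rollouts of which 1 passes, `pass_at_k(2, 1, 1)` gives 0.5, matching the naive success rate at k=1; the estimator's advantage appears for k > 1, where averaging over all subsets reduces variance.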
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| LiveCodeBench v5 | Training Generations | 3.0 | 1.0 | -2.0 |
| LiveCodeBench v5 | Max Policy Lag (Gradient Steps) | 4 | 400 | +396 |
Main Takeaways
- OAPL outperforms GRPO with Importance Sampling across multiple math competition benchmarks (AIME, HMMT, BRUMO).
- The method enables stable training even when the inference policy is significantly lagged (up to 400 gradient updates behind), allowing for highly asynchronous and efficient architectures.
- Unlike GRPO, OAPL does not suffer from entropy collapse and shows better test-time scaling, with Pass@k improvements holding up to k=256.
- Being strictly on-policy is not necessary for effective RL post-training of reasoning models.
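Entropy collapse, mentioned above, means the policy's next-token distribution becomes increasingly peaked during training. A common way to watch for it is to track the mean per-token entropy of sampled rollouts; a minimal pure-Python sketch (helper names are illustrative, not from the paper):

```python
import math

def token_entropy(probs: list[float]) -> float:
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def mean_policy_entropy(step_distributions: list[list[float]]) -> float:
    """Average per-token entropy across a rollout's sampled positions.

    Logged over training, a steady drop toward 0 indicates the policy
    is collapsing onto a few tokens (entropy collapse).
    """
    entropies = [token_entropy(p) for p in step_distributions]
    return sum(entropies) / len(entropies)
```

A uniform distribution over V tokens gives the maximum entropy log(V), while a fully peaked distribution gives 0; in practice this is computed from the model's logits in the training framework rather than from explicit probability lists.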