Evaluation Setup
Online fine-tuning on math problems with no external teacher; all training data is self-generated by the model
Benchmarks:
- AIME 2024 (Math Competition)
- AIME 2025 (Math Competition)
- AMC 2023 (Math Competition)
- MATH500 (Math Problem Solving)
- OlympiadBench (Olympiad Math)
- Minerva Math (Math Reasoning)
Metrics:
- Average Accuracy across benchmarks
- Statistical methodology: Not explicitly reported in the paper
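The headline metric is an unweighted mean of per-benchmark accuracies. A minimal sketch of that computation; the accuracy values below are invented placeholders, not results from the paper:

```python
# Hypothetical per-benchmark accuracies (fraction of problems solved).
# The real values would come from evaluating the fine-tuned model.
results = {
    "AIME 2024": 0.20,
    "AIME 2025": 0.17,
    "AMC 2023": 0.55,
    "MATH500": 0.78,
    "OlympiadBench": 0.40,
    "Minerva Math": 0.35,
}

# Unweighted mean over benchmarks: the headline "Average Accuracy".
average_accuracy = sum(results.values()) / len(results)
print(f"Average accuracy: {average_accuracy:.3f}")
```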
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
| --- | --- | --- | --- | --- |
| Average (32B Model Settings) | Percentage of total gain | 100 | 20 | -80 |
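The "percentage of total gain" metric can be read as the share of the full baseline-to-RL improvement that a method recovers. A hedged arithmetic sketch; the accuracy numbers are invented for illustration only:

```python
def pct_of_total_gain(method_acc, base_acc, full_acc):
    """Fraction (in %) of the base->full improvement recovered by a method."""
    return 100.0 * (method_acc - base_acc) / (full_acc - base_acc)

# Hypothetical accuracies: base model 40%, RFT 56%, full RL pipeline 60%.
print(pct_of_total_gain(56, 40, 60))  # → 80.0
```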
Main Takeaways
- NFT consistently outperforms RFT, demonstrating that negative feedback (mistakes) contains valuable signal ignored by standard supervised learning.
- The performance gap between RFT and RL methods is largely due to supervised learning's (SL's) neglect of negative data; once that data is incorporated, SL can match RL methods such as GRPO.
- Negative feedback becomes increasingly important for larger models (32B vs 7B), suggesting that as models improve at memorization, reflection on errors becomes the new bottleneck.
- RFT remains a very strong baseline, accounting for ~80% of the possible gains in the tested configurations.
- Prioritizing harder questions (lower correctness rate) via weighting enhances performance.
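The last two takeaways can be sketched as a data-selection and weighting step: RFT keeps only correct (positive) rollouts, while an NFT-style scheme also retains incorrect ones and up-weights questions with a lower correctness rate. The function names and the specific weighting (one minus the correctness rate) are illustrative assumptions, not the paper's exact formulation:

```python
from dataclasses import dataclass

@dataclass
class Rollout:
    question_id: str
    answer: str
    correct: bool

def build_training_set(rollouts, use_negatives=True):
    """Illustrative selection: RFT drops negatives; an NFT-style scheme
    keeps them and up-weights harder questions (assumed weighting)."""
    # Per-question correctness rate over self-generated rollouts.
    totals, corrects = {}, {}
    for r in rollouts:
        totals[r.question_id] = totals.get(r.question_id, 0) + 1
        corrects[r.question_id] = corrects.get(r.question_id, 0) + int(r.correct)

    examples = []
    for r in rollouts:
        if not use_negatives and not r.correct:
            continue  # RFT: discard mistakes entirely
        rate = corrects[r.question_id] / totals[r.question_id]
        # Assumed weighting: lower correctness rate -> higher weight.
        weight = 1.0 - rate + 1e-3
        # Negative examples would enter the loss via a corrective term;
        # here we only tag them for clarity.
        examples.append((r, weight, "positive" if r.correct else "negative"))
    return examples
```

Running this with negatives enabled yields a larger, weighted training set; with negatives disabled it reduces to RFT-style filtering.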