Evaluation Setup
Task: Mathematical reasoning on standard benchmarks
Benchmarks:
- MATH500 (Mathematical Problem Solving)
- AIME24 (Mathematical Competition)
- AMC23 (Mathematical Competition)
- Minerva (Mathematical Reasoning)
Metrics:
- Accuracy (Pass@1)
- Token-Length (Average tokens)
- ACT (Accuracy Contribution per hundred Tokens)
- Self-reflection keyword frequency
- Statistical methodology: Not explicitly reported in the paper
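The ACT metric ties the first two metrics together. A minimal sketch of how it could be computed, assuming ACT = accuracy per hundred generated tokens (our reading of the name; the paper's exact formula is not quoted here):

```python
def act(accuracy: float, avg_tokens: float) -> float:
    """Accuracy Contribution per hundred Tokens (ACT).

    Assumed definition: accuracy earned per 100 generated tokens,
    i.e. a token-efficiency score. Higher is better.
    """
    return accuracy / avg_tokens * 100.0

# A shorter chain-of-thought at slightly lower accuracy can still
# score higher on ACT than a longer, marginally more accurate one:
print(act(0.80, 4000))  # 0.02
print(act(0.78, 2000))  # 0.039
```

Under this reading, ACT rewards compressing the reasoning chain as long as accuracy does not drop proportionally.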
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Training Set (MATH-Lighteval subset) | Data Utilization Rate | Inefficient (<100%) | 100% | Positive |
| Not specified | Performance Balance | Longer responses | Optimal (alpha=0.01) | Improved |
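The alpha in the table acts as a length penalty balancing accuracy against chain-of-thought length. One common way such a trade-off is implemented is a reward of correctness minus a length term; the sketch below illustrates that generic shape (the per-100-token scaling and function name are our assumptions, not the paper's exact objective):

```python
def length_penalized_reward(correct: bool, num_tokens: int,
                            alpha: float = 0.01) -> float:
    """Generic length-penalized reward (illustrative, assumed form).

    alpha = 0    -> no compression pressure: redundant reasoning survives.
    alpha large  -> aggressive compression: accuracy suffers.
    alpha = 0.01 -> the balance point reported in the table.
    """
    return float(correct) - alpha * (num_tokens / 100.0)

# A correct 4000-token answer is rewarded less than a correct
# 1500-token answer, pushing the policy toward shorter reasoning.
print(length_penalized_reward(True, 4000))  # 0.6
print(length_penalized_reward(True, 1500))  # 0.85
```

With alpha = 0 the length term vanishes, matching the takeaway below that no compression retains redundant reasoning.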
Main Takeaways
- FGO achieves 100% data utilization in training, resolving a key inefficiency in GRPO where identical group rewards lead to zero learning signal.
- The method successfully mitigates entropy collapse, maintaining higher trajectory-level entropy during training compared to baselines.
- The hyperparameter alpha=0.01 effectively balances the trade-off between chain-of-thought length and accuracy; aggressive compression (high alpha) hurts performance, while no compression (alpha=0) retains redundant reasoning.
- Self-reflection capabilities (measured by keywords like 'wait', 'alternatively') are preserved even as the reasoning chain is compressed.
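The zero-learning-signal inefficiency in GRPO can be seen directly from its group-normalized advantage: when every sampled response in a group gets the same reward (all correct or all wrong), the normalized advantages are all zero, so that prompt contributes no gradient. A minimal sketch of the standard GRPO advantage (the `eps` stabilizer is a common implementation detail, not from the paper):

```python
import statistics

def grpo_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Group-normalized advantages: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Identical group rewards -> all-zero advantages -> zero learning
# signal; this is the wasted data that FGO's 100% utilization fixes.
print(grpo_advantages([1.0, 1.0, 1.0, 1.0]))  # [0.0, 0.0, 0.0, 0.0]

# A mixed group yields nonzero advantages, so it does drive updates.
print(grpo_advantages([1.0, 0.0, 1.0, 0.0]))
```

Note this sketch only reproduces the baseline failure mode; the paper's FGO objective itself is not reconstructed here.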