Evaluation Setup
Evaluation of general reasoning capabilities using models trained on UltraLogic data
Benchmarks:
- Specific benchmark names are not reported in the provided text; the evaluation targets general reasoning
Metrics:
- Success Rate
- Training Efficiency (Convergence speed)
- Statistical methodology: Not explicitly reported in the paper
Main Takeaways
- Task diversity contributes more to general reasoning capability than scaling data volume alone
- Bipolar Float Reward (BFR) outperforms binary rewards by effectively penalizing imperfect reasoning paths, leading to faster convergence
- The 'Difficulty Matching Phenomenon' confirms RL is most effective within a 'Zone of Proximal Development' where task difficulty aligns with model capacity
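
To make the last two takeaways concrete, the following is a minimal sketch, not the paper's implementation. The text above does not give the BFR formula, so the mapping here is an assumption: a grader is assumed to return a correctness score in [0, 1], a perfect answer earns +1, and any imperfect answer receives a graded penalty in [-1, 0). The difficulty-matching check is likewise hypothetical: the `low` and `high` success-rate bounds are illustrative values, not figures from the source.

```python
def binary_reward(score: float) -> float:
    """Baseline: 1.0 for a fully correct answer, 0.0 for anything else."""
    return 1.0 if score >= 1.0 else 0.0


def bipolar_float_reward(score: float) -> float:
    """Assumed BFR-style mapping (hypothetical, not taken from the paper).

    A perfect score maps to +1.0. Any imperfect score maps to a negative
    value in [-1.0, 0.0), so a nearly correct answer is still penalized,
    but less than a completely wrong one.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie in [0, 1]")
    if score >= 1.0:
        return 1.0
    return score - 1.0


def in_matching_zone(success_rate: float, low: float = 0.2, high: float = 0.8) -> bool:
    """Hypothetical difficulty filter: keep tasks the model solves sometimes,
    but not always. The default bounds are illustrative assumptions."""
    return low <= success_rate <= high


if __name__ == "__main__":
    for s in (0.0, 0.5, 0.75, 1.0):
        print(s, binary_reward(s), bipolar_float_reward(s))
```

Under a binary reward, scores of 0.0 and 0.75 are indistinguishable (both earn 0.0), whereas the assumed bipolar mapping separates them and pushes both below zero. This is one plausible reading of how penalizing imperfect reasoning paths could provide a denser training signal.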