Evaluation Setup
Models are trained on the MATH dataset and evaluated on math benchmarks (in-domain) and on code and instruction-following benchmarks (out-of-domain).
Benchmarks:
- MATH (Mathematical Reasoning)
- GSM8K (Grade School Math)
- LiveCodeBench (LCB) (Code Generation)
- CRUXEval-O (Code Reasoning)
- AlpacaEval 2.0 (Instruction Following)
Metrics:
- Accuracy (Pass@1)
- Length-Controlled Win Rate (AlpacaEval 2.0)
- Statistical methodology: Mann-Whitney U tests are used to compare self-certainty distributions.
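The Mann-Whitney U comparison mentioned above can be reproduced with SciPy's `mannwhitneyu`; the score arrays below are illustrative stand-ins, not the paper's data.

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Hypothetical per-response self-certainty scores for two model variants
# (illustrative values only, not the paper's measurements).
rng = np.random.default_rng(0)
intuitor_scores = rng.normal(loc=1.2, scale=0.3, size=200)
baseline_scores = rng.normal(loc=1.0, scale=0.3, size=200)

# Two-sided Mann-Whitney U test: do the two score distributions differ?
stat, p_value = mannwhitneyu(intuitor_scores, baseline_scores,
                             alternative="two-sided")
print(f"U = {stat:.1f}, p = {p_value:.4g}")
```

Being rank-based, the test makes no normality assumption about the self-certainty scores, which is why it suits reward-distribution comparisons.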
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
| --- | --- | --- | --- | --- |
| LiveCodeBench | Accuracy | 0.0 | 9.9 | +9.9 |

Intuitor enables models to learn non-trivial reasoning capabilities from scratch (or near-scratch) without ground truth, significantly outperforming baselines on transfer tasks.
Main Takeaways
- Intrinsic 'self-certainty' rewards are sufficient to drive learning of complex reasoning behaviors, matching supervised RL on in-domain math tasks.
- Intuitor generalizes significantly better than outcome-based RL (GRPO) to out-of-domain tasks (Code Generation), likely because it rewards the reasoning process (confidence) rather than just the final answer.
- Qualitative analysis shows emergent 'pre-reasoning' behaviors: models trained with Intuitor spontaneously generate detailed natural language explanations before writing code to increase their own confidence.
- Online reward computation is critical; offline rewards lead to reward hacking where the model generates gibberish to inflate confidence scores.
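The intrinsic signal behind these takeaways can be sketched numerically. One common formulation of self-certainty, assumed here, is the average KL divergence from the uniform distribution to the model's next-token distribution; the function name and shapes are illustrative, not the paper's implementation.

```python
import numpy as np

def self_certainty(logits: np.ndarray) -> float:
    """Mean KL(Uniform || p) over a sequence of next-token distributions.

    `logits` has shape (seq_len, vocab_size). A peaked (confident)
    distribution scores high; a uniform one scores zero. The exact
    formula is an assumption for illustration.
    """
    vocab = logits.shape[-1]
    # Numerically stable softmax over the vocabulary axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    p = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    # KL(U || p) = sum_v (1/V) log((1/V)/p_v) = -log(V) - mean(log p_v)
    kl = -np.log(vocab) - np.log(p + 1e-12).mean(axis=-1)
    return float(kl.mean())

flat = np.zeros((4, 10))            # uniform predictions -> certainty ~ 0
peaked = np.zeros((4, 10))
peaked[:, 0] = 10.0                 # confident predictions -> high certainty
print(self_certainty(flat), self_certainty(peaked))
```

Because this reward depends only on the policy's own output distribution, it can be computed online at every step, which is the property the last takeaway identifies as critical for avoiding reward hacking.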