| Benchmark | Metric | Baseline | This Paper | Δ (This Paper − Baseline) |
|---|---|---|---|---|
| Comparison against the standard RL (outcome supervision) baseline across multiple tasks, using Llama2-7B. | | | | |
| Average (8 tasks) | Accuracy | 45.0 | 49.1 | +4.1 |
| GSM8K | Accuracy | 42.5 | 46.7 | +4.2 |
| Program-based reasoning results on GSM8K (the model generates code to solve math problems). | | | | |
| GSM8K | Accuracy | 48.2 | 59.6 | +11.4 |
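
The Δ column can be checked as the difference between the "This Paper" and "Baseline" columns. Below is a minimal sketch of that check, using only the numbers in the table; the names `ROWS` and `delta` are hypothetical and chosen for illustration, and rounding to one decimal place is an assumption made to match the table's precision.

```python
# Rows copied from the table above: (setting, benchmark, baseline, this_paper).
# The "setting" labels are shorthand for the two caption rows, not the paper's terms.
ROWS = [
    ("outcome-supervision comparison", "Average (8 tasks)", 45.0, 49.1),
    ("outcome-supervision comparison", "GSM8K", 42.5, 46.7),
    ("program-based reasoning", "GSM8K", 48.2, 59.6),
]


def delta(baseline: float, this_paper: float) -> float:
    """Absolute accuracy difference, rounded to one decimal like the table."""
    return round(this_paper - baseline, 1)


if __name__ == "__main__":
    for setting, benchmark, baseline, this_paper in ROWS:
        print(f"{setting:32s} {benchmark:18s} {delta(baseline, this_paper):+.1f}")
```

The rounding step matters because binary floating-point subtraction of these values does not land exactly on one decimal place.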