| Benchmark | Metric | Baseline | This Paper | Δ (This Paper − Baseline) |
|---|---|---|---|---|
| **Main comparison (Qwen2.5-7B-Instruct):** CoBA-RL consistently outperforms the GRPO and Knapsack-RL baselines. | | | | |
| Average (5 benchmarks) | avg@16 accuracy | 42.24 | 46.78 | +4.54 |
| AIME25 | avg@16 accuracy | 12.71 | 18.33 | +5.62 |
| OlympiadBench | avg@16 accuracy | 41.33 | 43.11 | +1.78 |
| AMC23 | avg@16 accuracy | Not reported | Not reported | +6.72 |
| **Computational efficiency:** the proposed Heap-Based Greedy strategy (This Paper) vs. a Dynamic Programming baseline. | | | | |
| Runtime simulation | Execution time (s) | 115.05 | 0.124 | -114.926 (≈928× faster) |
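The runtime row contrasts a heap-based greedy allocator with dynamic programming but does not state the allocation problem itself. As a minimal sketch, assuming the task is to split a total rollout budget across prompts, where each prompt has a concave (diminishing-returns) value function of its budget and per-prompt bounds `b_min`/`b_max`, the code below implements both approaches. Everything here is hypothetical and for illustration only: `make_value`, `greedy_allocate`, `dp_allocate`, the value formula, the budget of 32, and the bounds of 2 to 16 are not taken from the paper. Under the concavity assumption, greedy marginal allocation with a max-heap runs in O(n + B log n) and returns an optimal allocation. The exact DP enumerates every (prompt, budget used, per-prompt size) combination in O(n · B · (b_max − b_min + 1)).

```python
import heapq
from typing import Callable, List, Sequence

# Hypothetical: a value function maps a per-prompt rollout budget b to a utility.
ValueFn = Callable[[int], float]


def make_value(w: float, q: float) -> ValueFn:
    """Hypothetical concave value w * (1 - (1 - q)^b): each extra rollout helps less."""
    return lambda b: w * (1.0 - (1.0 - q) ** b)


def greedy_allocate(values: Sequence[ValueFn], budget: int, b_min: int, b_max: int) -> List[int]:
    """Max-heap greedy: give one rollout at a time to the prompt with the largest
    marginal gain. O(n + budget * log n); optimal when every value function is concave."""
    n = len(values)
    if not n * b_min <= budget <= n * b_max:
        raise ValueError("budget is infeasible for the per-prompt bounds")
    alloc = [b_min] * n
    remaining = budget - n * b_min
    # heapq is a min-heap, so marginal gains are stored negated to pop the largest first.
    heap = [(-(v(b_min + 1) - v(b_min)), i) for i, v in enumerate(values) if b_min < b_max]
    heapq.heapify(heap)
    while remaining > 0:
        _, i = heapq.heappop(heap)
        alloc[i] += 1
        remaining -= 1
        if alloc[i] < b_max:  # re-insert with the next marginal gain unless capped
            v = values[i]
            heapq.heappush(heap, (-(v(alloc[i] + 1) - v(alloc[i])), i))
    return alloc


def dp_allocate(values: Sequence[ValueFn], budget: int, b_min: int, b_max: int) -> List[int]:
    """Exact DP over (prompt, rollouts used so far): O(n * budget * (b_max - b_min + 1))."""
    neg = float("-inf")
    best = [0.0] + [neg] * budget  # best[c]: max total value using exactly c rollouts
    choices: List[List[int]] = []
    for v in values:
        new = [neg] * (budget + 1)
        arg = [-1] * (budget + 1)  # arg[c]: rollouts given to this prompt at total c
        for c, base in enumerate(best):
            if base == neg:
                continue
            for b in range(b_min, min(b_max, budget - c) + 1):
                if base + v(b) > new[c + b]:
                    new[c + b] = base + v(b)
                    arg[c + b] = b
        best = new
        choices.append(arg)
    if best[budget] == neg:
        raise ValueError("budget is infeasible for the per-prompt bounds")
    alloc, c = [], budget
    for arg in reversed(choices):  # backtrack from the last prompt to the first
        alloc.append(arg[c])
        c -= arg[c]
    return alloc[::-1]


def total(values: Sequence[ValueFn], alloc: Sequence[int]) -> float:
    return sum(v(b) for v, b in zip(values, alloc))


# Four hypothetical prompts, a budget of 32 rollouts, and 2..16 rollouts per prompt.
values = [make_value(w, q) for w, q in [(1.0, 0.5), (2.0, 0.1), (0.5, 0.9), (1.5, 0.3)]]
g = greedy_allocate(values, budget=32, b_min=2, b_max=16)
d = dp_allocate(values, budget=32, b_min=2, b_max=16)
print(g, d, abs(total(values, g) - total(values, d)) < 1e-9)
```

The two allocations can differ when marginal gains tie, so the sketch compares total value rather than the allocation vectors. The greedy version performs one heap operation per rollout, while the DP revisits every reachable budget for every prompt. That asymptotic gap is consistent with the direction of the runtime row, although the actual speedup depends on problem sizes the table does not give.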