Downstream win-rates of standard PPO versus the proposed 'Length-Only PPO' (lppo) baseline. The deltas are small (at most 2 points), indicating that optimizing for length alone accounts for most of the performance gain.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| WebGPT | Win-rate vs SFT (%) | 58 | 56 | -2 |
| RLCD | Win-rate vs SFT (%) | 63 | 64 | +1 |

Reward decomposition analysis: how much of the reward increase is due strictly to length shifts versus actual quality improvements within length buckets.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| WebGPT | Non-Length Reward Gain, NRG (%) | 100 | 2 | -98 |
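
The reward decomposition described above can be illustrated with a minimal sketch. It assumes each sample is a `(length, reward)` pair, groups samples into fixed-width length buckets, and reports the share of the overall reward gain that remains when comparing only within buckets. The function name `non_length_reward_gain`, the bucket width, and the weighting by post-training sample counts are all assumptions made for illustration; the exact procedure behind the NRG figures in the table is not specified here.

```python
from collections import defaultdict


def non_length_reward_gain(before, after, bucket_width=20):
    """Split a reward gain into a within-length-bucket part and the rest.

    `before` and `after` are lists of (length, reward) pairs, e.g. sampled
    from the policy before and after RL training. The bucket width and the
    weighting scheme are illustrative assumptions, not the paper's settings.
    Returns (overall_gain, within_bucket_gain, nrg_percent).
    """

    def mean_reward(samples):
        return sum(reward for _, reward in samples) / len(samples)

    def bucket_stats(samples):
        # Map bucket index -> (mean reward, sample count).
        grouped = defaultdict(list)
        for length, reward in samples:
            grouped[length // bucket_width].append(reward)
        return {b: (sum(rs) / len(rs), len(rs)) for b, rs in grouped.items()}

    overall_gain = mean_reward(after) - mean_reward(before)
    if overall_gain == 0:
        raise ValueError("no overall reward gain to decompose")

    stats_before = bucket_stats(before)
    stats_after = bucket_stats(after)
    shared = stats_before.keys() & stats_after.keys()
    if not shared:
        raise ValueError("no length bucket is populated in both sets")

    # Average the per-bucket reward change, weighted by the number of
    # `after` samples in each shared bucket.
    total = sum(stats_after[b][1] for b in shared)
    within_gain = sum(
        (stats_after[b][0] - stats_before[b][0]) * stats_after[b][1]
        for b in shared
    ) / total

    return overall_gain, within_gain, 100.0 * within_gain / overall_gain


# Toy data (made up): rewards depend only on length in `sft`.
sft = [(10, 1.0), (10, 1.0), (30, 2.0), (30, 2.0)]
# Outputs get longer, but reward at a given length is unchanged.
longer_only = [(10, 1.0), (30, 2.0), (30, 2.0), (30, 2.0)]
# Lengths are unchanged, but reward rises at every length.
better_everywhere = [(10, 2.0), (10, 2.0), (30, 3.0), (30, 3.0)]
```

On the toy data, `non_length_reward_gain(sft, longer_only)` attributes none of the gain to within-bucket improvement, while `non_length_reward_gain(sft, better_everywhere)` attributes all of it. A low NRG therefore corresponds to a reward increase driven mostly by shifting outputs toward longer lengths.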