Comparative performance against the standard PPO policy baseline across four tasks, highlighting the benefits of guided search.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| OpenWebText (Sentiment) | Success Rate Increase | 0 | 30 | +30 |
| RealToxicityPrompts | Toxicity Reduction | 100 | 66 | -34 |
| QA Benchmarks | Usefulness | 100 | 112 | +12 |
| HH-RLHF | Human Evaluation Win Rate | 0 | 5 | +5 |