Main results comparing RLAIF and RLHF against the SFT baseline show that both RL methods significantly improve over supervised fine-tuning.

| Benchmark | Metric | Baseline (%) | This Paper (%) | Δ |
|---|---|---|---|---|
| Reddit TL;DR | Win Rate vs SFT | 50 | 71 | +21 |
| Helpful Dialogue | Win Rate vs SFT | 50 | 63 | +13 |
Direct head-to-head comparisons between RLAIF and RLHF show no significant difference in quality, indicating that AI feedback is a viable substitute for human feedback.

| Benchmark | Metric | Baseline (%) | This Paper (%) | Δ |
|---|---|---|---|---|
| Reddit TL;DR | Win Rate (RLAIF vs RLHF) | 50 | 50 | 0 |
In the harmlessness evaluation, RLAIF actually outperforms both RLHF and SFT.

| Benchmark | Metric | Baseline (%) | This Paper (%) | Δ |
|---|---|---|---|---|
| Harmless Dialogue | Harmless Rate | 76 | 88 | +12 |
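To make the win-rate columns concrete, the minimal sketch below shows how a win rate over pairwise judgments and the corresponding Δ against the 50% parity baseline could be tallied. The function name and the toy judgment data are hypothetical illustrations, not the paper's evaluation code.

```python
def win_rate(judgments):
    """Percentage of pairwise comparisons won by the candidate policy;
    ties count as half a win. `judgments` holds 'win' / 'loss' / 'tie'."""
    score = sum(1.0 if j == "win" else 0.5 if j == "tie" else 0.0 for j in judgments)
    return 100.0 * score / len(judgments)

# Toy data (hypothetical): 71 wins out of 100 comparisons vs the SFT baseline.
judgments = ["win"] * 71 + ["loss"] * 29
rate = win_rate(judgments)
delta = rate - 50.0  # 50% = parity with the comparison policy
print(f"Win rate: {rate:.0f}%  (Δ = {delta:+.0f})")  # Win rate: 71%  (Δ = +21)
```

Under this reading, a Δ of 0 in the head-to-head table simply means the two policies win equally often against each other.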