| Benchmark | Metric | Baseline | This Paper | Δ (This Paper − Baseline) |
|---|---|---|---|---|
| Main comparison on LLaMA-7B alignment: RAFT achieves a higher reward score and a higher GPT-4 win rate than PPO. | ||||
| HH-RLHF | Reward Score (Test Set) | -1.25 | -1.09 | +0.16 |
| HH-RLHF | GPT-4 Win Rate (%) | 43.0 | 57.0 | +14.0 |
| HH-RLHF | Perplexity | 5.38 | 5.53 | +0.15 |
| Diffusion model experiments demonstrate RAFT's ability to optimize aesthetic rewards. | ||||
| Stable Diffusion Aesthetic | Aesthetic Score | 4.72 | 5.60 | +0.88 |
| Stable Diffusion Aesthetic | Aesthetic Score | 5.21 | 5.60 | +0.39 |
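
The Δ column is consistent with the plain difference of This Paper minus Baseline. The following is a minimal sketch that recomputes each entry from the reported values under that assumed convention; the `rows` list and the `delta` helper are illustrative names, not taken from the paper. Since lower perplexity is generally better, a positive Δ in the perplexity row does not indicate an improvement.

```python
# Values copied from the table above: (label, baseline, this_paper, reported_delta).
# The two Stable Diffusion rows share a label in the table, so they are
# distinguished here only by their position.
rows = [
    ("HH-RLHF reward score (test set)", -1.25, -1.09, 0.16),
    ("HH-RLHF GPT-4 win rate", 43.0, 57.0, 14.0),
    ("HH-RLHF perplexity", 5.38, 5.53, 0.15),
    ("Stable Diffusion aesthetic score (first row)", 4.72, 5.60, 0.88),
    ("Stable Diffusion aesthetic score (second row)", 5.21, 5.60, 0.39),
]


def delta(baseline, ours):
    """Assumed convention: this paper minus baseline, rounded to 2 decimals."""
    return round(ours - baseline, 2)


for label, baseline, ours, reported in rows:
    computed = delta(baseline, ours)
    status = "matches" if abs(computed - reported) < 1e-9 else "differs"
    print(f"{label}: computed {computed:+.2f}, reported {reported:+.2f} ({status})")
```

Rounding to two decimals absorbs floating-point noise in the subtraction, so the comparison against the reported values is exact at the table's precision.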