| Benchmark | Metric | Baseline (%) | This Paper (%) | Δ (pp) |
|---|---|---|---|---|
| Human evaluation results on ELI5 showing WebGPT preference over human baselines. | ||||
| ELI5 | Human Preference Rate (vs. human demonstrations) | 50 | 56 | +6 |
| ELI5 | Human Preference Rate (vs. highest-voted ELI5 reference answers) | 50 | 69 | +19 |
| TruthfulQA results comparing WebGPT to GPT-3 baselines. | ||||
| TruthfulQA | Percentage Truthful | 49 | 75 | +26 |
| TruthfulQA | Percentage Truthful & Informative | 22 | 54 | +32 |
| Comparison of training methods (RL vs Rejection Sampling). | ||||
| ELI5 (Internal Validation) | Preference over BC Baseline (rejection sampling) | 50 | 68 | +18 |
| ELI5 (Internal Validation) | Preference over BC Baseline (RL) | 50 | 58 | +8 |
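
The Δ column appears to be the simple difference between the "This Paper" and "Baseline" columns, in percentage points. For the pairwise preference rows, the 50 baseline is assumed here to be the parity point of a head-to-head comparison, not a separately measured score. A minimal sketch that re-derives the column from the values in the table (the names `ROWS` and `delta_pp` are illustrative, not from the source):

```python
# Values copied from the table: (benchmark, baseline %, this paper %, reported delta).
ROWS = [
    ("ELI5", 50, 56, 6),
    ("ELI5", 50, 69, 19),
    ("TruthfulQA", 49, 75, 26),
    ("TruthfulQA", 22, 54, 32),
    ("ELI5 (Internal Validation)", 50, 68, 18),
    ("ELI5 (Internal Validation)", 50, 58, 8),
]


def delta_pp(baseline: int, result: int) -> int:
    """Return the improvement over the baseline in percentage points."""
    return result - baseline


# Collect any row whose reported delta disagrees with the recomputed one.
mismatches = [row for row in ROWS if delta_pp(row[1], row[2]) != row[3]]

if __name__ == "__main__":
    for name, baseline, result, _ in ROWS:
        print(f"{name}: {baseline} -> {result} ({delta_pp(baseline, result):+d} pp)")
    print("rows with inconsistent deltas:", len(mismatches))
```

Differences are reported in percentage points, not relative percent: going from 22 to 54 is +32 pp, although the score more than doubles in relative terms.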