| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| *Pipeline effectiveness compared to baselines (Rejection Sampling, Zero-shot), as judged by GPT-4.* | | | | |
| Pipeline Comparison | Win Rate (%) | 1.6 | 98.4 | +96.8 |
| Refiner-only vs. Refine-n-Judge | Win Rate (%) | 27.5 | 72.5 | +45.0 |
| *Performance of models fine-tuned on Refine-n-Judge data vs. models fine-tuned on the original TULU data.* | | | | |
| AlpacaEval | Win Rate (%) | 79.3 | 84.8 | +5.5 |
| AlpacaEval 2.0 | Win Rate (%) | 34.1 | 39.4 | +5.3 |
| MT-Bench | Score (1–10) | 7.3 | 7.6 | +0.3 |
| AlpacaEval | Win Rate (%) | 88.2 | 91.8 | +3.6 |
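
The pairwise win rates above come from a judge (GPT-4) choosing between two candidate responses per prompt. Below is a minimal sketch of how such a win rate can be aggregated, assuming verdicts of `"A"`, `"B"`, or `"tie"` have already been collected from the judge; the `win_rate` helper and its tie-splitting convention are illustrative assumptions, not the paper's exact protocol.

```python
def win_rate(verdicts: list[str]) -> float:
    """Win rate (%) of system A from pairwise judge verdicts.

    Each verdict is "A" (A preferred), "B" (B preferred), or "tie".
    Ties count as half a win here, one common convention; the
    paper's exact tie handling is not specified in the table.
    """
    if not verdicts:
        raise ValueError("no verdicts to aggregate")
    wins = sum(1.0 if v == "A" else 0.5 if v == "tie" else 0.0
               for v in verdicts)
    return 100.0 * wins / len(verdicts)

# Hypothetical verdicts from a GPT-4 judge comparing the full
# Refine-n-Judge pipeline ("A") against a baseline ("B").
verdicts = ["A", "A", "A", "tie", "B"]
print(f"win rate: {win_rate(verdicts):.1f}%")  # -> win rate: 70.0%
```

Under this convention the two sides of a pairwise comparison sum to 100, which matches the table (e.g., 1.6 vs. 98.4 and 27.5 vs. 72.5); evaluations such as AlpacaEval additionally randomize which system appears first in the judge prompt to reduce position bias.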