| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Main instruction-following results, showing substantial gains over the seed model and baselines. | ||||
| AlpacaEval 2 | LC Win Rate | 22.9 | 39.4 | +16.5 |
| Arena-Hard | Win Rate | 20.6 | 29.1 | +8.5 |
| AlpacaEval 2 | LC Win Rate | 35.5 | 39.4 | +3.9 |
| Judge accuracy results, showing that training the judge improves its agreement with GPT-4. | ||||
| Self-Chosen Pairs (Agreement w/o ties) | Agreement % | 63.78 | 76.12 | +12.34 |
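
The Δ column above is the "This Paper" score minus the "Baseline" score. As a minimal sketch, the snippet below recomputes that difference for each row so the column can be checked. The names `ROWS` and `delta` are illustrative and not from the source. `Decimal` is used so the one- and two-decimal values subtract exactly.

```python
from decimal import Decimal

# Rows copied from the table: (benchmark, metric, baseline, this paper, reported delta).
# Scores are kept as strings so Decimal parses them exactly.
ROWS = [
    ("AlpacaEval 2", "LC Win Rate", "22.9", "39.4", "+16.5"),
    ("Arena-Hard", "Win Rate", "20.6", "29.1", "+8.5"),
    ("AlpacaEval 2", "LC Win Rate", "35.5", "39.4", "+3.9"),
    ("Self-Chosen Pairs (Agreement w/o ties)", "Agreement %", "63.78", "76.12", "+12.34"),
]


def delta(baseline: str, ours: str) -> str:
    """Return ours - baseline as a signed decimal string, e.g. '+16.5'."""
    return format(Decimal(ours) - Decimal(baseline), "+")


# Collect any row whose reported delta differs from the recomputed one.
mismatches = [
    (name, reported, delta(base, ours))
    for name, _metric, base, ours, reported in ROWS
    if delta(base, ours) != reported
]

for name, reported, recomputed in mismatches:
    print(f"{name}: reported {reported}, recomputed {recomputed}")
```

With the values as given, `mismatches` is empty, so the reported Δ column matches the listed scores.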