Evaluation Setup
Tasks: single-turn dialogue generation and summarization.
Benchmarks:
- Anthropic HH (Dialogue / Helpful & Harmless Assistant)
- Reddit TL;DR (Summarization)
- AlpacaEval 2 (Open-Ended Instruction Following)
Metrics:
- Win Rate (vs Chosen/Reference)
- Length-Controlled Win Rate (AlpacaEval 2)
- Statistical methodology: Not explicitly reported in the paper
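The win-rate metrics above are typically computed from head-to-head judge verdicts (here, GPT-4 comparing a model output against the chosen/reference response). A minimal sketch follows; the tie-handling convention (ties count as half a win) is a common choice, not something the paper specifies:

```python
def win_rate(judgments):
    """Compute a win rate (%) from a list of per-example judge verdicts.

    judgments: list of strings, each "win", "lose", or "tie",
    recording whether the judge preferred the model's output over
    the reference. Ties count as half a win (one common convention;
    conventions vary across evaluation setups).
    """
    wins = sum(1 for j in judgments if j == "win")
    ties = sum(1 for j in judgments if j == "tie")
    return 100.0 * (wins + 0.5 * ties) / len(judgments)
```

For example, two wins, one loss, and one tie over four comparisons gives a 62.5% win rate.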
Key Results
Performance on the Anthropic HH dataset across Pythia model sizes, evaluated by GPT-4:

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Anthropic HH | Win Rate | 51.51 | 57.07 | +5.56 |
| Anthropic HH | Win Rate | 42.78 | 48.67 | +5.89 |
| Anthropic HH | Win Rate | 26.19 | 30.18 | +3.99 |

Generalization to Llama-3 and Mistral models on AlpacaEval 2:

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| AlpacaEval 2 | Win Rate | 38.97 | 40.18 | +1.21 |
| AlpacaEval 2 | Win Rate | 30.56 | 32.13 | +1.57 |
Main Takeaways
- Dynamic β calibration consistently improves win rates across all tested model sizes (410M to 8B).
- The method is robust to sampling temperature; standard DPO degrades rapidly at high temperatures while β-DPO maintains performance.
- Batch-level calibration is superior to instance-level calibration, as instance-level adjustments can lead to instability and overfitting to outliers.
- The approach is orthogonal to the specific loss function, showing gains when applied to DPO, IPO, KTO, and SimPO.
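The batch-level calibration in the takeaways can be sketched as follows. This is an illustrative reconstruction, not the paper's exact rule: β is scaled by how far the batch's mean implicit-reward margin deviates from a baseline, and `beta0`, `alpha`, and `margin_baseline` are hypothetical hyperparameter names introduced here for the sketch:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def beta_dpo_loss(margins, beta0=0.1, alpha=0.6, margin_baseline=0.0):
    """DPO-style loss with a batch-calibrated beta (illustrative sketch).

    margins: per-pair implicit reward margins, i.e.
        (log pi(y_w|x) - log pi_ref(y_w|x)) - (log pi(y_l|x) - log pi_ref(y_l|x))
    for each (chosen y_w, rejected y_l) pair in the batch.
    """
    mean_margin = sum(margins) / len(margins)
    # Batch-level calibration: one beta for the whole batch, scaled by the
    # deviation of the mean margin from a baseline. Calibrating per batch
    # rather than per instance avoids overfitting beta to outlier pairs.
    beta = beta0 * (1.0 + alpha * (mean_margin - margin_baseline))
    # Standard DPO logistic loss with the calibrated beta.
    return -sum(math.log(sigmoid(beta * m)) for m in margins) / len(margins)
```

Because only the scalar β changes, the same calibration step can wrap other pairwise-preference losses (IPO, KTO, SimPO), which is consistent with the orthogonality claim above.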