| Benchmark | Metric | Baseline (%) | This Paper (%) | Δ (pp) |
|---|---|---|---|---|
| *P-Aligner significantly outperforms the Normal (raw-instruction) baseline across all benchmarks when judged by GPT-4-turbo.* | | | | |
| Vicuna Eval | Win Rate (GPT-4-turbo) | 50.00 | 78.75 | +28.75 |
| Self-Instruct Eval | Win Rate (GPT-4-turbo) | 50.00 | 85.32 | +35.32 |
| Dolly Eval | Win Rate (GPT-4-turbo) | 50.00 | 68.50 | +18.50 |
| *P-Aligner also outperforms the BPO baseline on GPT-4-turbo, indicating that principle-guided synthesis adds value over heuristically constructed data.* | | | | |
| Vicuna Eval | Win Rate (GPT-4-turbo) | 73.75 | 78.75 | +5.00 |
| *Results on an open-source model (Gemma-2-SimPO) show consistent, though smaller, gains.* | | | | |
| Vicuna Eval | Win Rate (Gemma-2-SimPO) | 50.00 | 56.25 | +6.25 |
| BPO Test | Win Rate (Gemma-2-SimPO) | 50.00 | 65.00 | +15.00 |
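The Δ column is a plain difference in percentage points between the paper's win rate and the baseline's (50.00 for head-to-head comparisons against the raw instruction). A minimal sketch that recomputes the deltas, with benchmark labels and values transcribed from the table above:

```python
# Win rates (%) transcribed from the results table; Δ is the
# difference in percentage points between "This Paper" and "Baseline".
results = {
    "Vicuna Eval (GPT-4-turbo, vs. Normal)": (50.00, 78.75),
    "Self-Instruct Eval (GPT-4-turbo)":      (50.00, 85.32),
    "Dolly Eval (GPT-4-turbo)":              (50.00, 68.50),
    "Vicuna Eval (GPT-4-turbo, vs. BPO)":    (73.75, 78.75),
    "Vicuna Eval (Gemma-2-SimPO)":           (50.00, 56.25),
    "BPO Test (Gemma-2-SimPO)":              (50.00, 65.00),
}

for name, (baseline, ours) in results.items():
    delta = ours - baseline  # percentage points
    print(f"{name}: {baseline:.2f} -> {ours:.2f} (Δ {delta:+.2f} pp)")
```

This also makes the grouping in the table explicit: every delta is positive, with the largest gains on the GPT-4-turbo comparisons against the raw-instruction baseline.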