| Benchmark | Metric | Baseline | This Paper | Δ (This Paper − Baseline) |
|---|---|---|---|---|
| Reward Modeling Results: UltraRM outperforms open-source baselines on preference prediction accuracy. | ||||
| Average (4 Datasets) | Accuracy % | 60.1 | 71.0 | +10.9 |
| OpenAI WebGPT | Accuracy % | 62.6 | 65.2 | +2.6 |
| Chat Model Performance: UltraLM-13B-PPO achieves the best average win rate in this comparison, though it trails the baseline on AlpacaEval (86.3 vs. 92.7). | ||||
| AlpacaEval | Win Rate % | 92.7 | 86.3 | -6.4 |
| Evol-Instruct | Win Rate % | 50.0 | 57.8 | +7.8 |
| UltraChat | Win Rate % | 50.0 | 64.9 | +14.9 |
| Average (3 Benchmarks) | Win Rate % | 52.9 | 69.7 | +16.8 |
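
The Δ column appears to be the "This Paper" score minus the "Baseline" score, rounded to one decimal place. The following is a minimal sketch that recomputes it for the per-benchmark rows, with values copied from the table. The names `rows`, `delta`, and `chat_mean` are illustrative and do not come from the paper. The baseline average is not recomputed, because the table does not say whether the per-benchmark baselines are the same model.

```python
# Per-benchmark rows copied from the table: (benchmark, baseline, this_paper).
rows = [
    ("OpenAI WebGPT", 62.6, 65.2),
    ("AlpacaEval", 92.7, 86.3),
    ("Evol-Instruct", 50.0, 57.8),
    ("UltraChat", 50.0, 64.9),
]


def delta(baseline: float, ours: float) -> float:
    """Assumed definition of the delta column: this paper minus baseline."""
    return round(ours - baseline, 1)


# Recompute the delta for each benchmark row.
deltas = {name: delta(base, ours) for name, base, ours in rows}

# Mean of this paper's win rates over the three chat benchmarks.
chat_scores = [ours for _, _, ours in rows[1:]]
chat_mean = round(sum(chat_scores) / len(chat_scores), 1)

for name, value in deltas.items():
    print(f"{name}: {value:+.1f}")
print(f"Chat average (this paper): {chat_mean}")
```

Under this assumed definition, the recomputed values match the Δ entries for these four rows, and the mean of the three chat win rates matches the reported 69.7.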