| Benchmark | Metric (%) | Baseline | This Paper | Δ (pp) |
|---|---|---|---|---|
| Text Domain: ULTRAFUSER outperforms both the generalist baseline and the text specialist. | ||||
| TruthfulQA | Acc | 52.88 | 57.77 | +4.89 |
| AlpacaEval | Win Rate | 89.18 | 89.25 | +0.07 |
| Code Domain: ULTRAFUSER surpasses the coding specialist and GPT-3.5. | ||||
| HumanEval | Pass@1 | 48.78 | 53.03 | +4.25 |
| Math Domain: ULTRAFUSER outperforms the math specialist and shows substantial gains over generalist models. | ||||
| GSM8K | Pass@1 | 55.00 | 59.30 | +4.30 |
| MATH | Pass@1 | 11.10 | 12.30 | +1.20 |
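
The Δ column is the absolute difference between the two score columns, in percentage points. As a quick sanity check, the arithmetic can be reproduced with a minimal sketch like the one below. The `SCORES` mapping and the `deltas` helper are hypothetical names introduced here for illustration; the numbers are copied from the table, and `Decimal` is used only to avoid floating-point rounding noise.

```python
from decimal import Decimal

# (baseline, this paper) scores copied from the table above, as strings
# so that Decimal parses them exactly. SCORES is an illustrative name.
SCORES = {
    "TruthfulQA": ("52.88", "57.77"),
    "AlpacaEval": ("89.18", "89.25"),
    "HumanEval": ("48.78", "53.03"),
    "GSM8K": ("55.00", "59.30"),
    "MATH": ("11.10", "12.30"),
}


def deltas(scores):
    """Return the absolute gain per benchmark (this paper minus baseline)."""
    return {
        name: Decimal(new) - Decimal(base)
        for name, (base, new) in scores.items()
    }


for name, gain in deltas(SCORES).items():
    print(f"{name}: {gain:+}")
```

Each computed value should match the corresponding entry in the Δ column.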