Evaluation Setup
Fine-tuning Generative Models (LLMs and Image Generators)
Benchmarks:
- AlpacaEval 2.0 (Instruction Following / Chat)
- MT-Bench (Multi-turn Conversation)
- Open LLM Leaderboard (General Language Understanding)
Metrics:
- Length-controlled win-rate
- Average score (MT-Bench, Open LLM Leaderboard)
- Statistical methodology: Not explicitly reported in the paper
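Of these metrics, the length-controlled win-rate is the least self-explanatory: it removes the advantage that longer responses get from LLM judges. A minimal sketch of the idea, assuming a deliberately simplified version of AlpacaEval 2.0's debiasing (the official tool fits a richer generalized linear model, and the function below is illustrative): fit logit P(win) = a + b·Δlength by gradient ascent, then read off the predicted win probability at zero length difference.

```python
import math

def lc_win_rate(wins, length_diffs, lr=0.1, steps=2000):
    """Length-debiased win-rate sketch.

    wins         -- 1 if the model's response was preferred, else 0
    length_diffs -- normalized (model length - baseline length) per example

    Fits logit P(win) = a + b * length_diff with plain gradient ascent,
    then returns sigmoid(a), i.e. the win-rate with the length effect
    held at zero. Simplified stand-in for AlpacaEval 2.0's GLM.
    """
    a, b = 0.0, 0.0
    n = len(wins)
    for _ in range(steps):
        ga = gb = 0.0
        for w, d in zip(wins, length_diffs):
            p = 1.0 / (1.0 + math.exp(-(a + b * d)))
            ga += (w - p)       # gradient of log-likelihood w.r.t. a
            gb += (w - p) * d   # gradient w.r.t. b
        a += lr * ga / n
        b += lr * gb / n
    return 1.0 / (1.0 + math.exp(-a))
```

When wins correlate with response length, the returned value sits below the raw win-rate, which is the intended correction.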
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| AlpacaEval 2.0 | Length-controlled win-rate | Not reported in the paper | 30.1% | Not reported in the paper |
| MT-Bench | Average score | Not reported in the paper | 8.16 | Not reported in the paper |
| Open LLM Leaderboard | Average score | Not reported in the paper | 68.2 | Not reported in the paper |
Main Takeaways
- REBEL provides a unified approach for both language modeling and image generation.
- Empirically outperforms PPO, DPO, REINFORCE, and RLOO on TL;DR summarization (a qualitative claim in the paper's text; no significance test is reported).
- Achieves competitive performance on major LLM benchmarks (AlpacaEval, MT-Bench) without requiring online GPT-4 queries during training.
- Converges faster than PPO in image generation tasks with similar asymptotic performance.
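The unification behind these takeaways is that REBEL reduces policy optimization to least-squares regression: for each prompt, the scaled difference of log-probability ratios between two sampled completions is regressed onto their reward difference. A minimal per-pair sketch under that reading of the objective (the function name and plain-float interface are illustrative, not from the paper, which optimizes the expectation of this term over sampled pairs):

```python
def rebel_loss(logp_new_a, logp_new_b, logp_old_a, logp_old_b,
               reward_a, reward_b, eta=1.0):
    """Squared-error REBEL term for one prompt with two completions a, b.

    logp_new_* -- log-prob of the completion under the current policy
    logp_old_* -- log-prob under the previous-iterate policy
    reward_*   -- scalar reward for each completion
    eta        -- scale on the log-ratio difference (learning-rate-like)
    """
    # Difference of log-probability ratios between the two completions.
    ratio_diff = (logp_new_a - logp_old_a) - (logp_new_b - logp_old_b)
    # Regression target: the relative reward.
    target = reward_a - reward_b
    return (ratio_diff / eta - target) ** 2
```

The loss is zero exactly when the policy's relative log-ratio matches the relative reward, which is what lets one regression objective cover both LLM and image-generator fine-tuning.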