
REBEL: Reinforcement Learning via Regressing Relative Rewards

Zhaolin Gao, Jonathan D. Chang, Wenhao Zhan, Owen Oertell, Gokul Swamy, Kianté Brantley, Thorsten Joachims, J. Andrew Bagnell, Jason D. Lee, Wen Sun
Cornell University, Princeton University, Carnegie Mellon University, Harvard University
Neural Information Processing Systems (2024)
RL Benchmark

📝 Paper Summary

Reinforcement Learning from Human Feedback (RLHF) · Language Model Fine-tuning · Generative Model Alignment
REBEL replaces complex reinforcement learning heuristics with a simple regression objective that predicts the relative reward difference between two completions, theoretically matching Natural Policy Gradient while eliminating value networks.
Core Problem
Standard RL methods like PPO (Proximal Policy Optimization) are overly complex for fine-tuning large generative models, requiring multiple auxiliary networks (critics, reference models) and sensitive heuristics like clipping.
Why it matters:
  • Running PPO requires storing four large models in memory simultaneously (policy, reference, critic, reward model), creating massive computational overhead.
  • PPO's performance is notorious for being sensitive to implementation details like code-level optimizations and clipping thresholds.
  • Existing algorithms designed for small-scale continuous control do not scale efficiently to the era of billion-parameter generative models.
Concrete Example: In PPO, if a policy update drastically increases the probability of a good response, the clipping heuristic forcibly caps the update to limit distribution shift, potentially discarding valid learning signal. REBEL avoids this by directly regressing the policy's log-probability ratios onto the observed reward difference.
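To make the contrast concrete, here is a minimal sketch of PPO's per-sample clipped surrogate (the standard objective, not code from the REBEL paper). The helper name and the plain-Python style are our own; real implementations operate on tensors. It shows how a probability ratio beyond 1 + eps with a positive advantage gets its gradient signal capped:

```python
import math

def ppo_clipped_term(logp_new, logp_old, advantage, eps=0.2):
    """Per-sample PPO clipped surrogate (hypothetical helper for illustration).

    Takes the new and old policy log-probabilities of one action,
    the estimated advantage, and the clipping threshold eps.
    """
    ratio = math.exp(logp_new - logp_old)          # pi_new / pi_old
    unclipped = ratio * advantage                  # raw policy-gradient term
    clipped = max(min(ratio, 1 + eps), 1 - eps) * advantage
    # PPO takes the pessimistic minimum, so large improvements are capped.
    return min(unclipped, clipped)

# Ratio = 1.5 exceeds 1 + eps = 1.2, so the positive signal is truncated.
capped = ppo_clipped_term(math.log(1.5), 0.0, advantage=1.0)
```

With a positive advantage and a ratio of 1.5, the term is clipped to 1.2 rather than 1.5, illustrating the "discarded learning signal" described above.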
Key Novelty
Regression to Relative Rewards (REBEL)
  • Reduces the RL optimization problem to a sequence of standard least-squares regression tasks on iteratively collected data.
  • Uses the policy network itself to predict the difference in rewards between two trajectories, eliminating the need for a separate value function (critic).
  • Demonstrates that this regression approach is theoretically equivalent to Natural Policy Gradient (NPG) but can be solved with simple first-order optimizers.
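The regression objective described above can be sketched per pair of completions as a plain least-squares loss: the scaled difference of the policy's log-probability ratios is regressed onto the reward difference. This is a simplified scalar illustration under our own naming; the paper's implementation batches this over sampled prompt/completion pairs and optimizes with standard first-order methods:

```python
def rebel_pair_loss(logp_new_a, logp_old_a,
                    logp_new_b, logp_old_b,
                    reward_a, reward_b, eta=1.0):
    """Squared-error REBEL objective for one pair of completions (a, b).

    logp_new_* : log-prob of completion under the policy being optimized
    logp_old_* : log-prob under the previous (data-collecting) policy
    eta        : step-size-like scaling on the log-ratio difference
    """
    # Predicted relative reward: (1/eta) * difference of log-ratios.
    pred = (1.0 / eta) * ((logp_new_a - logp_old_a)
                          - (logp_new_b - logp_old_b))
    # Regress the prediction onto the observed reward gap.
    return (pred - (reward_a - reward_b)) ** 2

# Loss is zero when the log-ratio gap exactly explains the reward gap.
zero_loss = rebel_pair_loss(0.5, 0.0, -0.5, 0.0, reward_a=2.0, reward_b=1.0)
```

Note that no critic appears anywhere: the policy's own log-probabilities serve as the predictor, which is what eliminates the value network.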
Evaluation Highlights
  • 30.1% length-controlled win-rate on AlpacaEval 2.0 using Llama-3-8B-Instruct (without GPT-4 queries).
  • Average score of 68.2 on the Open LLM Leaderboard using Llama-3-8B-Instruct.
  • Average score of 8.16 on MT-Bench using Llama-3-8B-Instruct.
Breakthrough Assessment
8/10
Offers a significant simplification of RLHF by unifying it with regression, backed by strong theoretical links to NPG and competitive empirical results on major LLM benchmarks.