
REBEL: Reinforcement Learning via Regressing Relative Rewards

Zhaolin Gao, Jonathan D. Chang, Wenhao Zhan, Owen Oertell, Gokul Swamy, Kianté Brantley, Thorsten Joachims, J. Andrew Bagnell, Jason D. Lee, Wen Sun
Cornell University, Princeton University, Carnegie Mellon University, Harvard University
Neural Information Processing Systems (2024)
RL Benchmark

📝 Paper Summary

Reinforcement Learning from Human Feedback (RLHF) · Language Model Fine-tuning · Generative Model Alignment
REBEL replaces complex reinforcement learning heuristics with a simple regression objective that predicts the relative reward difference between two completions, theoretically matching Natural Policy Gradient while eliminating value networks.
Core Problem
Standard RL methods like PPO (Proximal Policy Optimization) are overly complex for fine-tuning large generative models, requiring multiple auxiliary networks (critics, reference models) and sensitive heuristics like clipping.
Why it matters:
  • Running PPO requires storing four large models in memory simultaneously (policy, reference, critic, reward model), creating massive computational overhead.
  • PPO's performance is notorious for being sensitive to implementation details like code-level optimizations and clipping thresholds.
  • Existing algorithms designed for small-scale continuous control do not scale efficiently to the era of billion-parameter generative models.
Concrete Example: In PPO, if a policy update drastically increases the probability of a good response, the clipping heuristic forcibly caps the update to limit distribution shift, potentially discarding valid learning signal. REBEL avoids this by directly regressing the policy's log-probability ratios onto the observed reward difference.
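To make the contrast concrete, here is a minimal sketch of PPO's per-sample clipped surrogate (the standard objective, not code from the REBEL paper). The helper name and the plain-Python style are our own; real implementations operate on tensors. It shows how a probability ratio beyond 1 + eps with a positive advantage gets its gradient signal capped:

```python
import math

def ppo_clipped_term(logp_new, logp_old, advantage, eps=0.2):
    """Per-sample PPO clipped surrogate (hypothetical helper for illustration).

    Takes the new and old policy log-probabilities of one action,
    the estimated advantage, and the clipping threshold eps.
    """
    ratio = math.exp(logp_new - logp_old)          # pi_new / pi_old
    unclipped = ratio * advantage                  # raw policy-gradient term
    clipped = max(min(ratio, 1 + eps), 1 - eps) * advantage
    # PPO takes the pessimistic minimum, so large improvements are capped.
    return min(unclipped, clipped)

# Ratio = 1.5 exceeds 1 + eps = 1.2, so the positive signal is truncated.
capped = ppo_clipped_term(math.log(1.5), 0.0, advantage=1.0)
```

With a positive advantage and a ratio of 1.5, the term is clipped to 1.2 rather than 1.5, illustrating the "discarded learning signal" described above.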
Key Novelty
Regression to Relative Rewards (REBEL)
  • Reduces the RL optimization problem to a sequence of standard least-squares regression tasks on iteratively collected data.
  • Uses the policy network itself to predict the difference in rewards between two trajectories, eliminating the need for a separate value function (critic).
  • Demonstrates that this regression approach is theoretically equivalent to Natural Policy Gradient (NPG) but can be solved with simple first-order optimizers.
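The regression objective described above can be sketched per pair of completions as a plain least-squares loss: the scaled difference of the policy's log-probability ratios is regressed onto the reward difference. This is a simplified scalar illustration under our own naming; the paper's implementation batches this over sampled prompt/completion pairs and optimizes with standard first-order methods:

```python
def rebel_pair_loss(logp_new_a, logp_old_a,
                    logp_new_b, logp_old_b,
                    reward_a, reward_b, eta=1.0):
    """Squared-error REBEL objective for one pair of completions (a, b).

    logp_new_* : log-prob of completion under the policy being optimized
    logp_old_* : log-prob under the previous (data-collecting) policy
    eta        : step-size-like scaling on the log-ratio difference
    """
    # Predicted relative reward: (1/eta) * difference of log-ratios.
    pred = (1.0 / eta) * ((logp_new_a - logp_old_a)
                          - (logp_new_b - logp_old_b))
    # Regress the prediction onto the observed reward gap.
    return (pred - (reward_a - reward_b)) ** 2

# Loss is zero when the log-ratio gap exactly explains the reward gap.
zero_loss = rebel_pair_loss(0.5, 0.0, -0.5, 0.0, reward_a=2.0, reward_b=1.0)
```

Note that no critic appears anywhere: the policy's own log-probabilities serve as the predictor, which is what eliminates the value network.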
Evaluation Highlights
  • 30.1% length-controlled win-rate on AlpacaEval 2.0 using Llama-3-8B-Instruct (without GPT-4 queries).
  • Average score of 68.2 on the Open LLM Leaderboard using Llama-3-8B-Instruct.
  • Average score of 8.16 on MT-Bench using Llama-3-8B-Instruct.
Breakthrough Assessment
8/10
Offers a significant simplification of RLHF by unifying it with regression, backed by strong theoretical links to NPG and competitive empirical results on major LLM benchmarks.