RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment

📝 Paper Summary

LLM Alignment, Reinforcement Learning from Human Feedback (RLHF), Generative Model Fine-tuning

RAFT aligns generative models by iteratively generating samples, filtering them via a reward model, and fine-tuning on the high-reward subset, offering a stable alternative to PPO.

Core Problem

Standard RLHF using PPO is notoriously unstable, inefficient, and memory-intensive: training keeps four models loaded at the same time.

Why it matters:

PPO's 'trial-and-error' learning is less stable and efficient than supervised learning
Loading multiple models at once (actor, critic, reference, reward) imposes a heavy memory burden
Pre-determined offline datasets for SFT often lack sufficient coverage to compete with optimal policies

Concrete Example: In standard PPO, a model must explore actions and update via complex gradients while maintaining a critic and reference model. If the reward signal is noisy, PPO training often collapses or hacks the reward, whereas RAFT simply filters out the bad samples before training.

Key Novelty

Iterative Reward-Ranked Fine-Tuning

Decouples data generation from model training: the model generates candidate responses, a reward model ranks them, and only the top candidates are kept.
Uses standard Supervised Fine-Tuning (SFT) on these self-generated 'best-of-K' samples, which is more stable than policy gradient methods.
Iterates this process: the improved model generates better data in the next round, progressively approximating the optimal policy.

Architecture

The three-step iterative process of RAFT: Data Collection, Data Ranking, and Model Fine-tuning.

Evaluation Highlights

RAFT outperforms PPO on the HH-RLHF dataset, achieving a test-set reward of -1.09 vs -1.25 for PPO (higher is better; the paper also shows consistently higher reward curves for RAFT).
In GPT-4 evaluation, RAFT wins 57.0% of the time against the PPO baseline on the HH-RLHF test set.
Achieves superior performance on diffusion models (Stable Diffusion) for aesthetic improvement, raising aesthetic score from 4.72 to 5.60.

Breakthrough Assessment

8/10

Provides a highly effective, simpler, and more stable alternative to PPO for RLHF. The method (essentially iterative rejection sampling SFT) has since become a standard technique (e.g., in Llama-2/3 alignment) due to its robustness.

⚙️ Technical Details

Problem Definition

Setting: Aligning a generative model G with parameters w to maximize expected reward r(x,y) subject to KL divergence constraints.
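
One common way to write this setting down is as a constrained optimization problem. The form below is a sketch rather than the paper's exact statement: the reference parameters w_0 (the model before alignment) and the KL budget \epsilon are assumed symbols that do not appear in this summary.

```latex
\max_{w} \; \mathbb{E}_{x \sim D,\; y \sim G_w(\cdot \mid x)} \big[ r(x, y) \big]
\quad \text{s.t.} \quad
\mathbb{E}_{x \sim D} \Big[ \mathrm{KL}\big( G_w(\cdot \mid x) \,\|\, G_{w_0}(\cdot \mid x) \big) \Big] \le \epsilon
```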

Inputs: Prompts x from a distribution D

Outputs: Generated responses y

Pipeline Flow

Generator (Generates K samples per prompt)
Reward Model (Scores all K samples)
Ranker/Filter (Selects top sample based on score)
Fine-tuner (Updates Generator via SFT on selected samples)
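
The four stages above can be sketched end to end on a toy problem. This is a minimal illustration, not the authors' implementation: the generator is a softmax over three fixed responses per prompt, `toy_reward` is a hypothetical stand-in for a learned reward model, and `sft_step` is one gradient step on the negative log-likelihood of the selected response.

```python
import math
import random

RESPONSES = ["rude reply", "vague reply", "helpful reply"]  # toy response space

def toy_reward(prompt, response):
    # Hypothetical stand-in for a learned reward model.
    return {"rude reply": -2.0, "vague reply": -1.0, "helpful reply": 0.5}[response]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def generate(logits, k, rng):
    # Generator: sample K candidate response indices from the current policy.
    return rng.choices(range(len(RESPONSES)), weights=softmax(logits), k=k)

def sft_step(logits, target, lr):
    # Fine-tuner: one gradient step on -log p(target | prompt).
    # For a softmax policy the gradient w.r.t. the logits is (p - one_hot).
    probs = softmax(logits)
    return [v - lr * (p - (1.0 if i == target else 0.0))
            for i, (v, p) in enumerate(zip(logits, probs))]

def raft(prompts, iterations=20, k=8, lr=0.5, seed=0):
    rng = random.Random(seed)
    policy = {x: [0.0] * len(RESPONSES) for x in prompts}  # uniform start
    for _ in range(iterations):
        selected = []
        for x in prompts:
            candidates = generate(policy[x], k, rng)
            # Reward model + ranker: keep the highest-scoring candidate.
            best = max(candidates, key=lambda i: toy_reward(x, RESPONSES[i]))
            selected.append((x, best))
        # Training happens only after the whole batch has been collected.
        for x, best in selected:
            policy[x] = sft_step(policy[x], best, lr)
    return policy

prompt = "How do I reset my password?"
policy = raft([prompt])
probs = softmax(policy[prompt])
print(dict(zip(RESPONSES, probs)))
```

In this sketch generation and training are separate passes inside each iteration, which is the decoupling the pipeline relies on: the reward function is never called during the update itself.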

System Modules

Generator

Generate candidate responses for a batch of prompts

Model or implementation: LLaMA-7B (for LLM exp) / Stable Diffusion v1.4 (for Vision exp)

Reward Model

Assign scalar scores to generated responses

Model or implementation: Open-LLaMA-3B fine-tuned on HH-RLHF (LLM) / Aesthetic Predictor (Vision)

Fine-tuner

Update model parameters using Supervised Learning

Model or implementation: Same as Generator

Novel Architectural Elements

Iterative loop decoupling generation and training: Generate -> Rank -> SFT -> Repeat, contrasting with PPO's online tight loop.

Modeling

Base Model: LLaMA-7B

Training Method: Iterative Supervised Fine-Tuning on Reward-Ranked Samples

Objective Functions:

Purpose: Select the best sample.

Formally: y* = argmax_{y_i} r(x, y_i)

Purpose: Train model to produce best sample.

Formally: Minimize -log p(y* | x)
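
A small worked example may make the two objectives concrete. The candidate responses, reward scores, and per-token log-probabilities below are hypothetical numbers, and the loss assumes the usual autoregressive factorization, under which -log p(y* | x) is the sum of the negative token log-probabilities.

```python
def select_best(responses, rewards):
    # y* = argmax over the candidates y_i of r(x, y_i)
    best_index = max(range(len(responses)), key=lambda i: rewards[i])
    return responses[best_index]

def sequence_nll(token_log_probs):
    # -log p(y* | x), computed as the negated sum of per-token log-probabilities
    return -sum(token_log_probs)

responses = ["reply A", "reply B", "reply C"]  # hypothetical candidates for one prompt
rewards = [-1.4, -0.9, -1.1]                   # hypothetical reward-model scores
best = select_best(responses, rewards)

token_log_probs = [-0.2, -0.5, -0.1]           # hypothetical log-probs of the tokens of y*
loss = sequence_nll(token_log_probs)
print(best, round(loss, 3))
```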

Adaptation: LoRA (Low-Rank Adaptation) is used for the PPO baseline; full fine-tuning is implied for RAFT (or LoRA where memory is constrained)

Trainable Parameters: LLaMA-7B parameters

Training Data:

HH-RLHF dataset (112K training samples)
Prompts extracted from HH-RLHF

Key Hyperparameters:

K (samples per prompt): 8
batch_size: 64 (LLM), 16 (Diffusion)
learning_rate: 2e-5 (LLM)
epochs_per_iteration: 1
top_p: 0.95 (LLM sampling)
temperature: 1.0 or 2.0 (LLM)
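
The listed LLM settings can be collected into a small config object to see how much data each round produces. This is an illustrative sketch: the class and field names are hypothetical, `kept_per_prompt = 1` follows the pipeline description (top sample per prompt), and it assumes `batch_size` counts prompts per batch, which this summary does not state explicitly.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RaftLLMConfig:
    samples_per_prompt: int = 8   # K
    batch_size: int = 64          # assumed to count prompts per batch
    learning_rate: float = 2e-5
    epochs_per_iteration: int = 1
    top_p: float = 0.95
    temperature: float = 1.0      # the summary lists 1.0 or 2.0
    kept_per_prompt: int = 1      # assumption: only the top sample is kept

    @property
    def generated_per_batch(self) -> int:
        # Responses the generator must produce for one batch of prompts.
        return self.batch_size * self.samples_per_prompt

    @property
    def kept_per_batch(self) -> int:
        # Responses that survive filtering and reach the fine-tuner.
        return self.batch_size * self.kept_per_prompt

    @property
    def acceptance_ratio(self) -> float:
        return self.kept_per_prompt / self.samples_per_prompt

config = RaftLLMConfig()
print(config.generated_per_batch, config.kept_per_batch, config.acceptance_ratio)
# prints: 512 64 0.125
```

Under these assumptions, seven of every eight generated responses are discarded, which is the generation overhead noted under Limitations.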

Compute: 8x A40 (48G) GPUs. RAFT allows loading only 1 model at a time (decoupled), whereas PPO requires loading 4 models simultaneously.

Comparison to Prior Work

vs. PPO: RAFT is offline/iterative SFT rather than online policy gradient; more stable, less memory.
vs. SFT: RAFT trains on self-generated data filtered by reward, improving coverage beyond the fixed dataset.
vs. Best-of-K: RAFT distills the Best-of-K policy into the model weights, avoiding high inference cost.
vs. RRHF: RAFT focuses on online samples from the model itself (improving behavior policy) rather than external sources [cited in paper].

Limitations

Depends heavily on the quality of the reward model; susceptible to reward hacking if the RM is flawed (though claimed more robust than PPO).
Computational cost of generating K samples per prompt during the data collection phase.
Requires an iterative process which may be slower in wall-clock time if generation is slow, despite being more stable.

Reproducibility

Code: https://github.com/LeslieOne/RAFT

Code is publicly available on GitHub. Hyperparameters and dataset details (HH-RLHF) are provided. Reward model training details are in Appendix.

📊 Experiments & Results

Evaluation Setup

Aligning LLaMA-7B on HH-RLHF dataset and Stable Diffusion on Aesthetic dataset.

Benchmarks:

HH-RLHF (Dialogue / Assistant Alignment)
Stable Diffusion Aesthetic (Text-to-Image Generation)

Metrics:

Reward Score (from Reward Model)
GPT-4 Win Rate
Perplexity
Aesthetic Score (CLIP-based)

Statistical methodology: Not explicitly reported in the paper

Key Results

Main comparison on LLaMA-7B alignment shows RAFT achieving better reward scores and GPT-4 win rates compared to PPO.

Benchmark	Metric	Baseline	This Paper	Δ
HH-RLHF	Reward Score (Test Set)	-1.25	-1.09	+0.16
HH-RLHF	GPT-4 Win Rate (%)	43.0	57.0	+14.0
HH-RLHF	Perplexity	5.38	5.53	+0.15

Diffusion model experiments demonstrate RAFT's ability to optimize aesthetic rewards.

Benchmark	Metric	Baseline	This Paper	Δ
Stable Diffusion Aesthetic	Aesthetic Score	4.72	5.60	+0.88
Stable Diffusion Aesthetic	Aesthetic Score	5.21	5.60	+0.39

Experiment Figures

Learning curves comparing Reward vs. Steps for RAFT and PPO on the HH-RLHF dataset.

Reward distribution histograms for PPO and RAFT.

Main Takeaways

RAFT effectively optimizes rewards while maintaining generation quality (low perplexity/high diversity) better than PPO.
The method is more robust to reward noise; filtering by ranking is less sensitive than optimizing raw scalar rewards directly.
Decoupling generation and training allows for significantly lower peak memory usage compared to on-policy RL methods.
Generalizes well to both Language Models and Diffusion Models.
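
One facet of the ranking-based robustness noted above can be checked directly: selecting the top sample depends only on the order of the rewards, so any strictly increasing rescaling of the reward model's output leaves the selection unchanged. The scores below are hypothetical, and the sketch illustrates only this invariance, not robustness to arbitrary noise.

```python
import math

def top_index(rewards):
    # Index of the highest-reward candidate (the sample RAFT would keep).
    return max(range(len(rewards)), key=lambda i: rewards[i])

rewards = [-1.25, -1.09, -1.60, -1.31]       # hypothetical scores for 4 candidates
rescaled = [3.0 * r + 5.0 for r in rewards]  # affine rescaling of the reward
squashed = [math.tanh(r) for r in rewards]   # nonlinear but strictly increasing

picks = [top_index(rewards), top_index(rescaled), top_index(squashed)]
print(picks)  # prints: [1, 1, 1]
```

A method that consumes the raw scalar values sees different numbers under each of these transformations, whereas the filtered training set here is identical in all three cases.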

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning from Human Feedback (RLHF)
Supervised Fine-Tuning (SFT)
Proximal Policy Optimization (PPO)
KL Divergence

Key Terms

RAFT: Reward rAnked FineTuning—the proposed method of filtering generated samples by reward and fine-tuning on the best ones.

SFT: Supervised Fine-Tuning—training a model to minimize negative log-likelihood on specific examples.

RLHF: Reinforcement Learning from Human Feedback—a paradigm to align models using a reward model trained on human preferences.

PPO: Proximal Policy Optimization—the standard RL algorithm used in RLHF, known for being complex and memory-heavy.

Best-of-K: A policy that generates K samples and selects the one with the highest reward score.

Rejection Sampling: The process of keeping only samples that meet a given criterion (here, a high reward ranking) and discarding the rest.