Safe RLHF: Safe Reinforcement Learning from Human Feedback

📝 Paper Summary

Safety Alignment, Reinforcement Learning from Human Feedback (RLHF)

Safe RLHF decouples human preferences into separate helpfulness and harmlessness models, utilizing Lagrangian optimization to dynamically balance these conflicting objectives during LLM fine-tuning.

Core Problem

The objectives of helpfulness and harmlessness in LLMs often conflict (e.g., refusing to answer a harmful query is safe but unhelpful), and collapsing them into a single reward function gives the model a conflicting training signal.

Why it matters:

Models like ChatGPT must avoid generating discrimination or misinformation while remaining useful to users.
Static weighting of safety vs. helpfulness (Reward Shaping) requires manual tuning and often results in either unsafe models or over-defensive models that refuse benign queries.

Concrete Example: When asked 'How to be a serial killer?', a helpfulness-only model might provide a plan (unsafe), while a safety-only model might refuse every query, including benign ones. A single reward model struggles to separate these two failure modes, which leads to confusion during training.

Key Novelty

Safe Reinforcement Learning from Human Feedback (Safe RLHF)

Explicitly decouples data annotation and modeling into two separate components: a Reward Model for helpfulness and a Cost Model for harmlessness.
Formulates the alignment problem as a Constrained Markov Decision Process (CMDP), maximizing reward subject to a safety cost constraint.
Uses the Lagrangian method to dynamically adjust the penalty coefficient (lambda) during training, increasing the penalty when the model violates safety constraints and decreasing it otherwise.
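The multiplier adjustment described in the last point can be sketched as projected gradient ascent on lambda. This is a minimal illustration rather than the paper's exact update rule: the function name `update_lambda`, the `threshold` argument, the clipping at zero, and the per-batch costs are all assumptions made for this example.

```python
def update_lambda(lam: float, mean_cost: float,
                  threshold: float = 0.0, lr: float = 0.1) -> float:
    """One projected gradient-ascent step on the Lagrange multiplier.

    The derivative of lambda * (J_C - threshold) with respect to lambda is
    (J_C - threshold), so lambda grows when the estimated cost exceeds the
    threshold and shrinks otherwise. Clipping keeps lambda non-negative.
    """
    return max(0.0, lam + lr * (mean_cost - threshold))


lam = 1.0
for batch_cost in [0.8, 0.5, -0.2, -0.6]:  # hypothetical per-batch mean costs
    lam = update_lambda(lam, batch_cost)
    print(f"cost={batch_cost:+.1f} -> lambda={lam:.2f}")
```

The penalty rises while the policy is above the cost threshold and relaxes once it drops below, which is the behavior a fixed weight cannot provide.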

Architecture

The Safe RLHF pipeline compared to conventional RLHF. It shows the decoupling of data annotation, the training of separate Reward and Cost models, and the constrained optimization loop.

Evaluation Highlights

Reduced the rate of harmful responses on the evaluation set from 53.08% (Alpaca-7B) to 2.45% (Beaver-v3) according to human labels.
Achieved a +244.91 increase in helpfulness Elo score and +268.31 in harmlessness Elo score (rated by GPT-4) compared to the base Alpaca-7B model.
Outperformed Reward Shaping (static weighting) baselines, achieving a better Pareto frontier between helpfulness and harmlessness.

Breakthrough Assessment

8/10

Significantly advances safety alignment by mathematically formalizing the trade-off as a constrained optimization problem rather than a heuristic sum of rewards. Practical impact is high, helped by the open-source release of code and datasets.

⚙️ Technical Details

Problem Definition

Setting: Constrained Markov Decision Process (CMDP)

Inputs: Natural language prompt x

Outputs: Generated response y

Pipeline Flow

User Prompt -> Aligned LLM (Beaver) -> Response

System Modules

Aligned LLM (Beaver)

Generate helpful and harmless responses to user prompts

Model or implementation: Fine-tuned LLaMA-7B (Alpaca-7B base)

Novel Architectural Elements

Integration of dual preference models (Reward Model and Cost Model) during the PPO training loop (architectural change in training pipeline, not inference)

Modeling

Base Model: Alpaca-7B (reproduced from LLaMA-7B)

Training Method: Safe RLHF (Lagrangian PPO)

Objective Functions:

Purpose: Maximize helpfulness while keeping harmlessness cost below a threshold.

Formally: min_theta max_{lambda >= 0} [-J_R(theta) + lambda * (J_C(theta) - threshold)]

Purpose: Reward Model Loss (Helpfulness).

Formally: Log-sigmoid pairwise ranking loss based on the Bradley-Terry model.

Purpose: Cost Model Loss (Harmlessness).

Formally: Pairwise ranking loss plus a classification term that separates safe responses from unsafe ones.
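The two preference-model losses above can be sketched in NumPy as follows. This is an illustrative reading of the descriptions, not the released implementation: the function names, the argument layout, and the +1 (harmful) / -1 (safe) sign encoding for the classification term are assumptions of this example.

```python
import numpy as np


def log_sigmoid(x: np.ndarray) -> np.ndarray:
    """Numerically stable log(sigmoid(x)) = -log(1 + exp(-x))."""
    return -np.logaddexp(0.0, -x)


def reward_model_loss(r_preferred: np.ndarray, r_rejected: np.ndarray) -> float:
    """Bradley-Terry pairwise loss: -log sigmoid(r_w - r_l), averaged over pairs."""
    return float(-log_sigmoid(r_preferred - r_rejected).mean())


def cost_model_loss(c_more_harmful: np.ndarray, c_less_harmful: np.ndarray,
                    sign_more: np.ndarray, sign_less: np.ndarray) -> float:
    """Pairwise ranking term plus a sign-classification term.

    sign_* is +1 for a response labeled harmful and -1 for one labeled safe
    (an assumed encoding), so the classification term pushes harmful
    responses toward positive cost and safe ones toward negative cost.
    """
    ranking = -log_sigmoid(c_more_harmful - c_less_harmful)
    classification = -(log_sigmoid(sign_more * c_more_harmful)
                       + log_sigmoid(sign_less * c_less_harmful))
    return float((ranking + classification).mean())
```

With the classification term in place, the sign of the cost carries meaning on its own: a positive cost marks a response as unsafe, which gives a constraint threshold on the cost a direct interpretation.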

Adaptation: Full fine-tuning

Training Data:

Dataset collected over 3 rounds of iterative refinement
Includes red-teaming prompts from round 2 onward (round 1 contains none)
Decoupled annotations: dataset D_R (helpfulness) and D_C (harmlessness)

Key Hyperparameters:

learning_rate_actor: 1e-6
learning_rate_critic: 5e-6
batch_size: 512
training_rounds: 3
kl_coefficient: 0.02
lambda_initial_value: 1.0
lambda_learning_rate: 0.1
discount_factor_gamma: 1.0

Compute: 8 × Nvidia A800-80G GPUs used for training.

Comparison to Prior Work

vs. RLHF: Safe RLHF adds a separate Cost Model and constraint mechanism.
vs. Reward Shaping: Safe RLHF dynamically adjusts the trade-off (lambda) instead of using fixed weights.
vs. Constitutional AI: Safe RLHF focuses on mathematical constraints in optimization rather than iterative prompting/AI feedback (Constitutional AI is not used as a direct baseline in the experiments).
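The contrast with Reward Shaping can be made concrete with a small sketch of how each approach turns a (reward, cost) pair into one training signal. The function names are hypothetical, and the division by (1 + lambda) is an assumption of this sketch for keeping the signal's scale bounded, not something stated in this summary.

```python
def reward_shaping_signal(reward: float, cost: float, weight: float) -> float:
    """Static baseline: the weight is fixed before training starts."""
    return reward - weight * cost


def lagrangian_signal(reward: float, cost: float, lam: float) -> float:
    """Dynamic variant: lam is the current value of the Lagrange multiplier.

    Dividing by (1 + lam) keeps the magnitude of the signal bounded as lam
    grows (an assumed normalization for this sketch).
    """
    return (reward - lam * cost) / (1.0 + lam)


# The same (reward, cost) pair is scored differently as lambda changes.
for lam in (0.0, 1.0, 10.0):
    print(lam, lagrangian_signal(reward=1.0, cost=1.0, lam=lam))
```

With lambda at zero the signal reduces to pure helpfulness; as lambda grows, the cost term dominates, so the same response goes from encouraged to penalized without any manual retuning.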

Limitations

Relies on accessible pre-train data (Alpaca) which may be less powerful than proprietary SFT data.
Cost Model accuracy depends on the quality of crowdworker safety annotations.
Experiments limited to single-turn conversations.
High financial cost associated with human annotation and RLHF training.

Reproducibility

Code: https://github.com/PKU-Alignment/safe-rlhf

The datasets for all three rounds of fine-tuning are released alongside the code. The base model is a reproduced Alpaca-7B.

📊 Experiments & Results

Evaluation Setup

Iterative fine-tuning over 3 rounds (Beaver-v1 to v3). Models are evaluated with unified Reward/Cost models, GPT-4, and human annotators.

Benchmarks:

Evaluation Prompt Set (Safety and Helpfulness Evaluation) [New]

Metrics:

Elo Score (Helpfulness & Harmlessness)
Harmful Response Ratio
Reward/Cost Model Scores
Statistical methodology: Not explicitly reported in the paper
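For readers unfamiliar with the Elo metric, the standard rating update is sketched below. The summary does not state how the paper computed its Elo scores, so this is the textbook formula only; the K-factor of 32 and the 400-point scale are conventional defaults assumed here.

```python
def elo_expected(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under the standard Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a: float, rating_b: float, score_a: float,
               k: float = 32.0) -> tuple[float, float]:
    """Update both ratings after one comparison.

    score_a is 1.0 if A's response was preferred, 0.0 if B's was,
    and 0.5 for a tie. The update is zero-sum.
    """
    delta = k * (score_a - elo_expected(rating_a, rating_b))
    return rating_a + delta, rating_b - delta
```

Two models that start at the same rating each have an expected score of 0.5, so a single win moves the winner up by half the K-factor and the loser down by the same amount.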

Key Results

Comparison of the final Beaver-v3 model against the base Alpaca-7B model, showing improvements in both helpfulness and safety.

Benchmark	Metric	Baseline	This Paper	Δ
Evaluation Prompt Set (Human Eval)	Harmful Response Ratio (%)	53.08	2.45	-50.63
Evaluation Prompt Set (GPT-4 Eval)	Helpfulness Elo	1000	1244.91	+244.91
Evaluation Prompt Set (GPT-4 Eval)	Harmlessness Elo	1000	1268.31	+268.31

Ablation study comparing Safe RLHF dynamic optimization against Reward Shaping (static weighting).

Benchmark	Metric	Baseline	This Paper	Δ
Evaluation Prompt Set	Harmlessness Win Rate vs SFT	56.0	62.0	+6.0

Experiment Figures

Comparison of Safe RLHF vs Reward Shaping (RS) and the training dynamics of the Lagrange multiplier.

Scatter plots of Reward vs Cost for responses across training rounds (Alpaca, Beaver-v1, v2, v3).

Main Takeaways

Safe RLHF effectively reduces harmful outputs while maintaining or improving helpfulness, unlike static methods that often compromise one for the other.
Decoupling annotation for helpfulness and harmlessness increases inter-rater agreement compared to single-dimensional preference labeling.
The Lagrangian multiplier adjusts dynamically during training: it raises the safety penalty when the constraint is violated and lowers it once the model is safe, which avoids over-optimizing for safety (for example, refusing benign prompts).
Iterative red-teaming in rounds 2 and 3 was crucial for uncovering and fixing latent safety vulnerabilities.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning from Human Feedback (RLHF)
Constrained Optimization (Lagrangian multipliers)
Bradley-Terry Model

Key Terms

Safe RLHF: A framework extending RLHF by separating helpfulness (reward) and harmlessness (cost) and optimizing via Lagrangian methods.

Lagrangian method: An optimization technique that finds the local maxima/minima of a function subject to equality or inequality constraints by introducing Lagrange multipliers.

CMDP: Constrained Markov Decision Process—an extension of MDPs where the agent must maximize reward while satisfying auxiliary cost constraints.

Reward Shaping: A baseline method where helpfulness and harmlessness rewards are combined into a single scalar using a fixed static weight.

Red-teaming: A practice where humans actively try to provoke the model into generating harmful or unsafe content to identify vulnerabilities.

Bradley-Terry model: A probability model used to predict the outcome of a pairwise comparison (e.g., preference between two responses).
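As a minimal sketch of the Bradley-Terry definition, the preference probability can be computed from two scalar scores. The function name and the use of raw scores as log-strengths are choices made for this illustration.

```python
import math


def bradley_terry_prob(score_a: float, score_b: float) -> float:
    """P(a preferred over b) = exp(s_a) / (exp(s_a) + exp(s_b)).

    This equals sigmoid(s_a - s_b); the branch keeps exp() from overflowing
    for large score differences.
    """
    diff = score_a - score_b
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)
```

Only the difference between the two scores matters, which is why a preference model trained this way learns scores that are meaningful relative to each other rather than on an absolute scale.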

SFT: Supervised Fine-Tuning—the initial phase of training on high-quality demonstration data before RL.

PPO: Proximal Policy Optimization—a policy gradient method for reinforcement learning.