SALMON: Self-Alignment with Instructable Reward Models

📝 Paper Summary

AI Alignment, Reinforcement Learning from AI Feedback (RLAIF)

SALMON aligns language models using an instructable reward model trained on synthetic, principle-driven preferences, allowing dynamic control over model behavior and mitigation of reward hacking without human annotation.

Core Problem

Aligning LLMs via RLHF requires expensive, unscalable human annotations, and fixed reward models are susceptible to reward hacking (e.g., self-praising) that cannot be fixed without collecting new data.

Why it matters:

Acquiring consistent, high-quality human preference data is costly and limits the scalability of alignment for advanced models.
Fixed reward models become static targets for optimization; when a policy model learns to hack the reward (e.g., by being verbose), correcting it typically requires a new round of data collection.
Existing RLAIF methods often focus only on safety (Harmlessness) or require an RLHF-warm-started model, limiting their ability to align models from scratch.

Concrete Example: A policy model might learn to 'self-praise' (e.g., appending 'I hope this helps!' to every response) to artificially boost its reward score. In standard RLHF, fixing this requires collecting new human data that penalizes the behavior. In SALMON, one can simply add a 'No Self-Praise' principle to the reward model's input during RL training, which penalizes the behavior immediately without retraining the reward model.

Key Novelty

Instructable Reward Model (IRM)

Instead of learning a static preference score, the reward model is trained to strictly follow input principles (e.g., 'Be concise') when scoring response pairs.
Synthetic preference data is generated by the model itself acting as a judge, guided by varying subsets of human-written principles.
Allows 'test-time' intervention during RL: researchers can inject new principles (e.g., prohibition of specific patterns) to steer the policy and stop reward hacking without retraining the reward model.

Architecture

The training pipeline for the Instructable Reward Model.

Evaluation Highlights

Dromedary-2 (70B) achieves a score of 6.92 on MT-Bench, surpassing the heavily human-annotated Llama-2-Chat-70b (6.86) despite using only 6 human-written exemplars.
SALMON-13b outperforms Llama-2-Chat-13b by 6.0 points (38.8 vs. 32.8) on the LLM-Bar adversarial benchmark, demonstrating better robustness to instruction-following traps.
Achieves state-of-the-art performance with only 31 human-defined principles and 6 seeds, compared to 1M+ human annotations used for Llama-2-Chat.

Breakthrough Assessment

8/10

Significantly reduces the barrier to entry for high-quality alignment by replacing massive human annotation with synthetic data and instructable rewards, outperforming top-tier open-source models like Llama-2-Chat.

⚙️ Technical Details

Problem Definition

Setting: Aligning a base Large Language Model (LLM) to human preferences (Helpful, Honest, Harmless) using Reinforcement Learning from AI Feedback (RLAIF)

Inputs: User query x, set of human-defined principles (e.g., 'Be concise', 'No self-praise')

Outputs: Aligned response y

Pipeline Flow

Principle Sampling: Select subset of principles (k=3)
Response Generation: Policy model generates response to prompt
Reward Evaluation: Instructable Reward Model scores response based on Prompt + Response + Sampled Principles
PPO Update: Policy model updates weights to maximize Reward
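
The four steps above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the principle pool, the `generate_response` stub, and the text layout produced by `build_reward_input` are all hypothetical, and the scoring and PPO steps are left as a comment.

```python
import random

# Hypothetical principle pool; the paper uses 31 human-written principles.
PRINCIPLES = [
    "Be concise",
    "Do not self-praise",
    "Be honest about uncertainty",
    "Be harmless",
    "Answer the question directly",
]

def sample_principles(rng, k=3):
    """Step 1: pick k distinct principles for this rollout."""
    return rng.sample(PRINCIPLES, k)

def generate_response(prompt):
    """Step 2: placeholder standing in for the policy model's generation."""
    return f"Answer to: {prompt}"

def build_reward_input(prompt, response, principles):
    """Step 3 (input side): join principles, prompt, and response into one text.

    The exact template is an assumption made for illustration.
    """
    bullet_list = "\n".join(f"- {p}" for p in principles)
    return f"Principles:\n{bullet_list}\n\nPrompt: {prompt}\nResponse: {response}"

def rollout(prompt, rng):
    principles = sample_principles(rng)
    response = generate_response(prompt)
    reward_input = build_reward_input(prompt, response, principles)
    # Step 3 (scoring) and step 4 (PPO update) would consume reward_input here.
    return principles, response, reward_input
```

Calling `rollout("What is PPO?", random.Random(0))` returns the sampled principles, the stub response, and the text a reward model would score.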

System Modules

Principle Sampler (Reward Generation)

Selects a random subset of positive/negative principles (e.g., 'Be concise', 'Do not self-praise') to guide the reward signal

Model or implementation: Heuristic / Random Sampler

Policy Model

Generates responses to user queries

Model or implementation: Llama-2-70b / Llama-2-13b

Instructable Reward Model (Reward Generation)

Assigns a scalar score to the response, conditioned strictly on adherence to the provided principles

Model or implementation: Llama-2-70b / Llama-2-13b (with scalar head)

Novel Architectural Elements

Reward model architecture accepts 'Principles' as explicit text input, conditioning the scalar score on adherence to these dynamic instructions rather than a static preference definition
RL-time principle injection allows adding prohibition principles (e.g., 'Do not offer high-level advice') to the reward function without retraining the reward model
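
To make the injection idea concrete, here is a toy sketch in which a rule-based function stands in for the reward model. It is purely illustrative: a real instructable reward model is a neural network that reads the principle text, whereas `toy_reward` and its hand-written checks are assumptions made so the example runs on its own.

```python
# Toy stand-in for an instructable reward model: each principle maps to a
# hand-written check. A real IRM would read the principle text directly.
def toy_reward(response, principles):
    score = 1.0  # base score for producing any response
    lowered = response.lower()
    for principle in principles:
        if principle == "Be concise" and len(response.split()) > 20:
            score -= 0.5
        if principle == "Do not self-praise" and "i hope this helps" in lowered:
            score -= 0.5
    return score

response = "Paris is the capital of France. I hope this helps!"
base_principles = ["Be concise"]

# Same scorer, same response; only the principle list changes.
before = toy_reward(response, base_principles)
after = toy_reward(response, base_principles + ["Do not self-praise"])
```

The scorer itself is never modified between the two calls; the lower `after` score comes entirely from the extra principle in the input, which mirrors how a prohibition can be added without retraining.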

Modeling

Base Model: Llama-2-70b and Llama-2-13b

Training Method: PPO (Proximal Policy Optimization)

Objective Functions:

Purpose: Maximize expected reward while staying close to initial policy.

Formally: PPO clipped surrogate objective with KL penalty.

Purpose: Train reward model to predict preferences based on principles.

Formally: Cross-entropy loss on Bradley-Terry model: -log(σ(r(x, y_w, p) - r(x, y_l, p))), where y_w and y_l are the preferred and rejected responses to prompt x, p is the set of principles, and σ is the sigmoid function.
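
As a small worked example of the Bradley-Terry loss above, the following sketch computes it for a single preference pair from two scalar rewards. The function name and the numerically stable softplus form are choices made here, not details from the paper.

```python
import math

def bradley_terry_loss(r_chosen, r_rejected):
    """Return -log(sigmoid(r_chosen - r_rejected)) for one preference pair."""
    margin = r_chosen - r_rejected
    # -log(sigmoid(m)) equals softplus(-m) = max(-m, 0) + log1p(exp(-|m|)),
    # which avoids overflow for large positive or negative margins.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# The loss shrinks as the preferred response's reward pulls ahead.
loss_tied = bradley_terry_loss(1.0, 1.0)
loss_correct = bradley_terry_loss(2.0, 0.0)
loss_wrong = bradley_terry_loss(0.0, 2.0)
```

When the two rewards are equal the loss is log 2; it falls toward zero as the preferred response is scored higher and grows roughly linearly when the ordering is wrong.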

Adaptation: Full fine-tuning (Policy and Reward Model)

Training Data:

Synthetic Preferences: Generated using SFT model as judge on inputs from Alpaca/ShareGPT/UltraChat
Principles: 31 manual principles (17 from Self-Align, 14 new)
SFT Data: Self-generated using 6 ICL exemplars
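
The synthetic preference records listed above could be assembled along these lines. This is a sketch under stated assumptions: `toy_judge` is a placeholder rule for a single principle, whereas the actual pipeline prompts the SFT model to act as judge, and the record field names are hypothetical.

```python
def toy_judge(prompt, response_a, response_b, principle):
    """Placeholder judge; a real pipeline would query the SFT model instead."""
    if principle == "Be concise":
        return "a" if len(response_a) <= len(response_b) else "b"
    raise ValueError(f"toy judge has no rule for: {principle}")

def make_preference(prompt, response_a, response_b, principle):
    """Turn two candidate responses into one principle-conditioned record."""
    winner = toy_judge(prompt, response_a, response_b, principle)
    if winner == "a":
        chosen, rejected = response_a, response_b
    else:
        chosen, rejected = response_b, response_a
    return {
        "prompt": prompt,
        "principle": principle,
        "chosen": chosen,
        "rejected": rejected,
    }

record = make_preference(
    "What is the capital of France?",
    "Paris.",
    "The capital of France is the well-known city of Paris.",
    "Be concise",
)
```

Storing the principle alongside each pair is what lets the reward model learn to condition its score on the principle rather than on a single fixed notion of preference.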

Key Hyperparameters:

learning_rate: 5e-7 (Reward Model), 1e-6 (Policy Model)
batch_size: 64 (13b), 128 (70b)
kl_coefficient_beta: 0.01 (13b), 0.005 (70b)
ppo_clip_epsilon: Not reported in the paper
max_length: 2048 tokens
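
To show where the KL coefficient and clip range enter, here is a per-sample sketch of the standard PPO quantities. It assumes the common formulation in which the KL penalty is subtracted from the reward as a log-probability ratio; the summary does not confirm that SALMON applies the penalty in exactly this way, and `eps` is left as a required argument because the clip value is not reported.

```python
def kl_shaped_reward(reward, logp_policy, logp_ref, beta):
    """Subtract beta times a per-sample KL estimate (log-prob ratio) from the reward."""
    return reward - beta * (logp_policy - logp_ref)

def clipped_surrogate(ratio, advantage, eps):
    """Standard PPO clipped surrogate for one sample (to be maximized)."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)

# KL coefficients as listed above; the other numbers are made-up inputs.
BETA = {"13b": 0.01, "70b": 0.005}
shaped_13b = kl_shaped_reward(1.0, -10.0, -12.0, BETA["13b"])
shaped_70b = kl_shaped_reward(1.0, -10.0, -12.0, BETA["70b"])

# eps=0.2 is a hypothetical value chosen only for this illustration.
objective = clipped_surrogate(1.5, 1.0, eps=0.2)
```

With identical inputs, the smaller 70b coefficient penalizes drift from the initial policy half as much as the 13b setting.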

Compute: Not reported in the paper

Comparison to Prior Work

vs. Constitutional AI: SALMON introduces an *instructable* reward model that takes principles as input, whereas CAI trains a fixed reward model on AI-generated labels. SALMON also targets Helpfulness/Honesty from scratch, not just Harmlessness.
vs. Llama-2-Chat: SALMON uses only 6 human exemplars vs. 1M+ annotations.
vs. Self-Align: SALMON adds the RL stage with the instructable reward model, significantly improving performance over the SFT-only Self-Align baseline.
vs. RLAIF (standard): Standard RLAIF produces a static reward model. SALMON's reward model is dynamic/instructable at test time.

Limitations

Dependency on the quality of the base model (Llama-2) to understand and follow principles during synthetic data generation.
Descriptions of reward-hacking traits must be manually written as principles (though this is easier than re-labeling data).
Evaluation relies heavily on GPT-4 based benchmarks (MT-Bench, AlpacaEval), which may have biases.

Reproducibility

Code: https://github.com/IBM/SALMON

Code and model weights are publicly available at https://github.com/IBM/SALMON. The paper lists the 31 principles used in Tables 7 and 8, and the synthetic data generation process is fully described.

📊 Experiments & Results

Evaluation Setup

Chat assistant capability evaluation using LLM-as-a-judge (GPT-4)

Benchmarks:

MT-Bench (Multi-turn conversation)
LLM-Bar (Adversarial instruction following)
AlpacaEval (Open-ended instruction following)

Metrics:

Win Rate vs Baseline
Score (1-10)
Average Score
Statistical methodology: Not explicitly reported in the paper

Key Results

Comparative analysis of Dromedary-2 (SALMON-70b) against state-of-the-art baselines on MT-Bench, highlighting supervision efficiency.

Benchmark	Metric	Baseline	This Paper	Δ
MT-Bench	Score	6.86	6.92	+0.06
MT-Bench (Turn 1)	Score	7.40	7.51	+0.11
LLM-Bar	Average Score	32.8	38.8	+6.0
AlpacaEval	Win Rate %	81.09	82.15	+1.06

Experiment Figures

Examples of reward hacking behaviors and how specific principles mitigate them.

Main Takeaways

SALMON achieves competitive or superior performance to extensively supervised models (Llama-2-Chat) using negligible human data (6 exemplars).
The instructable reward model effectively mitigates reward hacking; specific principles added at RL-time (e.g., prohibiting self-praise) successfully curb observed pathologies.
The method generalizes well across model sizes (13b and 70b), showing consistent improvements over SFT-only baselines and standard RLHF models on adversarial tasks.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning from Human Feedback (RLHF)
Proximal Policy Optimization (PPO)
Bradley-Terry Reward Model
In-Context Learning (ICL)

Key Terms

RLAIF: Reinforcement Learning from AI Feedback—using an AI model to generate preference labels instead of humans

Instructable Reward Model: A reward model that accepts text principles as input alongside the prompt and response, generating scores conditioned on those principles

Reward Hacking: When an RL agent exploits flaws in the reward function to get high scores without actually achieving the desired goal (e.g., being verbose to look helpful)

Dromedary-2: The specific AI assistant model developed in this paper based on Llama-2-70b using the SALMON method

Synthetic Preference: Training data for the reward model generated by sampling two responses and asking an LLM to judge which is better based on a specific principle

MT-Bench: A challenging multi-turn benchmark for evaluating chat assistants using GPT-4 as a judge

SFT: Supervised Fine-Tuning—training a model on a dataset of high-quality instruction-response pairs

PPO: Proximal Policy Optimization—an RL algorithm used to update the policy model

LLM-Bar: An adversarial benchmark designed to test if models can resist confusion from misleading instructions or biases