SLiC-HF: Sequence Likelihood Calibration with Human Feedback

📝 Paper Summary

Alignment, Reinforcement Learning from Human Feedback (RLHF), Text Summarization

SLiC-HF aligns language models to human preferences using a margin-based contrastive loss over generated sequences, eliminating the need for complex reinforcement learning optimization like PPO.

Core Problem

Standard RLHF using PPO (Proximal Policy Optimization) is computationally expensive, memory-intensive, and difficult to tune due to the need for separate value networks and online sampling during training.

Why it matters:

PPO requires keeping multiple large models (policy, value, reward, reference) in memory, limiting the size of models that can be trained
The sampling (rollout) step inside the PPO training loop significantly slows down optimization compared to standard supervised learning
Reference-based metrics (ROUGE) cannot reward summaries that are better than the gold reference, so human preference alignment is needed, yet PPO's complexity is a barrier to adopting it

Concrete Example: In summarization, a model might generate a technically accurate but boring summary. A PPO-based RLHF approach would attempt to fix this by running generation loops during training to estimate values, whereas SLiC-HF simply compares the probability of a 'good' summary vs. a 'bad' one offline.

Key Novelty

Sequence Likelihood Calibration with Human Feedback (SLiC-HF)

Adapts Sequence Likelihood Calibration (SLiC) to use human preference rankings instead of similarity-to-reference metrics
Uses a pairwise contrastive loss that forces the model to assign higher probability to the preferred sequence in a pair compared to the dispreferred one
Enables learning from off-policy data (samples generated by other models) effectively, decoupling generation from optimization

Architecture

Input text formats for training the Reward Model vs. the Ranking Model.
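
A minimal sketch of how the two input formats might differ, assuming a text-to-text setup. The field tags (`[CONTEXT]`, `[SUMMARY]`, `[SUMMARY A]`, `[SUMMARY B]`) and the function names are hypothetical placeholders for illustration, not the paper's exact strings. The point is only that the reward model sees one summary per input, while the ranking model sees two.

```python
def reward_model_input(document: str, summary: str) -> str:
    """Pointwise format: one document and one summary to be scored.

    The tags are illustrative assumptions, not the paper's exact strings.
    """
    return f"[CONTEXT] {document} [SUMMARY] {summary}"


def ranking_model_input(document: str, summary_a: str, summary_b: str) -> str:
    """Pairwise format: one document and two summaries to be compared.

    The tags are illustrative assumptions, not the paper's exact strings.
    """
    return f"[CONTEXT] {document} [SUMMARY A] {summary_a} [SUMMARY B] {summary_b}"


if __name__ == "__main__":
    doc = "My landlord raised the rent twice this year."
    print(reward_model_input(doc, "Rent went up twice."))
    print(ranking_model_input(doc, "Rent went up twice.", "Landlords exist."))
```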

Evaluation Highlights

SLiC-HF (T5-Large, 770M params) achieves a 66% win rate against a significantly larger RLHF-PPO baseline (6B params) in human evaluation
Ranking-based filtering combined with SLiC-HF improves the ranker-judged win rate against reference summaries from 44.96% (SFT) to 86.21% on the Reddit TL;DR dataset
Scaling the base model to T5-XXL (11B) yields a 96.10% win rate against human references according to the ranking model

Breakthrough Assessment

8/10

Provides a highly effective, simpler, and more efficient alternative to PPO for alignment. While similar to DPO (concurrent work), it demonstrates strong empirical results and off-policy capabilities.

⚙️ Technical Details

Problem Definition

Setting: Conditional language generation aligned with pairwise human preferences

Inputs: Document x

Outputs: Summary y

Pipeline Flow

SFT Model (Generator) -> Sample Candidates
Candidates -> Ranking Model -> Positive/Negative Pairs
Positive/Negative Pairs -> SLiC-HF Loss Optimization
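
The flow above can be sketched as a small offline data-preparation step. This is an illustration under assumptions: `prefers_first` stands in for the Ranking Model, `toy_ranker` is a hypothetical word-overlap heuristic used only so the example runs, and pairing consecutive candidates is an assumed pairing scheme, since this summary does not say how pairs are drawn from the sampled candidates.

```python
from typing import Callable, List, Tuple


def build_preference_pairs(
    document: str,
    candidates: List[str],
    prefers_first: Callable[[str, str, str], bool],
) -> List[Tuple[str, str]]:
    """Turn sampled candidates into (positive, negative) pairs.

    Consecutive candidates are paired (an assumption for illustration);
    `prefers_first(document, a, b)` plays the role of the Ranking Model.
    """
    pairs = []
    for i in range(0, len(candidates) - 1, 2):
        a, b = candidates[i], candidates[i + 1]
        if prefers_first(document, a, b):
            pairs.append((a, b))
        else:
            pairs.append((b, a))
    return pairs


def toy_ranker(document: str, a: str, b: str) -> bool:
    """Hypothetical stand-in ranker: prefer the summary sharing more words with the document."""
    doc_words = set(document.split())

    def overlap(summary: str) -> int:
        return sum(1 for word in summary.split() if word in doc_words)

    return overlap(a) >= overlap(b)


if __name__ == "__main__":
    doc = "the cat sat on the mat"
    sampled = ["cat sat", "dog ran", "a bird", "cat on the mat"]
    for positive, negative in build_preference_pairs(doc, sampled, toy_ranker):
        print(f"positive={positive!r} negative={negative!r}")
```

Because the pairs are built before optimization starts, the generator that produced the candidates does not have to be the model being trained.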

System Modules

Generator (Policy)

Generates candidate summaries

Model or implementation: T5-Large (770M) or T5-XXL (11B)

Ranking Model

Compares two summaries and predicts the preferred one

Model or implementation: T5-XXL (11B)

Reward Model

Alternative to Ranking Model; scores a single summary

Model or implementation: T5-XXL (11B)

Novel Architectural Elements

Integration of calibration loss (margin ranking) specifically with human preference data or preference-model-labelled data

Modeling

Base Model: T5-Large (770M parameters) and T5-XXL (11B parameters)

Training Method: Sequence Likelihood Calibration (SLiC)

Objective Functions:

Purpose: Maximize probability gap between positive and negative sequences.

Formally: L_cal = max(0, delta - log P(y+|x) + log P(y-|x))

Purpose: Regularization to maintain linguistic quality.

Formally: L_reg = -lambda * log P(y_ref|x)
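
A minimal sketch of the two terms, written over plain floats so the arithmetic is easy to follow. It assumes the total loss is the sum of the calibration and regularization terms and that sequence log-probabilities are sums of per-token log-probabilities. The function names are hypothetical, and the default `lam=0.5` is a placeholder, not a value reported in this summary.

```python
from typing import Sequence


def sequence_log_prob(token_log_probs: Sequence[float]) -> float:
    """Sequence log-likelihood as the sum of per-token log-probabilities."""
    return sum(token_log_probs)


def slic_hf_loss(
    logp_pos: float,
    logp_neg: float,
    logp_ref: float,
    delta: float = 1.0,
    lam: float = 0.5,  # placeholder weight, not taken from the paper
) -> float:
    """Margin calibration term plus reference regularization term.

    L_cal = max(0, delta - log P(y+|x) + log P(y-|x))
    L_reg = -lam * log P(y_ref|x)
    """
    calibration = max(0.0, delta - logp_pos + logp_neg)
    regularization = -lam * logp_ref
    return calibration + regularization


if __name__ == "__main__":
    # Positive already beats negative by more than the margin: hinge is zero.
    print(slic_hf_loss(logp_pos=-10.0, logp_neg=-12.0, logp_ref=-8.0))
    # Negative is more likely than positive: hinge contributes to the loss.
    print(slic_hf_loss(logp_pos=-12.0, logp_neg=-10.0, logp_ref=-8.0))
```

The hinge means a pair stops contributing gradient once the positive sequence leads the negative one by at least `delta` in log-probability.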

Training Data:

Reddit TL;DR dataset (117k train, 6k val, 6k test)
Human feedback data (64k preferences)

Key Hyperparameters:

margin_delta: 1.0
learning_rate: 1e-5
batch_size: 32
num_candidates_m: 8 (sampled per example)

Compute: More efficient than RLHF-PPO, with no rollouts inside the training loop and 1/4 of the memory for model weights during training
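
One way to read the 1/4 weight-memory figure is as back-of-the-envelope arithmetic: PPO holds four models (policy, value, reward, reference) while SLiC-HF holds only the policy. The sketch below assumes all four models are the same size and stored at 2 bytes per parameter; both are simplifying assumptions, and optimizer state and activations are ignored.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int = 2, num_models: int = 1) -> float:
    """Memory for model weights only, in GB (optimizer state and activations ignored)."""
    return num_params * bytes_per_param * num_models / 1e9


if __name__ == "__main__":
    params = 770e6  # T5-Large
    slic = weight_memory_gb(params, num_models=1)  # policy only
    ppo = weight_memory_gb(params, num_models=4)   # policy, value, reward, reference
    print(f"SLiC-HF weights: {slic:.2f} GB")
    print(f"PPO weights:     {ppo:.2f} GB")
    print(f"ratio:           {slic / ppo:.2f}")
```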

Comparison to Prior Work

vs. RLHF-PPO: SLiC-HF uses a contrastive loss instead of RL, requires no value network, and allows offline data usage
vs. BRIO: SLiC-HF aligns to human/learned preferences rather than ROUGE-based reference similarity
vs. DPO (Direct Preference Optimization) [not cited in paper]: DPO derives the loss from the optimal RL policy solution, whereas SLiC-HF uses a margin-based calibration loss; both avoid explicit PPO

Limitations

Smaller Ranking/Reward models (below T5-XXL) did not converge reliably, requiring large auxiliary models
SLiC-HF-direct (training directly on the human feedback data without sampling from the model) causes length inflation
Performance depends heavily on the quality of the Ranking Model used to label samples

Reproducibility

Code not provided. Uses standard T5 models and public Reddit TL;DR dataset. Hyperparameters for SLiC margin and learning rates are provided.

📊 Experiments & Results

Evaluation Setup

Abstractive Summarization on Reddit TL;DR dataset

Benchmarks:

Reddit TL;DR (Abstractive Summarization)

Metrics:

Human Side-by-Side (SxS) Win Rate
Ranker Win Rate (Auto-eval against reference)
ROUGE (R1/R2/RL)
Statistical methodology: Not explicitly reported in the paper
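
The ranker win rate listed above can be read as the percentage of test examples on which the ranker prefers the model's summary over the reference. A minimal sketch, assuming a callable `prefers_first(document, a, b)` that stands in for the Ranking Model; `toy_length_ranker` is a hypothetical placeholder so the example runs and is not how the paper's ranker behaves.

```python
from typing import Callable, Sequence


def ranker_win_rate(
    documents: Sequence[str],
    model_summaries: Sequence[str],
    reference_summaries: Sequence[str],
    prefers_first: Callable[[str, str, str], bool],
) -> float:
    """Percentage of examples where the ranker prefers the model summary to the reference."""
    if not documents:
        raise ValueError("need at least one example")
    wins = sum(
        1
        for doc, model_out, ref in zip(documents, model_summaries, reference_summaries)
        if prefers_first(doc, model_out, ref)
    )
    return 100.0 * wins / len(documents)


def toy_length_ranker(document: str, a: str, b: str) -> bool:
    """Hypothetical stand-in: prefer the shorter summary."""
    return len(a) < len(b)


if __name__ == "__main__":
    docs = ["d1", "d2", "d3", "d4"]
    model_out = ["short", "tiny", "a much longer summary", "ok"]
    refs = ["a longer reference", "reference", "brief", "reference"]
    print(ranker_win_rate(docs, model_out, refs, toy_length_ranker))
```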

Key Results

Benchmark	Metric	Baseline (%)	This Paper (%)	Δ (points)
Human evaluation demonstrates SLiC-HF outperforms a significantly larger RLHF-PPO baseline.
Reddit TL;DR	Human Win Rate	34.0	66.0	+32.0
Reddit TL;DR	Human Win Rate	44.0	56.0	+12.0
Automatic evaluation using the Ranking Model shows significant improvements over SFT and benefits of scaling.
Reddit TL;DR	Ranker Win Rate (vs Reference)	44.96	86.21	+41.25
Reddit TL;DR	Ranker Win Rate (vs Reference)	62.34	96.10	+33.76

Experiment Figures

Average quality scores of models bucketed by relative length to reference summaries.

Length-controlled win rates against SFT and RLHF baselines.

Main Takeaways

SLiC-HF provides a competitive alternative to PPO: a 770M-parameter SLiC-HF model outperforms a 6B-parameter RLHF-PPO model in human evaluation.
Pairwise ranking models agree with human preferences more often (73.23% accuracy) than pointwise reward models (71.34%).
Using the Ranking model to filter candidates for SLiC training (Sample-Rank) is more robust and effective than using raw human feedback data (Direct).
SLiC-HF enables memory savings (4x reduction in training weight memory) and faster training steps compared to RLHF-PPO.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning from Human Feedback (RLHF)
Contrastive Learning
Language Modeling (sequence likelihood)

Key Terms

SLiC: Sequence Likelihood Calibration—a method to align the model's assigned probabilities with a ranking of candidate sequences

RLHF: Reinforcement Learning from Human Feedback—fine-tuning models to maximize a reward signal derived from human preferences

PPO: Proximal Policy Optimization—a standard reinforcement learning algorithm used in RLHF to update the policy while preventing drastic deviations

SFT: Supervised Fine-Tuning—the initial training phase using ground truth labels, used as a starting point for alignment

Off-policy: Learning from data generated by a different policy (model) than the one currently being trained

Ranking Model: A model trained to output which of two candidate summaries is better (pairwise), rather than assigning a single score

Reward Model: A model trained to assign a scalar score to a single summary (pointwise)

TL;DR: Too Long; Didn't Read—a dataset of Reddit posts and their user-written summaries

ROUGE: Recall-Oriented Understudy for Gisting Evaluation—a standard metric for summarization based on n-gram overlap with a reference