Pretraining Language Models with Human Preferences

📝 Paper Summary

Language Model Alignment, Pretraining Objectives

Aligning language models with human preferences during pretraining via conditional training is more effective and robust than the standard practice of pretraining on raw data followed by finetuning.

Core Problem

Standard language models are pretrained to imitate internet text, essentially baking in undesirable behaviors (toxicity, PII leaks, bad code) that are difficult to fully remove during later finetuning.

Why it matters:

Models trained on raw data learn to imitate falsehoods, offensive comments, and buggy code, which violates safety and utility goals.
Post-hoc alignment (finetuning, RLHF) struggles because large models resist forgetting their training data, while filtering data beforehand severely reduces data quantity and diversity.

Concrete Example: When prompted with a toxic start, a standard MLE-pretrained model often continues with toxicity because it learned to imitate such patterns. Even after finetuning, models can be 'jailbroken' to reveal this underlying behavior.

Key Novelty

Pretraining with Human Feedback (PHF)

Instead of filtering bad data, keep it but tag it: pretrain the model on the full distribution but condition it on the 'quality' or 'safety' score of the text segments.
Use a reward function (e.g., toxicity classifier) to label text segments as <|good|> or <|bad|> during pretraining, allowing the model to learn world knowledge from all data while learning to generate only high-reward text at inference.

Architecture

Figure: conceptual comparison of Conventional Pretraining, Pretraining with Feedback (PHF), and Finetuning, illustrating the training flow and results of each.

Evaluation Highlights

Conditional training reduces the rate of undesirable content by up to an order of magnitude compared to standard pretraining (MLE).
Pretraining with feedback (PHF) outperforms the standard recipe of MLE pretraining followed by finetuning with feedback, achieving lower toxicity and PII rates.
Conditional training maintains downstream capabilities (GLUE, zero-shot tasks) comparable to standard MLE models, unlike filtering which harms performance.

Breakthrough Assessment

8/10

Challenges the dominant paradigm of 'pretrain then align' by showing that alignment should happen *during* pretraining. Simple method (conditional training) yields Pareto-optimal results.

⚙️ Technical Details

Problem Definition

Setting: Language Model Pretraining with access to a reward function R(x)

Inputs: Unlabeled document dataset D, segment-level reward function R

Outputs: Pretrained language model parameters theta

Pipeline Flow

Data Processing (Segment text, score with Reward Model)
Tokenization (Add control tokens <|good|> or <|bad|>)
Pretraining (Train GPT-2 small on modified objective)
Inference (Prompt with <|good|> to elicit aligned behavior)
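
The four pipeline steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: `split_into_segments`, `toy_reward`, the blocklist, and the threshold value are all hypothetical stand-ins (the real pipeline would score segments with a tool such as Detoxify and feed the tagged text to a tokenizer and trainer).

```python
from typing import Callable, List

GOOD, BAD = "<|good|>", "<|bad|>"


def split_into_segments(document: str) -> List[str]:
    """Hypothetical segmenter: one segment per non-empty line."""
    return [line.strip() for line in document.splitlines() if line.strip()]


def toy_reward(segment: str) -> float:
    """Hypothetical stand-in for a learned scorer: 0.0 if clean, -1.0 otherwise."""
    blocklist = {"idiot", "stupid"}
    words = {w.strip(".,!?").lower() for w in segment.split()}
    return -1.0 if words & blocklist else 0.0


def tag_document(document: str,
                 reward_fn: Callable[[str], float],
                 threshold: float) -> str:
    """Prepend a control token to every segment based on its reward."""
    tagged = []
    for segment in split_into_segments(document):
        token = GOOD if reward_fn(segment) >= threshold else BAD
        tagged.append(token + segment)
    return "".join(tagged)


def inference_prompt(user_prompt: str) -> str:
    """At inference, condition on the high-reward control token."""
    return GOOD + user_prompt


if __name__ == "__main__":
    doc = "The weather is nice today.\nYou are an idiot.\n"
    print(tag_document(doc, toy_reward, threshold=-0.5))
    print(inference_prompt("Once upon a time"))
```

Nothing is discarded in this sketch: low-reward segments stay in the training stream and are only marked, which is the property that distinguishes conditional training from filtering.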

System Modules

Reward Scorer

Assigns scalar rewards to text segments

Model or implementation: Task-dependent (Detoxify for toxicity, Scrubadub for PII, pycodestyle for code)
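
As a rough illustration of what a segment-level reward scorer looks like, the sketch below scores PII with two regular expressions. It is a hypothetical stand-in for Scrubadub, not its API: the patterns, the function name `pii_reward`, and the choice of "negative PII matches per character" as the reward are assumptions made for this example.

```python
import re

# Hypothetical, deliberately simple patterns; a real detector covers far more cases.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")


def pii_reward(segment: str) -> float:
    """Return minus the number of PII matches per character (0.0 means no PII found)."""
    if not segment:
        return 0.0
    hits = len(EMAIL_RE.findall(segment)) + len(PHONE_RE.findall(segment))
    return -hits / len(segment)


if __name__ == "__main__":
    print(pii_reward("The meeting is on Tuesday."))
    print(pii_reward("Mail bob@example.com or call 555-123-4567."))
```

Normalizing by segment length keeps the score comparable across short and long segments, mirroring the per-character metrics used in the evaluation.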

Conditional Trainer

Pretrains the LM to model the distribution of tokens conditional on reward tokens

Model or implementation: gpt2-small (124M parameters)

Novel Architectural Elements

Segment-level control tokens injected into the pretraining data stream based on reward thresholds.

Modeling

Base Model: gpt2-small (124M parameters)

Training Method: Pretraining with Human Feedback (PHF) using Conditional Training, compared against alternative objectives (Unlikelihood, RWR, AWR) and dataset filtering

Objective Functions:

Purpose: Maximize likelihood of text given a prepended quality control token.

Formally: L(x) = sum_i log P(token_i | token_<i, control_token)

Purpose: Minimize likelihood of low-quality text (Unlikelihood).

Formally: L_MLE(high_reward) + alpha * L_Unlikelihood(low_reward)

Purpose: Weight updates by reward (RWR).

Formally: L(x) = exp(beta * R(x)) * L_MLE(x)
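
To make the three objectives concrete, here is a small sketch that evaluates each one on per-token probabilities a model might assign. It follows the formulas as written in this summary and treats each as a quantity to maximize; that sign convention, the `Segment` type, the threshold used to split high and low reward, and all function names are assumptions of this example, and the paper's exact parametrization (for instance of beta) may differ.

```python
import math
from typing import List, Sequence, Tuple

# (per-token probabilities assigned by the model, segment-level reward)
Segment = Tuple[Sequence[float], float]


def log_likelihood(token_probs: Sequence[float]) -> float:
    """Standard MLE term: sum of log-probabilities of the observed tokens."""
    return sum(math.log(p) for p in token_probs)


def unlikelihood_term(token_probs: Sequence[float]) -> float:
    """Unlikelihood term: sum of log(1 - p), largest when p is small."""
    return sum(math.log(1.0 - p) for p in token_probs)


def conditional_objective(segments: List[Segment]) -> float:
    """Plain log-likelihood; the probabilities are assumed to already be
    conditioned on the control token prepended to each segment."""
    return sum(log_likelihood(probs) for probs, _ in segments)


def unlikelihood_objective(segments: List[Segment],
                           threshold: float, alpha: float) -> float:
    """MLE on high-reward segments plus alpha times unlikelihood on the rest."""
    total = 0.0
    for probs, reward in segments:
        if reward >= threshold:
            total += log_likelihood(probs)
        else:
            total += alpha * unlikelihood_term(probs)
    return total


def rwr_objective(segments: List[Segment], beta: float) -> float:
    """Log-likelihood of each segment weighted by exp(beta * reward)."""
    return sum(math.exp(beta * r) * log_likelihood(p) for p, r in segments)


if __name__ == "__main__":
    batch = [([0.5, 0.5], 0.0), ([0.25], -1.0)]  # made-up numbers
    print(conditional_objective(batch))
    print(unlikelihood_objective(batch, threshold=-0.5, alpha=1.0))
    print(rwr_objective(batch, beta=1.0))
```

Only the conditional objective leaves the loss itself untouched and moves the preference signal into the input, which is why it can reuse a standard language-modeling training loop.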

Trainable Parameters: All parameters (124M)

Training Data:

Toxicity/PII: 1.95M documents (3.32B tokens) subsampled from the Pile
Code: 1.5M Python files (3.32B tokens) from GitHub on BigQuery

Key Hyperparameters:

training_tokens: 3.32B
batch_size: Tuned per task-objective
learning_rate: Tuned per task-objective
unlikelihood_alpha: Weight (alpha) on the unlikelihood term in the Unlikelihood objective
RWR_beta: Coefficient (beta) scaling the reward inside the exponent of the RWR objective

Compute: Not reported in the paper

Comparison to Prior Work

vs. Dataset Filtering: PHF keeps the 'bad' data but tags it, preserving data diversity and capabilities.
vs. MLE + Finetuning: PHF introduces preferences from the start, preventing the model from ever learning to comfortably generate toxic/bad content.

Limitations

Experiments limited to relatively small models (GPT-2 Small, 124M parameters).
Only evaluated on 3 specific alignment tasks (toxicity, PII, PEP8 code); may not generalize to complex reasoning or truthfulness.
Requires a reward function that can run efficiently on the entire pretraining corpus.
Did not explore online RL methods (like PPO) during pretraining due to computational cost.

Reproducibility

Code: https://github.com/tomekkorbak/pretraining-with-human-feedback

Code and datasets are available at the repository above. The experiments use public datasets (The Pile, GitHub), and the reward models (Detoxify, Scrubadub, pycodestyle) are open source.

📊 Experiments & Results

Evaluation Setup

Generative evaluation on three tasks: Toxicity (RealToxicityPrompts), PII leakage (generating from training data), and Python Code Quality (PEP8 compliance).

Benchmarks:

RealToxicityPrompts (Toxicity generation)
PII Extraction (Privacy leakage) [New]
PEP8 Code Generation (Code quality) [New]
GLUE (Natural language understanding capabilities)
HumanEval (Code functional correctness)

Metrics:

Toxicity Probability (Detoxify)
PII instances per character
PEP8 violations per character
Perplexity (PPL)
GLUE Score

Statistical methodology: Not explicitly reported in the paper

Key Results

Comparison of Pretraining with Feedback (PHF) vs. Standard Finetuning approach. PHF (Conditional Training) consistently outperforms the 'Pretrain then Finetune' paradigm on alignment metrics.

Benchmark	Metric	Baseline	This Paper	Δ
Toxicity (Probability)	Toxicity Probability	0.016	0.011	-0.005
PII Leakage	PII/char	2.2e-4	0.5e-4	-1.7e-4

Adversarial Robustness. Conditional Training maintains lower toxicity even when prompted with toxic triggers (Red Teaming).

Benchmark	Metric	Baseline	This Paper	Δ
Red Teaming (Toxicity)	Toxicity Probability	0.45	0.25	-0.20

Downstream Capabilities. Conditional training does not degrade general capabilities unlike filtering.

Benchmark	Metric	Baseline	This Paper	Δ
GLUE	Average Score	78.0	78.0	0.0
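
The Δ column is the "This Paper" value minus the baseline. The short sketch below recomputes it from the table rows and also derives the relative change, which the table does not list; the helper names are introduced here for illustration only.

```python
# (baseline, this paper) pairs copied from the rows of the table above
rows = {
    "Toxicity (Probability)": (0.016, 0.011),
    "PII Leakage": (2.2e-4, 0.5e-4),
    "Red Teaming (Toxicity)": (0.45, 0.25),
    "GLUE": (78.0, 78.0),
}


def delta(baseline: float, ours: float) -> float:
    """Absolute difference, matching the table's Δ column."""
    return ours - baseline


def relative_change(baseline: float, ours: float) -> float:
    """Difference as a fraction of the baseline (negative means a reduction)."""
    return (ours - baseline) / baseline


if __name__ == "__main__":
    for name, (baseline, ours) in rows.items():
        print(f"{name}: delta={delta(baseline, ours):.3g}, "
              f"relative={relative_change(baseline, ours):+.1%}")
```

For the three alignment rows lower is better, so a negative Δ is an improvement; for GLUE higher is better, and a Δ of zero means capabilities are unchanged.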

Main Takeaways

Conditional training is Pareto-optimal across all three tasks (Toxicity, PII, Code), offering the best balance of alignment and capabilities.
Pretraining with human feedback is more effective than finetuning: it is harder to 'unlearn' bad behaviors (toxicity, PII memorization) than to never learn to generate them in the first place.
Filtering data helps alignment but hurts capabilities (perplexity, downstream tasks) by reducing data diversity. Conditional training avoids this by using all data.
Conditional training is more robust to adversarial prompting than standard MLE models.
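
The Pareto-optimality claim in the takeaways can be made concrete with a small dominance check. In the sketch below each method is a point (misalignment score, capability cost) where lower is better on both axes; the method names and numbers are made up for illustration and are NOT results from the paper.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]  # (misalignment score, capability cost), lower is better


def pareto_frontier(points: Dict[str, Point]) -> List[str]:
    """Return the names of points that no other point dominates."""
    frontier = []
    for name, (mis, cost) in points.items():
        dominated = any(
            (m2 <= mis and c2 <= cost) and (m2 < mis or c2 < cost)
            for other, (m2, c2) in points.items()
            if other != name
        )
        if not dominated:
            frontier.append(name)
    return frontier


if __name__ == "__main__":
    # Hypothetical objectives with made-up scores, for illustration only.
    example = {
        "objective_a": (0.010, 30.0),
        "objective_b": (0.012, 29.0),
        "objective_c": (0.020, 31.0),
    }
    print(pareto_frontier(example))
```

Here `objective_c` is dominated because `objective_a` is at least as good on both axes and strictly better on one, while `objective_a` and `objective_b` trade off against each other and both stay on the frontier.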

📚 Prerequisite Knowledge

Prerequisites

Language Model Pretraining (MLE)
Reinforcement Learning from Human Feedback (RLHF)
Tokenization and Autoregressive Generation

Key Terms

PHF: Pretraining with Human Feedback—incorporating human preferences (via reward models) directly into the pretraining objective rather than just finetuning.

MLE: Maximum Likelihood Estimation—the standard pretraining objective where the model maximizes the probability of the next token in the training data.

Conditional Training: A technique where control tokens (e.g., <|good|>, <|bad|>) are prepended to text segments based on their reward score, teaching the model to distinguish quality.

Unlikelihood Training: An objective that minimizes the probability of tokens in low-reward segments, effectively teaching the model what *not* to generate.

RWR: Reward-Weighted Regression—an offline RL objective that weights the standard language modeling loss by the exponentiated reward of the segment.

AWR: Advantage-Weighted Regression—an offline RL objective that weights updates by the 'advantage' (reward minus a learned value baseline).

PII: Personally Identifiable Information—sensitive data like phone numbers or email addresses that models should not memorize or generate.

Pareto frontier: The set of optimal trade-offs where no metric can be improved without degrading another (here, alignment vs. capabilities).

PEP8: The standard style guide for Python code, used here as a proxy for code quality preferences.