Beyond Reward Hacking: Causal Rewards for Large Language Model Alignment

Chaoqi Wang, Zhuokai Zhao, Yibo Jiang, Zhaorun Chen, Chen Zhu, Yuxin Chen, Jiayi Liu, Lizhu Zhang, Xiangjun Fan, Hao Ma, Si-Yuan Wang
Meta, University of Chicago
arXiv.org (2025)
RL Factuality

📝 Paper Summary

Reinforcement Learning from Human Feedback (RLHF) Reward Modeling AI Alignment
The paper introduces Causal Reward Modeling (CRM), which adds a regularization term based on Maximum Mean Discrepancy (MMD) so that reward models ignore spurious features such as response length or sycophancy.
Core Problem
Standard reward models in RLHF learn spurious correlations (like preferring longer answers regardless of quality), allowing models to 'hack' the reward function without actually improving alignment.
Why it matters:
  • Models develop harmful biases: favoring length over substance (length bias), agreeing with user errors (sycophancy), or discriminating against groups.
  • Increasing data size does not fix this and may worsen reward hacking by reinforcing these non-causal shortcuts.
  • Current mitigation strategies often target single biases (like length penalties) rather than the root cause of spurious correlations.
Concrete Example: If a reward model training set disproportionately labels long responses as 'better,' the model learns that length causes high reward. Consequently, the aligned LLM generates verbose, low-quality fluff to maximize this hacked reward.
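The length-bias mechanism above can be illustrated with a toy simulation (not from the paper; all numbers are illustrative): if annotators prefer the longer response 80% of the time regardless of quality, a Bradley-Terry reward model fit on a length feature learns a positive length weight.

```python
import numpy as np

# Toy preference data: the longer response is labeled 'better' 80% of the
# time, independent of actual quality (annotation bias).
rng = np.random.default_rng(0)
n = 2000
len_diff = rng.normal(0, 1, n)             # length(A) - length(B), standardized
longer_wins = rng.random(n) < 0.8          # biased annotation process
label = np.where(longer_wins, len_diff > 0, len_diff <= 0).astype(float)

# Reward model scoring on length alone: r(A) - r(B) = w * len_diff.
# Gradient descent on the Bradley-Terry (logistic) preference loss.
w = 0.0
for _ in range(200):
    p = 1 / (1 + np.exp(-w * len_diff))    # P(A preferred)
    w -= 0.1 * np.mean((p - label) * len_diff)

# w ends up positive: the model has learned 'longer => higher reward'.
```

This is the shortcut an aligned policy then exploits by padding responses.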
Key Novelty
Causal Reward Model (CRM) via Counterfactual Invariance
  • Treats biases (length, sycophancy) as 'spurious factors' in a causal graph that should not influence the true reward.
  • Enforces 'counterfactual invariance': the reward shouldn't change if only the spurious feature changes (e.g., same quality answer, different length).
  • Achieves this without needing perfect counterfactual data by adding a regularization term (MMD) that forces the model's representation to be independent of the spurious variable.
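The MMD regularizer described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it computes a squared MMD (RBF kernel, V-statistic) between reward-model representations grouped by a spurious variable Z (here, short vs. long responses), which would be added to the preference loss with a tuning weight.

```python
import numpy as np

def rbf_kernel(a, b, sigma=1.0):
    # Pairwise RBF kernel matrix between rows of a and b.
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def mmd2(x, y, sigma=1.0):
    # Squared Maximum Mean Discrepancy (biased V-statistic, always >= 0).
    kxx = rbf_kernel(x, x, sigma)
    kyy = rbf_kernel(y, y, sigma)
    kxy = rbf_kernel(x, y, sigma)
    return kxx.mean() + kyy.mean() - 2 * kxy.mean()

# Toy representations grouped by the spurious variable Z (short vs. long).
rng = np.random.default_rng(0)
h_short = rng.normal(0.0, 1.0, size=(64, 8))
h_long = rng.normal(0.0, 1.0, size=(64, 8))   # same distribution: small MMD
h_biased = rng.normal(1.5, 1.0, size=(64, 8))  # Z leaks into features: large MMD

penalty_invariant = mmd2(h_short, h_long)
penalty_biased = mmd2(h_short, h_biased)
# In training one would minimize: preference_loss + lam * mmd2(...)
```

Driving this penalty to zero forces the representation distribution to match across values of Z, so the reward cannot depend on the spurious feature.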
Evaluation Highlights
  • Reduces length bias significantly: win rate against the reference model improves while average response length decreases, compared to standard RLHF.
  • Mitigates sycophancy: On the sycophancy dataset, CRM reduces the rate of agreeing with incorrect user claims compared to vanilla RLHF.
  • Improves fairness: Reduces discrimination bias scores on benchmarks involving demographic groups compared to baselines.
Breakthrough Assessment
7/10
Offers a theoretically grounded, general-purpose solution to reward hacking using causal inference. While effective, it relies on identifying specific spurious variables (Z) beforehand.