RLHF: Reinforcement Learning from Human Feedback—aligning AI models by training them to maximize a reward model learned from human preferences
reward hacking: When an AI exploits flaws in a reward model to get a high score without actually achieving the intended goal (e.g., writing long gibberish because length correlates with score)
spurious correlation: A statistical pattern that looks like a cause but isn't (e.g., 'longer answers are better' is a correlation, not a causal rule)
causal factors: Latent variables that are necessary and sufficient to determine the quality/reward of a response
gradient reversal layer: A network layer that flips the sign of gradients during backpropagation, used here to make the encoder unlearn information that the adversary tries to predict
information bottleneck: A technique that restricts the amount of information a representation can hold, forcing the model to keep only the most essential features
sycophancy: The tendency of a model to agree with the user's stated views or biases rather than answer truthfully
KL divergence: Kullback–Leibler divergence—a measure of how one probability distribution differs from another (asymmetric, so not a true distance metric); used here as a regularizer
PPO: Proximal Policy Optimization—the standard reinforcement learning algorithm used to train the language model policy
SFT: Supervised Fine-Tuning—the initial phase of training a model on high-quality examples before RLHF
backbone: The pre-trained language model (e.g., Qwen) used to extract features from text
MMD: Maximum Mean Discrepancy—a statistical measure of the distance between two probability distributions
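To make the gradient reversal entry concrete, here is a minimal toy sketch of the mechanism with manual backpropagation. This is an illustration only, not the document's implementation (real code would typically hook into an autograd framework, e.g. a custom `torch.autograd.Function` whose `backward` flips the sign); the adversary head, weights, and "spurious attribute" target are all hypothetical.

```python
import numpy as np

def grl_forward(z):
    # The gradient reversal layer is the identity in the forward pass.
    return z

def grl_backward(grad_from_adversary, lambd=1.0):
    # In the backward pass it flips the gradient's sign (scaled by lambd),
    # so the encoder is updated to *increase* the adversary's loss.
    return -lambd * grad_from_adversary

# Toy setup: a linear adversary head tries to predict a spurious
# attribute (e.g. response length) from the encoder features z.
z = np.array([0.5, -1.0])              # encoder output (features)
w = np.array([2.0, 1.0])               # adversary weights
target = 3.0                           # spurious attribute to predict

pred = w @ grl_forward(z)              # adversary prediction
grad_pred = 2.0 * (pred - target)      # d(squared error)/d(pred)
grad_z_adv = grad_pred * w             # gradient the adversary sends back
grad_z_enc = grl_backward(grad_z_adv)  # sign-flipped gradient the encoder receives
```

Because the encoder descends along the reversed gradient, it is pushed to discard whatever information the adversary was using, which is how the spurious feature gets unlearned.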
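The asymmetry noted in the KL divergence entry is easy to verify numerically. A small self-contained example for discrete distributions (the distributions `p` and `q` are made up for illustration):

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions given as probability lists.

    Terms with p_i = 0 contribute zero by convention.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]
q = [0.9, 0.1]

kl_pq = kl_divergence(p, q)  # penalizes q's low mass on the second outcome
kl_qp = kl_divergence(q, p)  # a different value: KL is not symmetric
```

The direction matters in practice: in RLHF-style objectives, which argument is the policy and which is the reference changes what behavior the regularizer penalizes.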
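Similarly, the MMD entry can be illustrated with the standard biased estimator of squared MMD under an RBF kernel. This is a generic sketch of the statistic, not the document's code; the bandwidth `sigma` and the sample data are assumptions for the demo:

```python
import numpy as np

def rbf_kernel(X, Y, sigma=1.0):
    """RBF (Gaussian) kernel matrix between rows of X and rows of Y."""
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def mmd2(X, Y, sigma=1.0):
    """Biased estimate of squared MMD between samples X and Y."""
    return (rbf_kernel(X, X, sigma).mean()
            + rbf_kernel(Y, Y, sigma).mean()
            - 2.0 * rbf_kernel(X, Y, sigma).mean())

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(50, 2))  # samples from one distribution
Y = rng.normal(3.0, 1.0, size=(50, 2))  # samples from a shifted distribution
```

Identical samples give an estimate of exactly zero, while samples from clearly different distributions give a positive value, which is what makes MMD usable as a distribution-matching penalty.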