Reinforcement Learning from Human Feedback (RLHF) · Mechanistic Interpretability · Reward Modeling
SARM enhances reward model interpretability by projecting hidden states into sparse features using a Sparse Autoencoder, allowing rewards to be calculated as a linear combination of understandable concepts.
Core Problem
Traditional scalar reward models in RLHF are opaque black boxes that offer no explanation for their scores and cannot be easily adjusted when user preferences shift.
Why it matters:
Opacity prevents verifying if models align with human values or merely exploit spurious correlations in training data
Static reward models cannot adapt to changing user needs without expensive retraining or fine-tuning
Existing multidimensional reward models increase annotation costs significantly and still lack granular feature-level transparency
Concrete Example: A scalar reward model might assign a low score to a helpful response without explanation. It is unclear whether the penalty is due to factual inaccuracy, safety violations, or tone. SARM decomposes this score into features like 'ethical reasoning' (positive) or 'hallucination' (negative), enabling precise diagnosis.
Key Novelty
Sparse Autoencoder-enhanced Reward Model (SARM)
Integrates a pretrained Sparse Autoencoder (SAE) into the intermediate layers of a reward model to translate dense neural activations into sparse, human-understandable features
Computes the final scalar reward as a weighted sum of these interpretable features, making the reward assignment explicitly decomposable
Enables 'steering' of the reward model by manually adjusting the weights of specific features (e.g., upweighting 'safety') without retraining
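Since the reward is an explicit linear combination of sparse features, steering reduces to editing one weight in the value head. A minimal sketch of this idea in NumPy, where the latent size, the active features, and feature index 7 standing in for a 'safety' feature are all hypothetical toy values, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

M = 16                              # toy SAE latent dimension
z = np.zeros(M)                     # sparse SAE feature activations for one response
z[[2, 7, 11]] = [0.5, 1.2, 0.3]     # only K = 3 features are active

w = rng.normal(size=M)              # learned value-head weights

reward_before = w @ z               # reward = linear combination of features
w[7] += 2.0                         # upweight the hypothetical 'safety' feature
reward_after = w @ z

# Only responses that activate feature 7 see their reward shift,
# and the shift is exactly the weight delta times that activation.
assert np.isclose(reward_after - reward_before, 2.0 * z[7])
```

Responses whose feature 7 is inactive (z[7] = 0) are untouched, which matches the paper's observation that steering one feature leaves unrelated reward distributions largely intact.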
Architecture
Figure: The training pipeline of SARM, showing the transition from standard LLM layers to the Sparse Autoencoder and finally to the linear reward head.
Evaluation Highlights
SARM achieves superior performance relative to conventional reward models on RewardBench 2, particularly in safety and alignment metrics
Case studies demonstrate successful manipulation of reward distributions for specific features (e.g., safety) by adjusting feature weights, with minimal impact on unrelated distributions
Identifies clear monosemantic features (e.g., 'ethical considerations', 'mathematical reasoning') that correlate with human preferences
Breakthrough Assessment
8/10
Significant step for RLHF transparency. Successfully bridges mechanistic interpretability (SAEs) with practical reward modeling, offering both explanation and control without needing expensive multidimensional labels.
⚙️ Technical Details
Problem Definition
Setting: Reward Modeling for RLHF, specifically estimating a scalar reward r(x, y) representing human preference for response y given input x
Inputs: Input context x and candidate response y
Outputs: Scalar reward score r
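Reward models in this setting are conventionally trained on pairwise preferences with the Bradley-Terry objective; a minimal sketch of that standard loss (the paper's exact training recipe may differ):

```python
import math

def bt_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry pairwise loss: -log sigmoid(r_chosen - r_rejected).

    The model is trained so the chosen response receives a higher
    scalar reward than the rejected one.
    """
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# The loss shrinks as the reward margin in favor of the chosen response grows.
assert bt_loss(2.0, 0.0) < bt_loss(0.5, 0.0) < bt_loss(0.0, 0.0)
```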
Pipeline Flow
LLM Backbone (Process input sequence)
Feature Extraction (Extract last token hidden state at layer L)
Sparse Autoencoder (Project hidden state to sparse feature space)
Linear Value Head (Aggregate sparse features into scalar reward)
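The four pipeline stages above can be sketched end to end. All shapes and parameters here are illustrative toy values; the real system uses a Llama-3-8B backbone with a pretrained TopK SAE at the middle layer:

```python
import numpy as np

rng = np.random.default_rng(1)
d, M, K = 32, 128, 8                 # hidden size, SAE latent dim, active features

def backbone_hidden_state(tokens):
    # Stand-in for the last-token hidden state at layer L of the LLM backbone.
    return rng.normal(size=d)

def sae_encode(h, W_enc, b_enc):
    # TopK SAE encoder: linear projection, then keep only the K largest
    # pre-activations and zero out the rest.
    pre = W_enc @ h + b_enc
    z = np.zeros_like(pre)
    top = np.argsort(pre)[-K:]
    z[top] = np.maximum(pre[top], 0.0)   # ReLU on the surviving latents
    return z

W_enc, b_enc = rng.normal(size=(M, d)), np.zeros(M)
w_head = rng.normal(size=M)              # linear value head

h = backbone_hidden_state(["tok"])       # 1-2) backbone + feature extraction
z = sae_encode(h, W_enc, b_enc)          # 3) sparse, interpretable features
reward = float(w_head @ z)               # 4) scalar reward

assert np.count_nonzero(z) <= K          # sparsity enforced by TopK
```

Because the head is linear, each active latent's contribution to the reward is simply `w_head[i] * z[i]`, which is what makes the score decomposable feature by feature.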
System Modules
LLM Backbone
Encodes the input text into semantic hidden states
Model or implementation: Llama-3 (8B)
Sparse Autoencoder (SAE)
Decomposes dense hidden states into sparse, interpretable features
Model or implementation: TopK SAE (M latent dimension, K active features)
Value Head
Computes the final reward score as a linear combination of active features
Model or implementation: Linear layer with learnable weights w
Novel Architectural Elements
Integration of a frozen, pretrained SAE into the intermediate layer of a Reward Model
Replacement of the standard MLP value head with a linear aggregation over SAE features, enforcing feature-level explainability
rm_layer_depth: Middle layer (1/2 depth of backbone)
Compute: Not reported in the paper
Comparison to Prior Work
vs. ArmoRM/HelpSteer2: SARM achieves interpretability without expensive multidimensional annotation costs
vs. Vanilla Scalar RM: SARM provides feature-level explanations and steerability, whereas scalar RMs are black boxes
vs. Dictionary Learning (Vanilla SAE) [not cited in paper]: SARM applies SAEs specifically to the Reward Modeling task rather than just general LLM interpretability
Limitations
Dead latents in SAE result in fewer usable features than the theoretical maximum dimension M
Manual interpretation of features via GPT-4o is still required to assign semantic labels to the learned features
The approach relies on the quality of the pretrained SAE; if the SAE fails to capture relevant concepts, the RM performance may suffer
Code is publicly available at https://github.com/schrieffer-z/sarm. The paper specifies datasets (OpenWebText2, Skywork-Reward-Preference-80K) and base models (Llama-3-8B), enabling replication of the pipeline.
📊 Experiments & Results
Evaluation Setup
Pairwise preference prediction on standard benchmarks
Benchmarks:
RewardBench 2 (Preference evaluation across Chat, Chat-Hard, Safety, and Reasoning)
Metrics:
Accuracy (Preference prediction)
Statistical methodology: Not explicitly reported in the paper
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
| --- | --- | --- | --- | --- |
| RewardBench 2 | Overall Score | 82.6 | Not reported in the paper | - |
Experiment Figures
Figure: Comparison between Scalar RMs, Multidimensional RMs, and SARM regarding interpretability and annotation cost.
Main Takeaways
SARM successfully extracts human-interpretable features (e.g., 'calculations', 'ethical considerations') from reward model activations.
Modulating the weights of specific features in the value head directly shifts the reward distribution for relevant inputs (e.g., increasing safety feature weight increases rewards for safe responses), proving controllability.
The method offers a trade-off: it gains significant interpretability and steerability without degrading alignment performance compared to black-box baselines (qualitative result).
📚 Prerequisite Knowledge
Prerequisites
Reinforcement Learning from Human Feedback (RLHF)
Bradley-Terry preference model
Autoencoders and Sparse Dictionary Learning
Transformer architecture basics
Key Terms
RLHF: Reinforcement Learning from Human Feedback—a method to train AI models using human preferences as a reward signal
SAE: Sparse Autoencoder—a neural network trained to decompose dense activations into a sparse set of interpretable features
Reward Model: A model trained to predict human preferences, outputting a score used to guide the generation policy
Monosemanticity: The property where a single neuron or feature corresponds to a single, distinct concept (e.g., 'references to code' or 'angry tone')
Dead Latents: Neurons in the autoencoder that never activate significantly during training or inference
JumpReLU: A specific activation function used in some SAEs (though this paper uses TopK) that zeros out values below a threshold
TopK SAE: A type of Sparse Autoencoder that enforces sparsity by keeping only the K highest activation values and setting the rest to zero
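The two sparsity mechanisms defined above differ in how they decide which activations survive; a toy contrast, where the threshold and K are arbitrary illustrative values:

```python
import numpy as np

acts = np.array([0.05, 0.9, -0.3, 0.4, 0.08, 1.5])

def jumprelu(x, theta=0.1):
    # JumpReLU: zero every value at or below a fixed threshold theta.
    return np.where(x > theta, x, 0.0)

def topk(x, k=2):
    # TopK: keep only the k largest values, zero the rest,
    # so exactly k latents can be active regardless of scale.
    out = np.zeros_like(x)
    idx = np.argsort(x)[-k:]
    out[idx] = x[idx]
    return out

print(jumprelu(acts))  # [0.  0.9 0.  0.4 0.  1.5]
print(topk(acts))      # [0.  0.9 0.  0.  0.  1.5]
```

JumpReLU's sparsity level varies with the input, while TopK (used in this paper) fixes the number of active features directly.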