Autoregressive RM: A reward model that assigns a reward to each next token given the preceding tokens, parameterized as a log-probability distribution; the reward of a full sequence then factorizes into a sum of per-token rewards.
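A minimal toy sketch of this factorization, assuming the autoregressive RM is parameterized like a language model (the token table and its probabilities below are hypothetical, not from the source):

```python
import math

# Hypothetical per-token rewards: log-probabilities the autoregressive RM
# assigns to each next token given the history (assumed toy values).
token_log_probs = {
    ("<s>",): {"The": math.log(0.6), "A": math.log(0.4)},
    ("<s>", "The"): {"answer": math.log(0.7), "cat": math.log(0.3)},
}

def sequence_reward(tokens):
    """Sum per-token rewards r(y_t | y_<t) = log pi_r(y_t | y_<t),
    so the trajectory-level reward factorizes over tokens."""
    history = ("<s>",)
    total = 0.0
    for tok in tokens:
        total += token_log_probs[history][tok]
        history = history + (tok,)
    return total

reward = sequence_reward(["The", "answer"])  # log(0.6) + log(0.7)
```

Because each prefix already has a well-defined partial sum, the model can score incomplete sequences token by token, unlike a trajectory-level RM.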
Trajectory-level RM: A standard reward model that assigns a scalar score only to a complete text sequence, often failing to accurately score partial sequences.
Test-time alignment: Aligning an LLM's output to preferences during inference (decoding) without updating the model's weights.
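The decoding-time combination can be sketched as follows: at each step, the frozen base model's next-token log-probabilities are shifted by a token-level reward and renormalized. The distributions and the weight `beta` below are illustrative assumptions, not values from the source:

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

# Assumed toy next-token distributions over a 3-token vocabulary.
base_log_probs = [math.log(0.5), math.log(0.3), math.log(0.2)]    # frozen base LLM
reward_log_probs = [math.log(0.1), math.log(0.6), math.log(0.3)]  # token-level RM

def aligned_next_token_probs(base_lp, reward_lp, beta=1.0):
    """Guided decoding: p(y_t) proportional to p_base(y_t) * exp(beta * r(y_t)),
    computed per step without updating the base model's weights."""
    combined = [b + beta * r for b, r in zip(base_lp, reward_lp)]
    return softmax(combined)

probs = aligned_next_token_probs(base_log_probs, reward_log_probs)
```

With `beta = 0` the base distribution is recovered unchanged; larger `beta` pushes sampling further toward reward-preferred tokens.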
Weak-to-strong generalization: Using a smaller, weaker model (e.g., 7B RM) to supervise or guide a larger, stronger model (e.g., 70B LLM).
DPO: Direct Preference Optimization—a training-time method that fine-tunes LLMs on preference pairs without an explicit reward model.
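The DPO objective for a single preference pair can be sketched directly from its published form; the sequence log-probabilities below are assumed toy values:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one pair (chosen y_w, rejected y_l):
    -log sigma(beta * [(logp_w - ref_logp_w) - (logp_l - ref_logp_l)]),
    where ref_* are log-probs under the frozen reference (SFT) model."""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -math.log(sigmoid(beta * margin))

# Assumed toy sequence log-probabilities under policy and reference models.
loss = dpo_loss(logp_w=-10.0, logp_l=-12.0, ref_logp_w=-11.0, ref_logp_l=-11.5)
```

The implicit reward is the policy-to-reference log-ratio, which is why no separate reward model needs to be trained.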
KL divergence: Kullback-Leibler divergence, a measure of how one probability distribution differs from another; used here to ensure the aligned model's output distribution does not drift too far from the base model's, preserving its capabilities.
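A minimal sketch of the discrete KL divergence, applied to two toy next-token distributions (the numbers are illustrative assumptions):

```python
import math

def kl_divergence(p, q):
    """KL(p || q) = sum_i p_i * log(p_i / q_i); zero iff p == q,
    and it grows as p places mass where q does not."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

base = [0.5, 0.3, 0.2]     # base model's next-token distribution (toy values)
aligned = [0.4, 0.4, 0.2]  # aligned model's distribution (toy values)
drift = kl_divergence(aligned, base)
```

In alignment objectives this term is typically added as a penalty, so optimizing the reward is traded off against staying close to the base model.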
SFT: Supervised Fine-Tuning—the initial training phase of an LLM on high-quality instruction data.