ARGS: Alignment as Reward-Guided Search
University of Wisconsin-Madison · Stanford University
International Conference on Learning Representations (2024)
📝 Paper Summary
LLM Alignment · Decoding Strategies · AI Safety
ARGS aligns language models at decoding time by adjusting next-token probabilities with a reward signal, eliminating the need for expensive reinforcement-learning training such as PPO.
Core Problem
Standard alignment methods like RLHF with PPO are computationally expensive, unstable to train, and require extensive retraining whenever reward models or objectives change.
Why it matters:
Training instability and high resource costs of PPO limit accessibility for many researchers
Rigid training phases prevent models from rapidly adapting to new safety guidelines or user preferences without full retraining
Misaligned models can generate harmful or unhelpful content, posing safety risks in real-world deployments
Concrete Example: When asked 'Can you help me set up a light show?', a standard greedy decoder might repeat unhelpful clarifying questions. ARGS, guided by a reward model, immediately generates a structured plan with specific equipment and steps.
Key Novelty
Alignment as Reward-Guided Search (ARGS)
Integrates alignment directly into the token decoding process rather than updating model weights via training
Modifies the probability of the next token by combining the base model's likelihood with a weighted signal from a reward model
Treats text generation as a search problem where the objective is to maximize a combined score of semantic coherence and human preference reward
Architecture
Conceptual diagram of the ARGS decoding process at a single time step
Evaluation Highlights
+19.56% improvement in average reward compared to greedy decoding baselines on the HH-RLHF dataset
Achieves a 64.33% win-tie rate against baseline methods in GPT-4 based evaluation for helpfulness and harmlessness
Demonstrates consistent improvements across multiple model architectures (LLaMA-7B, OPT-1.3b, OPT-2.7b) and alignment tasks (HH-RLHF, SHP)
Breakthrough Assessment
7/10
Offers a lightweight, training-free alternative to RLHF. While computationally more expensive at inference time than vanilla decoding, it provides significant flexibility and alignment improvements without unstable PPO training.
⚙️ Technical Details
Problem Definition
Setting: Open-ended text generation aligned with human preferences defined by a reward model
Inputs: Context x_{<t} (the prompt plus tokens generated so far)
Outputs: Next token v that balances language modeling probability and reward maximization
Pipeline Flow
1. Base LM Prediction (computes logits for the next token)
2. Candidate Selection (keeps the top-k tokens)
3. Reward Evaluation (computes a reward for each candidate continuation)
4. Score Aggregation (combines LM log-probability and reward)
5. Token Selection (emits the highest-scoring token)
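The five-step loop above can be sketched in a few lines of Python. This is a minimal illustration over a toy vocabulary: `log_probs`, `reward_fn`, and the dictionary-based candidate set are stand-ins for real model calls, and the scoring formula follows the interpolated form given in this summary.

```python
def args_decode_step(log_probs, reward_fn, context, w=0.5, k=10):
    """One ARGS decoding step over a toy vocabulary.

    log_probs : dict token -> log P_LM(token | context)   (step 1)
    reward_fn : callable scoring the continuation context + [token]
    w, k      : reward weight and candidate-pool size
    """
    # Step 2: candidate selection -- keep only the top-k most likely tokens.
    top_k = sorted(log_probs, key=log_probs.get, reverse=True)[:k]
    # Steps 3-4: reward evaluation and score aggregation, using the
    # interpolated score stated in this summary: (1-w)*log P_LM + w*reward.
    scores = {v: (1 - w) * log_probs[v] + w * reward_fn(context + [v])
              for v in top_k}
    # Step 5: ARGS-greedy selects the candidate with the highest combined score.
    return max(scores, key=scores.get)
```

With w = 0 this reduces to ordinary greedy decoding; increasing w shifts selection toward high-reward continuations at the cost of one reward-model call per candidate per step.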
System Modules
Base Language Model
Predicts the probability distribution of the next token based on context
Model or implementation: LLaMA-7B-SFT (or OPT variants)
Reward Model
Assigns a scalar reward to the potential continuation formed by appending a candidate token
Model or implementation: Fine-tuned LLaMA-7B or OPT variants (trained on preference data)
Scoring Mechanism
Combines LM probability and reward into a final score
Model or implementation: Analytical formula: score(v) = (1 - w) · log P_LM(v | x_{<t}) + w · r([x_{<t}, v])
Novel Architectural Elements
Reward-Guided Scoring Function: linearly interpolates between log-likelihood and reward scalar during the decoding loop
Lookahead Reward Calculation: explicitly computes reward model forward passes for top-k candidate tokens at every decoding step
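A quick arithmetic check of the interpolated scoring function described above (all numbers are illustrative, not taken from the paper):

```python
def score(log_p, reward, w):
    # Interpolated ARGS score as stated above: (1 - w) * log P_LM + w * reward.
    return (1 - w) * log_p + w * reward

# Candidate A is more likely under the LM; candidate B earns a higher reward.
score_a = score(log_p=-0.2, reward=0.1, w=0.5)  # 0.5*(-0.2) + 0.5*0.1 = -0.05
score_b = score(log_p=-1.0, reward=1.5, w=0.5)  # 0.5*(-1.0) + 0.5*1.5 =  0.25
# At w = 0.5 the reward tips selection toward B; at w = 0 the score collapses
# to the LM log-probability and greedy decoding would have picked A.
```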
Modeling
Base Model: LLaMA-7B (fine-tuned on preferred HH-RLHF responses)
Training Method: Supervised Fine-Tuning (SFT) + Reward Modeling (RM)
Objective Functions:
Purpose: Train the reward model to assign higher scores to preferred responses.
Formally: Loss = -log(sigmoid(r(x, y_w) - r(x, y_l))) where y_w is preferred over y_l
Adaptation: Full fine-tuning for SFT and Reward Model training
Trainable Parameters: Not explicitly reported in the paper
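The pairwise reward-modeling loss above is straightforward to write down directly. This is a scalar sketch; actual training would batch it over preference pairs in a framework such as PyTorch.

```python
import math

def reward_model_loss(r_preferred, r_rejected):
    """-log(sigmoid(r(x, y_w) - r(x, y_l))): small when the reward model
    ranks the preferred response y_w well above the rejected response y_l."""
    margin = r_preferred - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

The loss equals log 2 when the two rewards tie and shrinks toward zero as the margin between preferred and rejected rewards grows.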
Code is publicly available at https://github.com/deeplearning-wisc/args. The paper uses open-source models (LLaMA, OPT) and datasets (HH-RLHF, SHP). Hyperparameters for decoding (w, k) are explicitly stated.
📊 Experiments & Results
Evaluation Setup
Dialogue generation evaluated on helpfulness and harmlessness
Benchmarks:
HH-RLHF (Helpful and Harmless dialogue preferences)
Stanford Human Preferences (SHP) (General Preference)
Metrics:
Average Reward (using the same RM as decoding)
GPT-4 Win-Tie Rate
Diversity (n-gram repetition)
Coherence (SimCSE cosine similarity)
Statistical methodology: Not explicitly reported in the paper
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
| --- | --- | --- | --- | --- |
| HH-RLHF | Average Reward | 0.199 | 0.238 | +0.039 |
| HH-RLHF | GPT-4 Win-Tie Rate (%) | 35.67 | 64.33 | +28.66 |
| HH-RLHF | Diversity | 11.69 | 24.97 | +13.28 |
| HH-RLHF | Coherence | 0.78 | 0.76 | -0.02 |
| SHP | GPT-4 Win-Tie Rate (%) | 27.67 | 72.33 | +44.66 |
Experiment Figures
Impact of the reward weight (w) on Average Reward, Diversity, and Coherence
Performance comparison across different model sizes (OPT-1.3b, OPT-2.7b) on the SHP dataset
Main Takeaways
Incorporating reward signals during decoding significantly improves alignment with human preferences without model retraining
The method trades a small loss in semantic coherence (when the reward weight w is set too high) for substantially better reward optimization
Performance is robust across different model sizes (OPT-1.3b to LLaMA-7B) and datasets (HH-RLHF, SHP)
Greedy selection within ARGS (ARGS-greedy) generally outperforms stochastic sampling (ARGS-stochastic) for alignment purposes
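For reference, the ARGS-stochastic variant mentioned in the last point can be sketched as sampling from a softmax over the same combined scores, in place of the greedy argmax. The function name and `temperature` parameter here are illustrative, not from the paper.

```python
import math
import random

def args_stochastic_step(scores, temperature=1.0, rng=random):
    """Sample the next token from softmax(combined score / temperature)
    instead of taking the greedy argmax used by ARGS-greedy."""
    tokens = list(scores)
    logits = [scores[t] / temperature for t in tokens]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - peak) for l in logits]
    # random.choices treats weights as relative, so no normalization needed.
    return rng.choices(tokens, weights=weights, k=1)[0]
```

Sampling reintroduces lower-scoring tokens, which may be why the paper finds the greedy variant better for alignment.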