ARGS: Alignment as Reward-Guided Search
University of Wisconsin-Madison · Stanford University
International Conference on Learning Representations (2024)
📝 Paper Summary
LLM Alignment · Decoding Strategies · AI Safety
ARGS aligns language models at decoding time by adjusting next-token probabilities with a reward signal, eliminating the need for expensive reinforcement-learning training such as PPO.
Core Problem
Standard alignment methods like RLHF with PPO are computationally expensive, unstable to train, and require extensive retraining whenever reward models or objectives change.
Why it matters:
Training instability and high resource costs of PPO limit accessibility for many researchers
Rigid training phases prevent models from rapidly adapting to new safety guidelines or user preferences without full retraining
Misaligned models can generate harmful or unhelpful content, posing safety risks in real-world deployments
Concrete Example: When asked 'Can you help me set up a light show?', a standard greedy decoder might repeat unhelpful clarifying questions. ARGS, guided by a reward model, immediately generates a structured plan with specific equipment and steps.
Key Novelty
Alignment as Reward-Guided Search (ARGS)
Integrates alignment directly into the token decoding process rather than updating model weights via training
Modifies the probability of the next token by combining the base model's likelihood with a weighted signal from a reward model
Treats text generation as a search problem where the objective is to maximize a combined score of semantic coherence and human preference reward
Architecture
Conceptual diagram of the ARGS decoding process at a single time step
Evaluation Highlights
+19.56% improvement in average reward compared to greedy decoding baselines on the HH-RLHF dataset
Achieves a 64.33% win-tie rate against baseline methods in GPT-4 based evaluation for helpfulness and harmlessness
Demonstrates consistent improvements across multiple model architectures (LLaMA-7B, OPT-1.3b, OPT-2.7b) and alignment tasks (HH-RLHF, SHP)
Breakthrough Assessment
7/10
Offers a lightweight, training-free alternative to RLHF. While computationally more expensive at inference time than vanilla decoding, it provides significant flexibility and alignment improvements without unstable PPO training.
⚙️ Technical Details
Problem Definition
Setting: Open-ended text generation aligned with human preferences defined by a reward model
Inputs: Context x_{<t} (the prompt plus tokens generated so far)
Outputs: Next token v that balances language modeling probability and reward maximization
Pipeline Flow
1. Base LM Prediction (computes logits for the next token)
2. Candidate Selection (keeps the top-k tokens)
3. Reward Evaluation (computes a reward for each candidate continuation)
4. Score Aggregation (combines LM log-probability and reward)
5. Token Selection (emits the highest-scoring token)
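The five-step loop above can be sketched in a few lines of Python. This is a minimal illustration over a toy vocabulary: `log_probs`, `reward_fn`, and the dictionary-based candidate set are stand-ins for real model calls, and the scoring formula follows the interpolated form given in this summary.

```python
def args_decode_step(log_probs, reward_fn, context, w=0.5, k=10):
    """One ARGS decoding step over a toy vocabulary.

    log_probs : dict token -> log P_LM(token | context)   (step 1)
    reward_fn : callable scoring the continuation context + [token]
    w, k      : reward weight and candidate-pool size
    """
    # Step 2: candidate selection -- keep only the top-k most likely tokens.
    top_k = sorted(log_probs, key=log_probs.get, reverse=True)[:k]
    # Steps 3-4: reward evaluation and score aggregation, using the
    # interpolated score stated in this summary: (1-w)*log P_LM + w*reward.
    scores = {v: (1 - w) * log_probs[v] + w * reward_fn(context + [v])
              for v in top_k}
    # Step 5: ARGS-greedy selects the candidate with the highest combined score.
    return max(scores, key=scores.get)
```

With w = 0 this reduces to ordinary greedy decoding; increasing w shifts selection toward high-reward continuations at the cost of one reward-model call per candidate per step.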
System Modules
Base Language Model
Predicts the probability distribution of the next token based on context
Model or implementation: LLaMA-7B-SFT (or OPT variants)
Reward Model
Assigns a scalar reward to the potential continuation formed by appending a candidate token
Model or implementation: Fine-tuned LLaMA-7B or OPT variants (trained on preference data)
Scoring Mechanism
Combines LM probability and reward into a final score
Model or implementation: Analytical formula: score(v) = (1 - w) · log P_LM(v | x_{<t}) + w · r([x_{<t}, v])
Novel Architectural Elements
Reward-Guided Scoring Function: linearly interpolates between log-likelihood and reward scalar during the decoding loop
Lookahead Reward Calculation: explicitly computes reward model forward passes for top-k candidate tokens at every decoding step
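A quick arithmetic check of the interpolated scoring function described above (all numbers are illustrative, not taken from the paper):

```python
def score(log_p, reward, w):
    # Interpolated ARGS score as stated above: (1 - w) * log P_LM + w * reward.
    return (1 - w) * log_p + w * reward

# Candidate A is more likely under the LM; candidate B earns a higher reward.
score_a = score(log_p=-0.2, reward=0.1, w=0.5)  # 0.5*(-0.2) + 0.5*0.1 = -0.05
score_b = score(log_p=-1.0, reward=1.5, w=0.5)  # 0.5*(-1.0) + 0.5*1.5 =  0.25
# At w = 0.5 the reward tips selection toward B; at w = 0 the score collapses
# to the LM log-probability and greedy decoding would have picked A.
```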
Modeling
Base Model: LLaMA-7B (fine-tuned on preferred HH-RLHF responses)
Training Method: Supervised Fine-Tuning (SFT) + Reward Modeling (RM)
Objective Functions:
Purpose: Train the reward model to assign higher scores to preferred responses.
Formally: Loss = -log(sigmoid(r(x, y_w) - r(x, y_l))) where y_w is preferred over y_l
Adaptation: Full fine-tuning for SFT and Reward Model training
Trainable Parameters: Not explicitly reported in the paper
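The pairwise reward-modeling loss above is straightforward to write down directly. This is a scalar sketch; actual training would batch it over preference pairs in a framework such as PyTorch.

```python
import math

def reward_model_loss(r_preferred, r_rejected):
    """-log(sigmoid(r(x, y_w) - r(x, y_l))): small when the reward model
    ranks the preferred response y_w well above the rejected response y_l."""
    margin = r_preferred - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

The loss equals log 2 when the two rewards tie and shrinks toward zero as the margin between preferred and rejected rewards grows.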
Code is publicly available at https://github.com/deeplearning-wisc/args. The paper uses open-source models (LLaMA, OPT) and datasets (HH-RLHF, SHP). Hyperparameters for decoding (w, k) are explicitly stated.
📊 Experiments & Results
Evaluation Setup
Dialogue generation evaluated on helpfulness and harmlessness
Benchmarks:
HH-RLHF (Helpful and Harmless dialogue preferences)
Stanford Human Preferences (SHP) (General Preference)
Metrics:
Average Reward (using the same RM as decoding)
GPT-4 Win-Tie Rate
Diversity (n-gram repetition)
Coherence (SimCSE cosine similarity)
Statistical methodology: Not explicitly reported in the paper
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
| --- | --- | --- | --- | --- |
| HH-RLHF | Average Reward | 0.199 | 0.238 | +0.039 |
| HH-RLHF | GPT-4 Win-Tie Rate (%) | 35.67 | 64.33 | +28.66 |
| HH-RLHF | Diversity | 11.69 | 24.97 | +13.28 |
| HH-RLHF | Coherence | 0.78 | 0.76 | -0.02 |
| SHP | GPT-4 Win-Tie Rate (%) | 27.67 | 72.33 | +44.66 |
Experiment Figures
Impact of the reward weight (w) on Average Reward, Diversity, and Coherence
Performance comparison across different model sizes (OPT-1.3b, OPT-2.7b) on the SHP dataset
Main Takeaways
Incorporating reward signals during decoding significantly improves alignment with human preferences without model retraining
The method trades a small loss in semantic coherence (when the reward weight w is set too high) for substantially better reward optimization
Performance is robust across different model sizes (OPT-1.3b to LLaMA-7B) and datasets (HH-RLHF, SHP)
Greedy selection within ARGS (ARGS-greedy) generally outperforms stochastic sampling (ARGS-stochastic) for alignment purposes
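For reference, the ARGS-stochastic variant mentioned in the last point can be sketched as sampling from a softmax over the same combined scores, in place of the greedy argmax. The function name and `temperature` parameter here are illustrative, not from the paper.

```python
import math
import random

def args_stochastic_step(scores, temperature=1.0, rng=random):
    """Sample the next token from softmax(combined score / temperature)
    instead of taking the greedy argmax used by ARGS-greedy."""
    tokens = list(scores)
    logits = [scores[t] / temperature for t in tokens]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - peak) for l in logits]
    # random.choices treats weights as relative, so no normalization needed.
    return rng.choices(tokens, weights=weights, k=1)[0]
```

Sampling reintroduces lower-scoring tokens, which may be why the paper finds the greedy variant better for alignment.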