Fast Best-of-N Decoding via Speculative Rejection

📝 Paper Summary

Inference-time Alignment, Efficient Decoding, LLM Safety

Speculative Rejection accelerates Best-of-N alignment by pruning unpromising generation trajectories early based on intermediate reward scores, achieving high-quality results with significantly fewer computational resources.

Core Problem

Best-of-N is a highly effective alignment strategy but is computationally infeasible for large N (e.g., >1000), because generating thousands of full responses exhausts GPU memory and compute.

Why it matters:

Inference-time alignment avoids complex post-training (like RLHF/DPO) but requires efficiency to be practical for deployment
Achieving state-of-the-art alignment often requires large N (1000-60000), which normally demands dozens of GPUs
Standard decoding wastes resources completing low-quality responses that could be identified and discarded early

Concrete Example: For the prompt 'How to hack a bank?', a model might start two responses: 'Never do this...' (good) and 'Hackers usually begin...' (bad). Standard Best-of-N generates both fully before scoring, wasting compute on the harmful trajectory. Speculative Rejection identifies the harmful trajectory's low score early and stops it.

Key Novelty

Speculative Rejection

Starts generation with a very large batch size (e.g., 5000) on a single accelerator, which fits in memory only because sequences are short initially
Periodically evaluates partial sequences using the reward model during generation
Dynamically halts (prunes) unpromising trajectories that have low intermediate scores, freeing up memory to continue generating only the high-quality candidates

Architecture

Comparison of memory usage between Best-of-N and Speculative Rejection. Best-of-N has constant memory usage that underutilizes capacity early on. Speculative Rejection starts with a massive batch that fills memory, then drops unpromising candidates (step-down pattern) to prevent overflow as sequence length increases.
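The memory pattern described above can be illustrated with a rough capacity estimate. This is a minimal sketch, assuming per-sequence memory grows linearly with sequence length (as it does with a key-value cache); the function name `max_batch_size` and all numbers are hypothetical and not taken from the paper.

```python
def max_batch_size(memory_budget_bytes: int,
                   bytes_per_token: int,
                   seq_len: int) -> int:
    """Largest batch whose per-token state fits in the memory budget.

    Assumes each sequence costs bytes_per_token * seq_len bytes
    (an illustrative linear model, not a measured figure).
    """
    return memory_budget_bytes // (bytes_per_token * seq_len)


# Hypothetical numbers: 40 GiB free for generation, 512 KiB of state per token.
budget = 40 * 1024**3
per_token = 512 * 1024

early = max_batch_size(budget, per_token, seq_len=16)    # short sequences
late = max_batch_size(budget, per_token, seq_len=512)    # long sequences
print(early, late)
```

Under these assumed numbers, thousands of short sequences fit at the start, but only a small fraction of them fit once sequences grow long, which is why candidates must be dropped as generation proceeds.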

Evaluation Highlights

Achieves reward scores comparable to standard Best-of-N running on 16-32 GPUs while using only a single GPU
Maintains generation quality (win-rate) comparable to Best-of-N with N ranging from 120 to 3840
Saves approximately 85.5% of tokens in motivating examples by early-stopping low-quality trajectories
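To see how early stopping translates into token savings, the arithmetic can be sketched as follows. The pruning schedule here is a made-up toy example, and `tokens_generated` and `fraction_saved` are hypothetical helper names; it is not intended to reproduce the 85.5% figure reported above.

```python
from typing import List, Tuple


def tokens_generated(segments: List[Tuple[int, int]]) -> int:
    """Total tokens for lockstep generation.

    Each segment is (number_of_steps, batch_size_during_those_steps).
    """
    return sum(steps * batch for steps, batch in segments)


def fraction_saved(segments: List[Tuple[int, int]],
                   n_initial: int,
                   max_len: int) -> float:
    """Fraction of tokens avoided relative to completing all n_initial responses."""
    full_cost = n_initial * max_len
    return 1.0 - tokens_generated(segments) / full_cost


# Toy schedule (assumed): 64 candidates for 8 steps, 16 for the next 8, 4 for the last 16.
toy_schedule = [(8, 64), (8, 16), (16, 4)]
saved = fraction_saved(toy_schedule, n_initial=64, max_len=32)  # 0.65625
```

The earlier and more aggressively trajectories are pruned, the larger this fraction becomes, at the risk of discarding candidates that would have scored well once complete.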

Breakthrough Assessment

8/10

Significantly reduces the barrier to entry for strong inference-time alignment, making Best-of-N competitive with complex post-training methods without the massive hardware cost.

⚙️ Technical Details

Problem Definition

Setting: Autoregressive language generation with the goal of maximizing a reward function s(X, Y)

Inputs: Prompt X

Outputs: Response Y selected from generated candidates to maximize reward
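For reference, the standard Best-of-N baseline for this setting can be sketched in a few lines. This is a minimal illustration: `generate` and `reward` are hypothetical callables standing in for the language model and the reward function s(X, Y).

```python
from typing import Callable


def best_of_n(prompt: str,
              generate: Callable[[str], str],
              reward: Callable[[str, str], float],
              n: int) -> str:
    """Generate n complete responses, then return the highest-reward one."""
    candidates = [generate(prompt) for _ in range(n)]
    # Scoring happens only after every response has been generated in full.
    return max(candidates, key=lambda y: reward(prompt, y))


# Toy usage with stand-in components (assumptions, not real models).
_canned = iter(["ok", "a longer answer", "short"])
best = best_of_n("example prompt",
                 generate=lambda x: next(_canned),
                 reward=lambda x, y: float(len(y)),
                 n=3)
```

Every candidate is generated to completion before any is scored, which is the cost that Speculative Rejection aims to avoid.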

Pipeline Flow

Initialization: Start with large batch N
Generation: Auto-regressive token generation
Evaluation: Score partial sequences at decision tokens
Rejection: Prune low-scoring trajectories
Continuation: Continue generation with reduced batch
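The five steps above can be sketched as a toy loop. This is an illustrative sketch rather than the authors' implementation: the fixed `decision_steps`, the fixed `keep_fraction`, and the stand-in `next_token` and `partial_reward` callables are all assumptions made for the example.

```python
import random
from typing import Callable, List, Sequence

Tokens = List[int]


def speculative_rejection(n_initial: int,
                          max_len: int,
                          decision_steps: Sequence[int],
                          keep_fraction: float,
                          next_token: Callable[[Tokens], int],
                          partial_reward: Callable[[Tokens], float]) -> Tokens:
    """Generate in lockstep, prune at decision steps, return the best survivor."""
    # Initialization: start with a large batch of empty sequences.
    batch: List[Tokens] = [[] for _ in range(n_initial)]
    for step in range(1, max_len + 1):
        # Generation: extend every surviving sequence by one token.
        for seq in batch:
            seq.append(next_token(seq))
        if step in decision_steps and len(batch) > 1:
            # Evaluation: score partial sequences, best first.
            ranked = sorted(batch, key=partial_reward, reverse=True)
            # Rejection: keep only the top fraction (at least one sequence).
            n_keep = max(1, int(len(ranked) * keep_fraction))
            # Continuation: the loop proceeds with the reduced batch.
            batch = ranked[:n_keep]
    # Final selection among completed sequences.
    return max(batch, key=partial_reward)


# Toy usage: random tokens 0-9, reward is the mean token value (both assumed).
rng = random.Random(0)
best = speculative_rejection(
    n_initial=64, max_len=32, decision_steps=(8, 16), keep_fraction=0.25,
    next_token=lambda seq: rng.randint(0, 9),
    partial_reward=lambda seq: sum(seq) / len(seq),
)
```

In a real system the generator and reward model would be neural networks running on batched tensors, and the pruning schedule would be tied to available memory rather than fixed in advance.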

System Modules

Generator

Generates candidate response tokens auto-regressively

Model or implementation: LLMs (e.g., Llama variants used in AlpacaFarm)

Reward Model

Scores partial and full sequences to guide pruning and final selection

Model or implementation: Reward model compatible with the generator (e.g., PPO-sim based models)

Rejection Mechanism

Determines which sequences to stop based on reward scores and memory constraints

Model or implementation: Heuristic / Algorithm
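One plausible form of such a rule combines a score-based cutoff with a memory cap. The sketch below is an assumption for illustration only: `select_survivors`, `reject_fraction`, and `memory_capacity` are hypothetical names, and the paper's exact rule may differ.

```python
from typing import List


def select_survivors(partial_scores: List[float],
                     reject_fraction: float,
                     memory_capacity: int) -> List[int]:
    """Return indices of trajectories to keep, in their original order.

    Keeps the highest-scoring trajectories, limited both by the rejection
    fraction and by how many sequences still fit in memory.
    """
    n = len(partial_scores)
    n_keep = n - int(n * reject_fraction)
    n_keep = max(1, min(n_keep, memory_capacity))
    # Indices ordered from highest to lowest partial score.
    order = sorted(range(n), key=lambda i: partial_scores[i], reverse=True)
    return sorted(order[:n_keep])


keep = select_survivors([0.1, 0.9, 0.5, 0.3], reject_fraction=0.5, memory_capacity=10)
# keep == [1, 2]
```

Returning indices rather than sequences makes it straightforward to slice batched tensors (token ids, cached states) with the same selection.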

Novel Architectural Elements

Dynamic batch size adjustment driven by intermediate reward evaluation
Integration of reward model scoring during the generation loop (rather than just at the end)

Modeling

Base Model: Evaluated on AlpacaFarm models (specific variants not detailed in text provided)

Training Method: Inference-time alignment (decoding strategy)

Compute: Single GPU (e.g., A100) for Speculative Rejection vs. 16-32 GPUs for standard Best-of-N at equivalent large N

Comparison to Prior Work

vs. Best-of-N: Shrinks the batch dynamically during generation instead of keeping a constant batch size
vs. Beam Search: Starts with a massive N and completes only a fraction of the candidates, rather than expanding a small set; ranks candidates by reward model scores rather than by likelihood alone
vs. Speculative Decoding: Focuses on reward maximization/alignment rather than just latency reduction of the base model

Limitations

Relies on correlation between partial and final reward scores; if correlation is low, good candidates might be pruned
Requires a reward model that can provide meaningful signals on partial sentences
Performance depends on the ability to determine optimal decision tokens (tau)

Reproducibility

Code: https://github.com/Zanette-Labs/SpeculativeRejection

Evaluated on the standard AlpacaFarm simulation framework.

📊 Experiments & Results

Evaluation Setup

AlpacaFarm simulation

Benchmarks:

AlpacaFarm (Instruction Following / Alignment Simulation)

Metrics:

Reward Score
Win-rate (vs Best-of-N)
Length-controlled win-rate
Computational Efficiency (GPU count equivalent)
Statistical methodology: Not explicitly reported in the paper

Key Results

Efficiency comparisons demonstrate that Speculative Rejection drastically reduces hardware requirements for high-N alignment.

Benchmark	Metric	Baseline	This Paper	Δ
AlpacaFarm	GPU count for comparable reward	16	1	-15
Case Study	Tokens saved (%)	0	85.5	+85.5

Experiment Figures

Scatter plot showing the correlation between partial reward scores (x-axis) and final reward scores (y-axis).
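The relationship shown in such a plot can be quantified with a correlation coefficient. The sketch below computes Pearson correlation between partial and final scores; the choice of Pearson and the sample scores are assumptions for illustration, not values from the paper.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


# Made-up scores for five trajectories: partial (at the decision token) vs. final.
partial_scores = [0.2, 0.5, 0.1, 0.9, 0.6]
final_scores = [0.3, 0.6, 0.2, 0.8, 0.5]
r = pearson(partial_scores, final_scores)
```

A value near 1 suggests that ranking by partial score is a reasonable proxy for ranking by final score; a value near 0 would mean pruning is close to random.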

Main Takeaways

Partial reward scores are sufficiently correlated with final scores to serve as effective pruning signals.
Speculative Rejection enables the benefits of massive-N (N > 1000) Best-of-N alignment on a single GPU.
The method remains computationally viable where standard Best-of-N is not: a single GPU reaches rewards that would otherwise require 16-32 GPUs, a 16x-32x reduction in hardware.

📚 Prerequisite Knowledge

Prerequisites

Autoregressive generation
Best-of-N decoding
Reward Modeling
KL Divergence

Key Terms

Best-of-N: A decoding strategy that generates N independent responses and selects the one with the highest reward model score

Inference-time alignment: Aligning LLM outputs to human preferences during the decoding phase rather than by modifying model weights via training

Speculative Rejection: The proposed method of dynamically pruning generation trajectories that show low intermediate reward scores

Reward Model: A model trained to output a scalar score representing the quality or safety of a text sequence

Decision token: The specific token index (tau) at which the system evaluates partial responses to decide which to prune

Post-training: Alignment steps like RLHF or DPO performed after pre-training to refine model behavior

RLHF: Reinforcement Learning from Human Feedback, a common post-training alignment method

DPO: Direct Preference Optimization, a post-training method that optimizes preferences without explicit reward modeling