International Conference on Machine Learning (2025)
Reasoning · RL · Benchmark
📝 Paper Summary
Speculative Decoding · Efficient LLM Inference
RSD accelerates LLM reasoning by accepting draft steps that have high reward scores, allowing a controlled bias towards correct answers even if they don't match the target model's exact distribution.
Core Problem
Standard speculative decoding strictly enforces unbiasedness, rejecting valid draft tokens if they don't match the target model's distribution, which wastes computation in complex reasoning tasks where diverse correct paths exist.
Why it matters:
Strict unbiasedness forces the rejection of high-quality draft outputs simply because they differ from the target model's preference, negating efficiency gains.
Long-horizon reasoning tasks (like math or coding) generate many tokens, making inference costs prohibitively high without efficient acceleration.
Existing parallel decoding methods struggle to balance the trade-off between speed and the rigorous accuracy required for multi-step reasoning.
Concrete Example: In a math problem, a small draft model might generate a valid step that the large target model assigns low probability. Standard speculative decoding would reject this valid step to maintain distribution matching, forcing a costly regeneration. RSD detects via the reward model that the step is correct and accepts it, saving compute.
Key Novelty
Reward-Guided Speculative Decoding (RSD)
Replaces the strict probability-matching acceptance criterion of standard speculative decoding with a reward-based criterion.
Uses a process reward model to evaluate draft steps; if a step's reward is above a dynamic threshold, it is accepted regardless of the target model's probability distribution.
Constructs a theoretical mixture distribution that shifts weight between the cheap draft model (for high-reward steps) and the expensive target model (for low-reward steps).
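The contrast between the two acceptance rules can be sketched in a few lines. This is an illustrative simplification, not the paper's exact API: `sd_accept`/`rsd_accept` and the scalar reward/threshold interface are assumptions, and real implementations operate on token or step probabilities from actual models.

```python
import random

def sd_accept(p_target: float, p_draft: float, rng: random.Random) -> bool:
    """Standard speculative decoding: accept the draft token with probability
    min(1, p_M / p_m), which keeps the output distribution exactly equal to
    the target model's (unbiased)."""
    return rng.random() < min(1.0, p_target / p_draft)

def rsd_accept(reward: float, threshold: float) -> bool:
    """RSD (sketch): accept a draft *step* whenever its process reward clears
    a threshold, regardless of how likely the target model found the step."""
    return reward >= threshold

# A valid step that the target model assigns low probability is often
# rejected by sd_accept, but kept by rsd_accept if its reward is high.
```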
Architecture
Overview of the Reward-Guided Speculative Decoding framework. It illustrates the timeline of generating steps: the draft model proposes a step, a reward model evaluates it, and depending on the score, it is either accepted (green path) or rejected and regenerated by the target model (red path).
Evaluation Highlights
Achieves up to 4.4× fewer FLOPs compared to decoding with the target model alone on reasoning benchmarks.
Improves reasoning accuracy by up to +3.5 points on average compared to standard speculative decoding (SD) while being more efficient.
Outperforms standard decoding on hard tasks: +1.6% accuracy on MATH500 using Llama-3-8B-Instruct as draft and Llama-3-70B-Instruct as target.
Breakthrough Assessment
8/10
Significantly relaxes the 'unbiased' constraint of speculative decoding in a theoretically grounded way, unlocking speedups for reasoning tasks where strict distribution matching is less important than correctness.
⚙️ Technical Details
Problem Definition
Setting: Iterative generation of reasoning steps y_{1:n} given input x, using a draft model m and target model M.
Inputs: Prompt x and previous steps y_{1:i-1}.
Outputs: Next step y_i sampled from a dynamic mixture distribution P_RSD.
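One natural form of this mixture, consistent with the summary's description (the weighting function ω and this exact notation are assumed here, not quoted from the paper):

```latex
P_{\mathrm{RSD}}(y_i \mid x, y_{1:i-1})
  = \omega\big(r(y_i)\big)\, m(y_i \mid x, y_{1:i-1})
  + \Big(1 - \mathbb{E}_{y \sim m}\big[\omega\big(r(y)\big)\big]\Big)\, M(y_i \mid x, y_{1:i-1})
```

Here ω ∈ [0, 1] converts the step reward r into an acceptance weight: high-reward steps draw probability mass from the cheap draft model m, and the residual mass falls back to the expensive target model M.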
Pipeline Flow
Draft Model Generation
Reward Evaluation
Acceptance Decision (Thresholding)
Target Model Fallback (if rejected)
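The four pipeline stages above can be sketched as a single loop. The callables `draft_step`, `target_step`, and `reward_fn` are placeholders standing in for the draft LLM, target LLM, and process reward model; their signatures and the threshold value are assumptions for illustration.

```python
def rsd_generate(prompt, draft_step, target_step, reward_fn,
                 tau=0.7, max_steps=32):
    """Sketch of the RSD pipeline: draft -> score -> accept or fall back."""
    steps = []
    for _ in range(max_steps):
        candidate = draft_step(prompt, steps)            # 1. cheap draft proposal
        if candidate is None:                            # draft signals completion
            break
        if reward_fn(prompt, steps, candidate) >= tau:   # 2-3. score + threshold
            steps.append(candidate)                      # accept the draft step
        else:
            steps.append(target_step(prompt, steps))     # 4. target-model fallback
    return steps
```

Because acceptance is decided per step (not per token), one reward-model call can validate an entire line of a solution at once.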
System Modules
Draft Model (Generation)
Generates a candidate reasoning step (proposal)
Model or implementation: Smaller LLM (e.g., Llama-3-8B-Instruct)
Process Reward Model
Scores the quality of the candidate step to determine acceptance
Model or implementation: Reward model (e.g., Llama-3-8B-PRM)
Acceptance Mechanism
Decides whether to keep the draft step based on reward score and threshold
Model or implementation: Threshold function (binary or probabilistic)
Target Model (Generation)
Generates the step if the draft was rejected (fallback)
Model or implementation: Larger LLM (e.g., Llama-3-70B-Instruct)
Step-level speculative decoding (operating on full reasoning steps/lines) rather than token-level
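The "binary or probabilistic" threshold function mentioned above admits a simple sketch. The sigmoid form and the parameters `tau` and `beta` are illustrative assumptions; the paper's exact weighting function may differ.

```python
import math
import random

def binary_weight(reward: float, tau: float = 0.7) -> float:
    """Hard threshold: accept the draft step iff its reward reaches tau."""
    return 1.0 if reward >= tau else 0.0

def probabilistic_weight(reward: float, tau: float = 0.7,
                         beta: float = 10.0) -> float:
    """Soft threshold: acceptance probability rises smoothly with the reward
    (a sigmoid centered at tau; illustrative, not the paper's exact form)."""
    return 1.0 / (1.0 + math.exp(-beta * (reward - tau)))

def accept(reward: float, weight_fn, rng: random.Random) -> bool:
    """Sample the accept/reject decision from the chosen weighting function."""
    return rng.random() < weight_fn(reward)
```

The binary rule gives deterministic accept/reject decisions; the probabilistic rule occasionally accepts borderline steps, trading a little accuracy control for smoother behavior near the threshold.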
Modeling
Base Model: Evaluated with Llama-3-8B-Instruct (draft) and Llama-3-70B-Instruct (target).
Training Method: Inference-time algorithm; no additional training is described in the paper.
Compute: Inference efficiency measured in FLOPs reduction (up to 4.4x) and latency.
Comparison to Prior Work
vs. Speculative Decoding: RSD allows 'biased' outcomes favored by a reward model, whereas SD enforces strict distributional equivalence to the target model.
vs. Biased SD: RSD uses an explicit reward model to guide the bias towards correctness, rather than just relaxing probability thresholds arbitrarily.
vs. Multi-Candidate Speculative Decoding [not cited in paper]: RSD focuses on single-path step acceptance via reward, whereas multi-candidate methods generate trees/batches to find the best match.
Limitations
Relies heavily on the quality of the Process Reward Model; a poor reward model could accept incorrect steps.
Step-level generation assumes the text can be naturally segmented into steps (e.g., newlines in math), which may not apply to all domains.
Requires running a separate reward model (or using the draft model as one), adding some computational overhead compared to pure draft generation.
Code is publicly available at https://github.com/BaohaoLiao/RSD. The paper uses standard open models (Llama-3 family) and benchmark datasets (GSM8K, MATH500, etc.).
📊 Experiments & Results
Evaluation Setup
Reasoning tasks where the model generates multi-step solutions.
Benchmarks:
GSM8K (Grade school math word problems)
MATH500 (challenging math problems; a subset of MATH)
Olympiad Bench (Olympiad-level math and physics problems)
GPQA (Graduate-level science QA)
MMLU STEM (STEM subset of MMLU)
GaoKao-2023-En (English version of Chinese college entrance exam questions)
Metrics:
Accuracy
FLOPs (Computational Cost)
Statistical methodology: Not explicitly reported in the paper
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
| --- | --- | --- | --- | --- |
| Average across benchmarks | Relative FLOPs (target-only = 1.0) | 1.0 | 0.227 | −0.773 (up to 4.4× fewer) |
| Average across benchmarks | Accuracy gain (points) | 0.0 | +3.5 | +3.5 |
| MATH500 | Accuracy | Not reported in the paper | Not reported in the paper | Not reported in the paper |
Experiment Figures
Comparison of process rewards for correct vs. incorrect answers on MATH500.
Main Takeaways
RSD consistently reduces computational cost (FLOPs) compared to using the large target model alone, making it viable for resource-constrained deployment.
Unlike standard speculative decoding which maintains target model accuracy (unbiased), RSD can actually *improve* accuracy by filtering outputs via the reward model.
The method is robust to distribution shifts between draft and target models because the acceptance relies on reward quality, not probability matching.
📚 Prerequisite Knowledge
Prerequisites
Speculative Decoding (draft/verify paradigm)
Rejection Sampling
Process Reward Models (PRMs) for reasoning
Key Terms
Speculative Decoding: An inference technique where a small model drafts tokens that are then verified in parallel by a large model to speed up generation.
Unbiasedness: A property of standard speculative decoding ensuring the final output distribution is mathematically identical to the target model's distribution.
Process Reward Model: A model trained to score intermediate steps of a reasoning chain (e.g., math solution steps) rather than just the final answer.
FLOPs: Floating Point Operations—a measure of computational work used here to quantify efficiency gains.
Rejection Sampling: A statistical method used to sample from a complex distribution by accepting/rejecting samples from a simpler proposal distribution based on a specific criterion.
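To make the rejection-sampling criterion concrete, here is a minimal textbook sampler, unrelated to any specific model in the paper. The target density `2x` on [0, 1] (a Beta(2, 1) distribution), the uniform proposal, and the bound `M = 2` are chosen purely for illustration.

```python
import random

def rejection_sample(target_pdf, proposal_sample, proposal_pdf, M, rng,
                     max_tries=10000):
    """Draw from the proposal and accept each candidate x with probability
    target_pdf(x) / (M * proposal_pdf(x)); accepted samples follow the target."""
    for _ in range(max_tries):
        x = proposal_sample(rng)
        if rng.random() < target_pdf(x) / (M * proposal_pdf(x)):
            return x
    raise RuntimeError("no sample accepted within max_tries")

# Example: target density 2x on [0, 1], uniform proposal, envelope M = 2.
rng = random.Random(42)
sample = rejection_sample(lambda x: 2 * x,        # target pdf
                          lambda r: r.random(),   # uniform proposal sampler
                          lambda x: 1.0,          # uniform proposal pdf
                          2.0, rng)
```

Speculative decoding applies this same accept/reject idea token by token, with the draft model as the proposal and the target model as the target distribution; RSD replaces the probability-ratio criterion with a reward-based one.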