This paper analyzes inference-time LLM steering through particle filtering theory, establishing bounds for Sequential Monte Carlo and proving fundamental limits on the number of particles required for myopic methods.
Core Problem
Inference-time interventions like aggregation and pruning are currently ad hoc, lacking a principled theoretical framework to explain their accuracy-cost tradeoffs or guide algorithm design.
Why it matters:
Current methods like Best-of-N or Tree Search are empirically effective but theoretically opaque, preventing systematic improvement
We lack understanding of how properties of the Process Reward Model (PRM), such as action-level coverage or accuracy, quantitatively impact final generation quality
Standard particle filtering methods can fail even with perfect reward models if not carefully tuned, wasting computational resources
Concrete Example: Even if a Process Reward Model (PRM) is perfectly accurate, standard Sequential Monte Carlo (SMC) theoretically requires roughly √H particles (for horizon H) to succeed, due to interference between particles during resampling. A simple single-particle greedy method would succeed here, highlighting a pathology in standard parallel approaches.
Key Novelty
Particle Filtering Framework for LLM Inference
Models guided generation as a sampling problem from a target distribution defined by the base model and a process reward function
Applies Sequential Monte Carlo (SMC) theory to derive error bounds based on 'action-level coverage' (how well the base model covers high-reward tokens) and PRM accuracy
Introduces SMC with Rejection Sampling (SMC-RS) to fix theoretical pathologies where standard SMC fails despite having good reward models
Architecture
Pseudocode for the Sequential Monte Carlo (SMC) algorithm adapted for language models
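The paper's Algorithm 1 is not reproduced in these notes; the following is a minimal, hypothetical Python sketch of the propose/weight/resample loop it describes, with toy stand-ins for π_ref and V_hat (`base_model` and `prm` are illustrative names, not the paper's models):

```python
# Hedged sketch of SMC guided generation: propose, weight, resample, repeat.
import random

def base_model(prefix):
    # Toy proposal distribution (stand-in for pi_ref) over a two-token vocabulary.
    return random.choice(["a", "b"])

def prm(prefix):
    # Toy PRM (stand-in for V_hat): rewards prefixes with more "a" tokens.
    return 1.0 + prefix.count("a")

def smc(n_particles=8, horizon=5, seed=0):
    random.seed(seed)
    particles = ["" for _ in range(n_particles)]  # Initialization: empty sequences
    for _ in range(horizon):
        # Proposal: extend each particle with a token from the base model.
        extended = [p + base_model(p) for p in particles]
        # Weighting: score each extension with the PRM.
        weights = [prm(p) for p in extended]
        # Resampling: draw N particles with probability proportional to weight.
        particles = random.choices(extended, weights=weights, k=n_particles)
    return particles
```

Note that the shared `random.choices` resampling step is exactly where inter-particle interference arises: all particles compete in one draw, which is the pathology SMC-RS is designed to remove.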
Evaluation Highlights
SMC with N=32 particles consistently improves accuracy over Best-of-N (with N=32) across the vast majority of problems in the Math500 benchmark
Theoretical proof establishes that any myopic particle filtering method requires at least logarithmic particles (in horizon H) to maintain non-trivial coverage, defining a fundamental efficiency limit
Empirical experiments on a prompt-switching task validate that sampling error correlates strongly with theoretical predictors like action-level coverage and KL-divergence of the PRM
Breakthrough Assessment
7/10
Provides a rigorous theoretical foundation for widely used heuristic methods. While the algorithmic improvements (SMC-RS) are theoretical, the formalization of 'myopic limits' is a significant contribution to understanding inference scaling.
⚙️ Technical Details
Problem Definition
Setting: Sampling from the tilted distribution π*_H(a_1:H) ∝ π_ref(a_1:H) · r*(a_1:H), given a base model π_ref and a Process Reward Model V_hat that estimates the value of partial sequences
Inputs: Prompt/prefix a_1:h, Base Model π_ref, Process Reward Model V_hat
Outputs: Sampled sequence a_1:H (typically a solution to a reasoning problem)
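In twisted-SMC notation, the target and one standard choice of incremental importance weight can be written as follows (a sketch under the usual ratio-of-values convention; the paper's exact weighting in Algorithm 1 may differ):

```latex
\pi^{*}_{H}(a_{1:H}) \;\propto\; \pi_{\mathrm{ref}}(a_{1:H})\, r^{*}(a_{1:H}),
\qquad
w_h(a_{1:h}) \;=\; \frac{\hat{V}(a_{1:h})}{\hat{V}(a_{1:h-1})}.
```

Under this convention, a perfectly accurate V_hat makes the weights telescope across steps, which is why PRM accuracy enters the error bounds directly.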
Pipeline Flow
Initialization: Start with N particles (empty sequences)
Proposal: Extend each particle by sampling tokens from base model π_ref
Weighting: Score extensions using the PRM V_hat to calculate importance weights
Resampling: Select particles to keep based on weights (Reject/Resample)
Repeat: Continue until horizon H is reached
System Modules
Base Model (π_ref)
Propose next tokens for the particles
Model or implementation: Qwen2.5-1B-Instruct (Math exp) or Qwen3-0.6B (Prompt exp)
Process Reward Model (V_hat)
Estimate the value (expected future reward) of partial generations
Model or implementation: Qwen2.5-Math-PRM-7B (Math exp)
Resampler
Prune low-weight particles and replicate high-weight particles
Model or implementation: Algorithm 1 (SMC) or Algorithm 2 (SMC-RS)
Novel Architectural Elements
SMC-RS: Incorporates a rejection sampling step inside the particle transition to decouple particles, allowing for o(1) error with O(1) particles when the PRM is near-perfect (fixing a flaw in standard SMC)
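A hedged sketch of the rejection-sampling idea, in the spirit of SMC-RS: each particle advances independently by accepting a proposed token with probability proportional to its PRM score, so particles never interact through a shared resampling draw. All names here are illustrative (`rs_transition`, the toy `propose`/`score` functions, and the score bound are assumptions, not the paper's Algorithm 2):

```python
# Hedged sketch: per-particle rejection sampling replaces joint resampling.
import random

def rs_transition(prefix, propose, score, max_score, max_tries=100):
    """Sample one token from the score-tilted proposal via rejection sampling."""
    for _ in range(max_tries):
        token = propose(prefix)
        # Accept with probability score / max_score (assumes a known upper bound).
        if random.random() < score(prefix + token) / max_score:
            return token
    return token  # fall back to the last proposal if acceptance keeps failing

def smc_rs(n_particles=4, horizon=6, seed=1):
    random.seed(seed)
    propose = lambda prefix: random.choice(["a", "b"])   # toy pi_ref
    score = lambda prefix: 1.0 + prefix.count("a")       # toy V_hat
    particles = ["" for _ in range(n_particles)]
    for h in range(horizon):
        max_score = 1.0 + (h + 1)  # valid bound on the toy score at this step
        particles = [p + rs_transition(p, propose, score, max_score)
                     for p in particles]
    return particles
```

Because each particle's transition is an independent exact sample from the tilted step distribution, a single particle can suffice when the PRM is near-perfect, matching the O(1)-particle claim.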
Modeling
Base Model: Qwen2.5-1B-Instruct (Math experiments), Qwen3-0.6B (Prompt-switching experiments)
Training Method: Inference-time algorithm (no training involved)
Compute: SMC runtime is O(N*H) where N is number of particles and H is horizon
Comparison to Prior Work
vs. Best-of-N: SMC interacts with the PRM at every step (sequential) rather than just at the end, allowing early pruning of bad paths
vs. VGB: SMC is a parallel algorithm (O(H) parallel time) while VGB is inherently sequential (O(H^2) time); the SMC analysis strengthens VGB's guarantees [VGB is cited in the paper, but this head-to-head comparison is drawn in these notes]
vs. Standard SMC: Proposed SMC-RS handles the 'perfect PRM' case more efficiently, requiring fewer particles for low-error regimes
Limitations
Theoretical bounds depend on Chi-square divergence, which can be sensitive to tail behavior (though relaxed to coverage in Theorem 3.5)
Super-constant particle counts (at least Ω(log H / log log H) in horizon H) are proven necessary for myopic methods, implying a fundamental efficiency barrier
Empirical disconnect: On math tasks, higher PRM error (divergence) sometimes correlated with *higher* accuracy, suggesting the theoretical framework doesn't fully capture 'correctness' vs 'distributional fit'
Reproducibility
Code availability is not provided in the paper. The paper relies on public models (Qwen series) and standard datasets (Math500, AIME). The experiments involve custom implementations of SMC algorithms described in Algorithms 1 and 2.
📊 Experiments & Results
Evaluation Setup
Evaluation of sampling algorithms on constrained generation tasks
Statistical methodology: Not explicitly reported in the paper
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Math500 | Accuracy vs Best-of-N | See note | See note | Positive |
| Prompt-Switching | Sampling error (LPDP) | N/A | Strong correlation | N/A |
| Theoretical analysis | Minimum particles | 1 | Ω(log H / log log H) | N/A |
Experiment Figures
Scatter plot comparing SMC (y-axis) vs Best-of-N (x-axis) accuracy on Math500 problems
Main Takeaways
SMC improves upon Best-of-N in a non-average sense: it wins on the majority of individual problems in Math500, not just in aggregate metrics
Theoretical predictors (action-level coverage, PRM divergence) accurately predict SMC sampling error in controlled settings (Prompt-Switching), validating the derived bounds
A paradox exists in real-world Math tasks: models with higher theoretical error (Chi-square divergence) sometimes yield higher accuracy, suggesting that optimizing for distributional match is distinct from optimizing for correctness
Myopic methods face a fundamental wall: they cannot be fully efficient (O(1) particles) without lookahead, requiring at least logarithmic particle counts to handle error propagation
📚 Prerequisite Knowledge
Prerequisites
Sequential Monte Carlo (SMC) / Particle Filtering
Process Reward Models (PRMs)
Markov Chains
Kullback-Leibler (KL) and Chi-square divergences
Key Terms
SMC: Sequential Monte Carlo—a class of algorithms used to estimate states of a system by maintaining a set of 'particles' (samples) that are resampled based on weights
PRM: Process Reward Model—a model that estimates the expected future reward (or correctness) of a partial text generation
Myopic method: An algorithm that determines the set of particles at step h using only information available up to step h, without looking ahead to future steps
Action-level coverage: A measure of how well the base language model's proposal distribution covers the tokens preferred by the optimal target distribution
Particle: A single hypothesis or partial generation sequence maintained by the inference algorithm
Best-of-N: A baseline strategy where N independent solutions are generated and the one with the highest reward score is selected
SMC-RS: Sequential Monte Carlo with Rejection Sampling—a novel variant proposed in this paper that uses rejection sampling within the transition step to avoid inter-particle interference
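For contrast with SMC's step-by-step interaction with the PRM, the Best-of-N baseline can be sketched in a few lines (a toy illustration with hypothetical `sample`/`score` stand-ins, not the paper's setup):

```python
# Hedged sketch of Best-of-N: N independent full generations, keep the best.
import random

def best_of_n(sample, score, n=8, seed=2):
    random.seed(seed)
    candidates = [sample() for _ in range(n)]   # N complete, independent samples
    return max(candidates, key=score)           # reward model applied only at the end

# Toy usage: sequences of 5 coin-flip tokens, scored by the count of "a".
sample = lambda: "".join(random.choice("ab") for _ in range(5))
score = lambda s: s.count("a")
best = best_of_n(sample, score)
```

The key structural difference from SMC is that the scorer is consulted once per finished sequence, so bad paths are never pruned early.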