Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling

📝 Paper Summary

Test-Time Scaling (TTS), Reasoning, Process Reward Models (PRMs)

By integrating reward signals into compute-optimal scaling strategies and using absolute difficulty thresholds, significantly smaller models (e.g., 3B) can outperform massive models (e.g., 405B) on complex reasoning tasks.

Core Problem

Current Test-Time Scaling (TTS) methods lack systematic analysis of how policy models, Process Reward Models (PRMs), and problem difficulty interact, often using ineffective difficulty metrics (quantiles) and ignoring reward influence during optimization.

Why it matters:

Blindly scaling compute at test time is inefficient if the search strategy doesn't adapt to the specific model's capability and the problem's hardness
Existing approaches often rely on offline PRMs that suffer from distribution shift, leading to sub-optimal compute allocation
Understanding these dynamics allows smaller, more efficient models to match or beat state-of-the-art closed-source models (like o1) without massive pre-training costs

Concrete Example: When using Llama-3.1-8B with a PRM trained on Mistral (RLHFlow-Mistral), the PRM erroneously assigns high rewards to short, incorrect responses due to distribution shift. A standard TTS strategy might select these short answers, whereas a 'reward-aware' strategy would adjust the search budget or method to account for this bias.

Key Novelty

Reward-Aware Compute-Optimal Test-Time Scaling

Formulates the scaling strategy optimization to explicitly condition on the reward function (PRM), not just the policy model and compute budget
Replaces relative difficulty metrics (quantiles) with absolute accuracy thresholds to better categorize problems across models with vastly different baseline capabilities
Demonstrates that the optimal search method (Best-of-N vs. Beam Search vs. Tree Search) flips depending on model size: small models need step-by-step verification, while large models often do better with simple sampling

Architecture

Schematic of the three Test-Time Scaling (TTS) methods evaluated: Best-of-N, Beam Search, and Diverse Verifier Tree Search (DVTS).

Evaluation Highlights

A 3B parameter model (Qwen2.5-Math) surpasses a 405B parameter model (Llama-3.1) on the MATH-500 benchmark using the proposed compute-optimal TTS
A 7B model outperforms both OpenAI o1 and DeepSeek-R1 on MATH-500 and AIME24 tasks while maintaining higher inference efficiency
A 1B model surpasses a 405B model on MATH-500 when using the optimal combination of Policy, PRM, and search strategy

Breakthrough Assessment

8/10

Strong empirical evidence challenging the 'bigger is better' dogma by showing massive efficiency gains via intelligent inference scaling. The paper also systematically analyzes the interaction between PRMs and policy models.

⚙️ Technical Details

Problem Definition

Setting: Reasoning as a Markov Decision Process (MDP) tuple (S, A, P, R, gamma)

Inputs: Prompt x (initial state s1)

Outputs: Final answer derived from trajectory tau = {a1, a2, ..., aH}

Pipeline Flow

Input Prompt -> Policy Model -> [Scaling Strategy] -> Process Reward Model (Verifier) -> Selected Output

System Modules

Policy Model

Generates reasoning steps or full trajectories based on the prompt

Model or implementation: Llama 3 (8B, 70B, 405B) or Qwen2.5 (0.5B to 72B) variants

Process Reward Model (PRM)

Scores individual steps (step-level reward) or full trajectories to guide the search

Model or implementation: Various: Math-Shepherd, RLHFlow (Mistral/Deepseek), Skywork, Qwen2.5-Math-PRM

Scaling Strategy

Determines how to generate and select answers (Best-of-N, Beam Search, DVTS)

Model or implementation: Algorithm (Non-learnable)
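As an illustration of the simplest of the three strategies, Best-of-N can be sketched in a few lines. This is a minimal sketch, not the paper's implementation: `toy_policy` and `toy_verifier` are hypothetical stand-ins for a real policy model and a real PRM, and the function name and signature are assumptions made for the example.

```python
import random
from typing import Callable, Tuple


def best_of_n(
    prompt: str,
    sample: Callable[[str, random.Random], str],
    score: Callable[[str, str], float],
    n: int,
    seed: int = 0,
) -> Tuple[str, float]:
    """Draw n complete candidates from the policy and keep the highest-scoring one."""
    if n < 1:
        raise ValueError("n must be at least 1")
    rng = random.Random(seed)
    # The policy produces n full solutions independently of each other.
    candidates = [sample(prompt, rng) for _ in range(n)]
    # The verifier assigns one scalar score to each full solution.
    scored = [(score(prompt, c), c) for c in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score


# Hypothetical stand-ins (assumptions for illustration only).
def toy_policy(prompt: str, rng: random.Random) -> str:
    """Pretend policy: guesses one of three answers at random."""
    return str(rng.choice([3, 4, 5]))


def toy_verifier(prompt: str, response: str) -> float:
    """Pretend PRM: prefers the answer '4'."""
    return 1.0 if response == "4" else 0.1


answer, reward = best_of_n("What is 2 + 2?", toy_policy, toy_verifier, n=16)
print(answer, reward)
```

The compute budget here is just `n`: every extra sample costs one more policy generation and one more verifier call, which is why the quality of the verifier matters as much as the number of samples.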

Novel Architectural Elements

Reward-Aware Compute-Optimal Scaling: Modifies the target optimization objective Target(theta, N, x) to Target(theta, N, x, R), explicitly including the reward function R as a variable in strategy selection
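Written out, the change amounts to adding R to what the output distribution is conditioned on. The block below is a sketch in the notation of the line above. The ground-truth answer y*(x), the indicator function, and the expectation follow the usual compute-optimal formulation and are assumptions here, since this summary does not spell them out.

```latex
% Compute-optimal strategy without the reward function (assumed standard form)
\theta^{*}_{x,\,y^{*}(x)}(N)
  = \arg\max_{\theta}\;
    \mathbb{E}_{y \sim \mathrm{Target}(\theta, N, x)}
    \left[ \mathbb{1}_{y = y^{*}(x)} \right]

% Reward-aware variant: the output distribution also depends on R
\theta^{*}_{x,\,y^{*}(x),\,R}(N)
  = \arg\max_{\theta}\;
    \mathbb{E}_{y \sim \mathrm{Target}(\theta, N, x, R)}
    \left[ \mathbb{1}_{y = y^{*}(x)} \right]
```

Here theta stands for the strategy hyperparameters being chosen, N for the compute budget, and x for the prompt. The only difference between the two lines is the extra argument R.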

Modeling

Base Model: Evaluates multiple base models: Llama-3-Instruct series and Qwen2.5-Instruct series

Comparison to Prior Work

vs. Snell et al. (2024): This paper introduces 'Reward-Aware' scaling (integrating PRM choice into optimization) and uses absolute difficulty thresholds instead of relative quantiles. It also evaluates cross-model PRM generalization (offline setting).
vs. OpenAI o1 / DeepSeek-R1: Achieves superior performance with significantly smaller models (7B) via explicit search strategies rather than internal chain-of-thought training [compared as baselines in paper]

Limitations

PRM generalization is poor; verifiers trained on one model often fail to score another model accurately (OOD issues).
Compute-optimal strategies are highly sensitive to the specific combination of Policy, PRM, and Difficulty, requiring extensive tuning.
Training custom PRMs for every policy model is computationally expensive (O(N^2) complexity if done fully).

Reproducibility

Code: https://github.com/openreasoner/openr

Uses public datasets (MATH-500, AIME24) and open-source models (Llama 3, Qwen2.5, Skywork PRMs, etc.). The summary does not point to exact prompt templates or scripts for reproducing the 'compute-optimal' curves; they are presumably part of the OpenR framework.

📊 Experiments & Results

Evaluation Setup

Mathematical reasoning on competition-level problems

Benchmarks:

MATH-500 (Mathematical Problem Solving)
AIME24 (High-School Mathematics Competition)

Metrics:

Pass@1 Accuracy
Pass@k Accuracy
Statistical methodology: Not explicitly reported in the paper
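The summary does not say how Pass@k is estimated. A widely used choice is the unbiased estimator 1 - C(n-c, k) / C(n, k), computed from n samples of which c are correct. The sketch below assumes that estimator; whether the paper uses exactly this form is not confirmed by the source.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples is correct.

    n: total samples drawn for the problem
    c: number of those samples that are correct
    k: number of attempts allowed
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: any k-subset contains a correct one.
        return 1.0
    # 1 minus the probability that all k drawn samples are incorrect.
    return 1.0 - comb(n - c, k) / comb(n, k)


print(pass_at_k(n=10, c=3, k=1))
print(pass_at_k(n=10, c=3, k=5))
```

With k = 1 the estimator reduces to c / n, the fraction of correct samples, which matches the Pass@1 definition used in this summary.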

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Analysis of PRM bias shows that different PRMs drastically alter the generation length and efficiency even under the same compute budget, influencing optimal strategy selection.
Training Data Analysis	Average Tokens per Response	236.8	569.2	+332.4
Evaluation of voting methods shows that higher-quality PRMs (Skywork) benefit from voting mechanisms (PRM-Vote), while others are less sensitive.
MATH-500	Pass@1	90.6	92.0	+1.4
MATH-500	Pass@1	90.0	90.2	+0.2
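To make the voting comparison concrete, the sketch below contrasts plain majority voting with a score-weighted vote. It assumes that PRM-Vote means summing PRM scores over all samples that share the same final answer; that reading, along with the function names and the toy scores, is an assumption made for illustration.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def majority_vote(answers: List[str]) -> str:
    """Pick the most frequent final answer, ignoring verifier scores."""
    return Counter(answers).most_common(1)[0][0]


def prm_vote(scored: List[Tuple[str, float]]) -> str:
    """Pick the final answer whose samples have the largest total score."""
    totals: Dict[str, float] = defaultdict(float)
    for answer, score in scored:
        totals[answer] += score
    return max(totals, key=lambda a: totals[a])


# Toy samples: (final answer, verifier score). Values are made up.
samples = [("12", 0.2), ("12", 0.2), ("12", 0.1), ("15", 0.9)]

print(majority_vote([a for a, _ in samples]))  # frequency only
print(prm_vote(samples))                       # frequency weighted by score
```

In this toy case the two rules disagree: "12" appears most often, but the single "15" sample carries more total score. A weighted vote only helps when the scores are trustworthy, which is consistent with the observation that higher-quality PRMs benefit more from it.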

Experiment Figures

Performance of Llama-3.1-8B-Instruct using different search strategies and PRMs across compute budgets.

Optimal TTS methods for Qwen2.5 models of varying sizes (0.5B to 72B).

Main Takeaways

Optimal TTS strategy is model-dependent: Small models (<7B) benefit from Search methods (Beam/DVTS) to verify steps, while Large models (>70B) prefer Best-of-N as they have strong intrinsic reasoning but need diversity.
Reward-Awareness is critical: The choice of PRM dictates the optimal compute budget and search method. PRMs with higher process supervision ability (fitted as Y = 7.66 log(X) + 44.31) yield better TTS performance.
Absolute Difficulty matters: Using absolute accuracy thresholds (Easy 50-100%, Medium 10-50%, Hard 0-10%) is more effective for scaling analysis than relative quantiles, which skew results for capable models.
Smaller models can punch up: A 3B model with optimal TTS beats a 405B model, and a 7B model beats o1/DeepSeek-R1, proving inference compute can substitute for parameter count.
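The absolute difficulty thresholds in the takeaways above can be expressed as a small classifier over a model's Pass@1 on a given problem. This is a minimal sketch: the function name is hypothetical, and which bucket the boundary values 10% and 50% fall into is an assumption, since the summary only gives the ranges.

```python
def difficulty_bucket(pass_at_1: float) -> str:
    """Map a policy model's Pass@1 on one problem to an absolute difficulty level.

    Thresholds follow the ranges in the summary: easy 50-100%, medium 10-50%,
    hard 0-10%. Assigning the boundaries to the easier bucket is an assumption.
    """
    if not 0.0 <= pass_at_1 <= 1.0:
        raise ValueError("pass_at_1 must be in [0, 1]")
    if pass_at_1 >= 0.5:
        return "easy"
    if pass_at_1 >= 0.1:
        return "medium"
    return "hard"


# Per-problem Pass@1 for a hypothetical strong model (made-up values).
strong_model = [0.95, 0.80, 0.70, 0.65, 0.55]
print([difficulty_bucket(p) for p in strong_model])
```

For a strong model like the hypothetical one here, every problem lands in the same bucket, whereas a quantile split would still force the problems into separate difficulty levels. That is the skew the absolute thresholds are meant to avoid.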

📚 Prerequisite Knowledge

Prerequisites

Markov Decision Process (MDP)
Test-Time Scaling (TTS)
Process Reward Models (PRM)
Beam Search

Key Terms

Test-Time Scaling (TTS): Improving model performance by increasing computation during inference (e.g., generating more samples or searching deeper) rather than training larger models

Process Reward Model (PRM): A model that evaluates and scores intermediate steps of a reasoning chain, rather than just the final answer

Best-of-N (BoN): A sampling strategy where the model generates N complete solutions, and the best one is selected based on a scoring function

DVTS: Diverse Verifier Tree Search—an extension of beam search that explores independent subtrees to increase solution diversity
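The difference between beam search and DVTS can be sketched on a toy problem. The code below is an illustration under stated assumptions, not the paper's or OpenR's implementation: `expand`, `score`, and `is_final` are hypothetical callbacks standing in for the policy's step generator, the PRM, and a termination check, and the toy task (pick three numbers that sum to 7) replaces real reasoning steps.

```python
from typing import Callable, List, Sequence, Tuple

Path = Tuple[int, ...]  # a partial solution: the steps taken so far


def beam_search(
    expand: Callable[[Path], Sequence[int]],
    score: Callable[[Path], float],
    is_final: Callable[[Path], bool],
    beam_width: int,
    max_steps: int,
    roots: Sequence[Path] = ((),),
) -> Path:
    """Grow paths step by step, keeping only the top-scoring beam_width paths."""
    beams: List[Path] = list(roots)
    for _ in range(max_steps):
        candidates: List[Path] = []
        for path in beams:
            if is_final(path):
                candidates.append(path)  # finished paths compete unchanged
            else:
                candidates.extend(path + (step,) for step in expand(path))
        if not candidates:
            break
        candidates.sort(key=score, reverse=True)
        beams = candidates[:beam_width]
        if all(is_final(p) for p in beams):
            break
    return max(beams, key=score)


def dvts(expand, score, is_final, n_subtrees: int, beam_width: int, max_steps: int) -> Path:
    """Run an independent search under each of several first steps, then pick the best."""
    first_steps = list(expand(()))[:n_subtrees]
    finalists = [
        beam_search(expand, score, is_final, beam_width, max_steps - 1, roots=[(s,)])
        for s in first_steps
    ]
    return max(finalists, key=score)


# Toy task (assumption): choose three numbers from {1, 2, 3} that sum to 7.
TARGET = 7
expand = lambda path: (1, 2, 3)                # stand-in for policy step proposals
score = lambda path: -abs(TARGET - sum(path))  # stand-in for a step-level PRM
is_final = lambda path: len(path) == 3

print(beam_search(expand, score, is_final, beam_width=2, max_steps=3))
print(dvts(expand, score, is_final, n_subtrees=2, beam_width=1, max_steps=3))
```

In plain beam search all surviving paths compete in one pool, so they can collapse onto a single prefix. In the DVTS sketch each subtree keeps its own first step, which preserves diversity at the cost of spending budget on subtrees the verifier rates lower early on.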

RLHFlow: A series of open-source Process Reward Models trained on mathematical reasoning data

MDP: Markov Decision Process—a mathematical framework for modeling decision-making where outcomes are partly random and partly under the control of a decision maker

Pass@1: The probability that the model generates a correct answer in a single attempt

OOD: Out-of-Distribution—when a model encounters data significantly different from what it was trained on