LLMs Can Learn to Reason Via Off-Policy RL

📝 Paper Summary

Reinforcement Learning for LLMs Post-training Optimization

OAPL is an off-policy RL algorithm that enables reasoning models to learn effectively from lagged inference data by optimizing a regression objective rather than using unstable importance sampling.

Core Problem

Standard RL post-training (like GRPO) assumes on-policy data, but system constraints (distributed lag, different kernels) cause the inference policy to differ from the training policy, breaking this assumption.

Why it matters:

Infrastructure mismatches (e.g., vLLM vs HuggingFace) cause log-probability discrepancies, destabilizing training
Synchronous training is slow; forcing the inference engine to stay perfectly synced with the trainer introduces bottlenecks
Current fixes like Importance Sampling (IS) introduce high variance or require complex heuristics like clipping and token deletion
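
To illustrate why importance ratios become hard to control on long reasoning traces, the sketch below computes a sequence-level importance ratio from per-token log-probabilities. The numbers (1000 tokens, a 0.01 log-probability gap per token) and the function name are hypothetical and not taken from the paper.

```python
import math

def sequence_is_ratio(logp_trainer, logp_inference):
    """Sequence-level importance ratio: exp of the summed per-token log-prob gaps."""
    return math.exp(sum(t - i for t, i in zip(logp_trainer, logp_inference)))

# Hypothetical rollout: 1000 tokens, trainer log-probs 0.01 above the inference engine's.
tokens = 1000
gap = 0.01
logp_inference = [-2.0] * tokens
logp_trainer = [-2.0 + gap] * tokens

ratio = sequence_is_ratio(logp_trainer, logp_inference)
print(f"sequence-level ratio: {ratio:.1f}")  # roughly exp(10)
```

A per-token gap that looks negligible compounds to a weight of roughly exp(10) over the full sequence, which is the kind of blow-up that clipping and token-deletion heuristics are meant to contain.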

Concrete Example: In an asynchronous setup, the inference engine generating math solutions might be 400 gradient steps behind the trainer. GRPO would fail or require massive importance sampling corrections because the data is 'stale', whereas OAPL treats this lag as a natural part of the learning process.

Key Novelty

Optimal Advantage-based Policy Optimization with Lagged Inference (OAPL)

Treats the policy mismatch as a KL-regularized RL problem where the optimal solution has a known closed form
Derives a squared regression loss that pulls the training policy toward the optimal policy without needing importance sampling ratios
Intentionally lags the inference policy, syncing it with the trainer only infrequently, to maximize generation throughput without instability
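
The closed form mentioned above is the standard solution of KL-regularized RL. A sketch of the derivation follows, assuming β acts as an inverse temperature, which is the convention implied by the V* estimate in the Technical Details; the paper's own placement of β may differ.

```latex
\begin{align*}
\pi^* &= \arg\max_{\pi}\; \mathbb{E}_{y \sim \pi(\cdot\mid x)}\big[r(x,y)\big]
        - \tfrac{1}{\beta}\,\mathrm{KL}\big(\pi(\cdot\mid x)\,\|\,\pi_{\text{vllm}}(\cdot\mid x)\big) \\
\pi^*(y\mid x) &= \pi_{\text{vllm}}(y\mid x)\,\exp\!\big(\beta\,[\,r(x,y) - V^*(x)\,]\big) \\
V^*(x) &= \tfrac{1}{\beta}\,\ln \mathbb{E}_{y\sim\pi_{\text{vllm}}(\cdot\mid x)}\big[\exp(\beta\, r(x,y))\big] \\
\ln \pi^*(y\mid x) - \ln \pi_{\text{vllm}}(y\mid x) &= \beta\,\big[r(x,y) - V^*(x)\big]
\end{align*}
```

The last line is what makes a regression loss possible: the log-ratio of the optimal policy to the lagged inference policy is a known target, so the trainer can fit ln π_θ − ln π_vllm to it directly, with no importance ratio in the objective.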

Architecture

The asynchronous training loop of OAPL.

Evaluation Highlights

Matches performance of DeepCoder on LiveCodeBench while using ~3x fewer generations during training
Maintains effective training with policy lags of >400 gradient steps (100x more off-policy than prior approaches)
Outperforms GRPO with Importance Sampling on AIME 2025 and HMMT 2025 math benchmarks

Breakthrough Assessment

8/10

Provides a principled, theoretically grounded solution to the practical 'policy lag' problem in distributed LLM training, eliminating the need for brittle importance sampling heuristics.

⚙️ Technical Details

Problem Definition

Setting: KL-regularized Reinforcement Learning with a lagged sampling policy

Inputs: Prompt x, generated rollout y from lagged policy

Outputs: Optimized Policy π

Pipeline Flow

Weight Synchronization: Trainer → Inference Engine
Inference Engine (Lagged) → Data Generation → Replay Buffer
Trainer → Sampling from Buffer → OAPL Update
Repeat Synchronization every L steps
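
The pipeline above can be sketched as a toy simulation. This is a minimal sketch, not the authors' implementation: the function name is hypothetical, integer version counters stand in for model weights, and generation and training run sequentially here although the real system runs them concurrently.

```python
def run_async_loop(total_steps, sync_every, rollouts_per_step=4):
    """Simulate the sync / generate / update cycle with version counters as weights."""
    trainer_version = 0
    inference_version = 0      # lagged copy held by the inference engine
    replay_buffer = []
    max_lag = 0

    for step in range(total_steps):
        if step % sync_every == 0:
            inference_version = trainer_version  # weight synchronization
            replay_buffer.clear()                # drop rollouts from older weights

        # Inference engine generates rollouts with its (possibly stale) weights.
        replay_buffer.extend(
            {"policy_version": inference_version} for _ in range(rollouts_per_step)
        )

        # Trainer samples from the buffer (here: the newest rollouts) and updates.
        batch = replay_buffer[-rollouts_per_step:]
        max_lag = max(max_lag, trainer_version - inference_version)
        trainer_version += 1                     # stand-in for one OAPL update

    return max_lag, trainer_version


max_lag, final_version = run_async_loop(total_steps=10, sync_every=4)
print(max_lag, final_version)
```

With a synchronization interval of L steps, the largest lag the trainer sees in this toy loop is L − 1 updates, reached just before the next synchronization.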

System Modules

Trainer

Updates the policy parameters using the OAPL objective

Model or implementation: LLM (e.g., HuggingFace model)

Inference Engine

Generates rollouts efficiently using a potentially stale version of weights

Model or implementation: vLLM (lagged copy of Trainer)

Replay Buffer

Stores generated data; cleared upon weight synchronization

Model or implementation: Data Structure

Novel Architectural Elements

Use of a deliberately lagged inference policy (synced only every L steps) as the explicit KL reference in the loss function
Regression-based RL update module replacing the standard PPO/GRPO clipped surrogate objective

Modeling

Base Model: Not explicitly specified in text snippet (DeepSeek-R1 referenced as context)

Training Method: OAPL (Optimal Advantage-based Policy Optimization with Lagged Inference)

Objective Functions:

Purpose: Minimize the squared difference between the policy's log-ratio and the optimal advantage.

Formally: L(θ) = E[(A*(x,y) - ln(π_θ(y|x)) + ln(π_vllm(y|x)))^2]

Purpose: Estimate the optimal value function V* for advantage calculation.

Formally: V^*(x) ≈ (1/β) * ln( (1/G) * Σ exp(β * r(x, y_i)) )
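
The two formulas above can be combined into a small pure-Python sketch for one prompt and its group of G rollouts. The function names are hypothetical. The summary does not define A*(x,y) explicitly, so the code assumes A* = β · (r − V*), the value for which the closed-form optimal policy gives zero loss under the V* convention above; the paper's exact scaling may differ.

```python
import math

def estimate_v_star(rewards, beta):
    """V*(x) ~= (1/beta) * ln((1/G) * sum_i exp(beta * r_i)), computed stably."""
    m = max(rewards)
    # Subtract the max reward before exponentiating to avoid overflow.
    mean_exp = sum(math.exp(beta * (r - m)) for r in rewards) / len(rewards)
    return m + math.log(mean_exp) / beta

def oapl_loss(rewards, logp_theta, logp_vllm, beta):
    """Mean of (A* - ln pi_theta + ln pi_vllm)^2 over one prompt's rollouts."""
    v_star = estimate_v_star(rewards, beta)
    total = 0.0
    for r, lp_theta, lp_vllm in zip(rewards, logp_theta, logp_vllm):
        advantage = beta * (r - v_star)  # assumption: A* = beta * (r - V*)
        residual = advantage - lp_theta + lp_vllm
        total += residual ** 2
    return total / len(rewards)


# Hypothetical group: one correct rollout out of four, sequence log-probs of -5.
rewards = [1.0, 0.0, 0.0, 0.0]
logp_vllm = [-5.0, -5.0, -5.0, -5.0]
print(estimate_v_star(rewards, beta=1.0))
print(oapl_loss(rewards, logp_theta=logp_vllm, logp_vllm=logp_vllm, beta=1.0))
```

The inputs are sequence-level log-probabilities, that is, sums over token log-probabilities. No probability ratio is exponentiated anywhere, so a large gap between trainer and inference engine enlarges the residual but does not produce exploding weights.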

Key Hyperparameters:

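L: Synchronization interval (number of trainer steps between weight syncs to the inference engine)
β (beta): Smoothing parameter for the V* estimate; as β → 0 the estimate approaches the mean reward of the group, and as β grows it approaches the maximum reward in the group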

Compute: Not reported in the paper

Comparison to Prior Work

vs. GRPO with IS: OAPL avoids variance-inducing importance ratios and clipping heuristics entirely
vs. A*PO: OAPL treats the reference policy as the dynamic (lagged) inference policy rather than a static pre-trained model
vs. PPO: OAPL uses a regression loss derived from the closed-form solution of KL-regularized RL rather than a surrogate objective

Limitations

Depends on the assumption that the lagged inference policy has non-zero probability of generating correct solutions
Requires clearing the replay buffer at every synchronization step to ensure consistent value estimation

Reproducibility

Code: https://github.com/danieldritter/OAPL

The paper details the exact loss function and the value estimation method. Datasets (AIME, LiveCodeBench) are public.

📊 Experiments & Results

Evaluation Setup

Post-training of reasoning LLMs on math and code generation tasks

Benchmarks:

AIME 2025 (Competition Math)
HMMT 2025 (Feb/Nov) (Competition Math)
LiveCodeBench v5 (Code Generation)

Metrics:

Pass@k (k=1 to 256)
Statistical methodology: Unbiased estimator for Pass@k (Chen et al., 2021) used with 10 (math) or 20 (code) independent rollouts
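
For reference, the unbiased Pass@k estimator of Chen et al. (2021) can be sketched as follows, where n is the number of rollouts sampled for a problem and c is the number that are correct. The function name is illustrative; the estimator is defined only for k ≤ n.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k from n samples with c correct: 1 - C(n-c, k) / C(n, k)."""
    if k > n:
        raise ValueError("k must not exceed the number of samples n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical problem: 10 rollouts, 3 of them correct.
print(pass_at_k(n=10, c=3, k=1))
print(pass_at_k(n=10, c=3, k=5))
```

The per-problem values are then averaged over the benchmark to give the reported Pass@k.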

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
LiveCodeBench v5	Training generations (relative; OAPL = 1.0, baseline = DeepCoder)	3.0	1.0	-2.0
LiveCodeBench v5	Max policy lag (gradient steps; baseline = prior approaches)	4	400	+396

Experiment Figures

Performance comparison on math benchmarks (AIME 25, HMMT 25) using Pass@k metrics.

Main Takeaways

OAPL outperforms GRPO with Importance Sampling across multiple math competition benchmarks (AIME, HMMT, BRUMO).
The method enables stable training even when the inference policy is significantly lagged (up to 400 gradient updates behind), allowing for highly asynchronous and efficient architectures.
Unlike GRPO, OAPL does not suffer from entropy collapse and shows better test-time scaling (Pass@k improvements up to k=256).
Being strictly on-policy is not necessary for effective RL post-training of reasoning models.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (Policy Gradients)
KL Divergence
Importance Sampling

Key Terms

GRPO: Group Relative Policy Optimization—an RL algorithm that normalizes rewards within a group of outputs for the same prompt to reduce variance

PPO: Proximal Policy Optimization—an RL method that constrains policy updates to prevent destructive large steps

Importance Sampling: A technique to estimate properties of a target distribution using samples from a different distribution by weighting samples by their likelihood ratio

Off-policy RL: Learning an optimal policy using data generated by a different (often older or exploratory) behavior policy

Pass@k: A metric measuring the probability that at least one of k generated solutions is correct

vLLM: A high-throughput library for LLM inference and serving

KL regularization: Adding a penalty term based on the Kullback-Leibler divergence to keep the learned policy close to a reference distribution

Policy Lag: The difference in weights/behavior between the model generating data (inference) and the model being updated (trainer) in asynchronous setups