RePO: Replay-Enhanced Policy Optimization

📝 Paper Summary

Reinforcement Learning for LLMs, Mathematical Reasoning

RePO improves the data efficiency of Group Relative Policy Optimization by supplementing on-policy updates with off-policy samples retrieved from a replay buffer using diverse strategies like recency or reward maximization.

Core Problem

Group Relative Policy Optimization (GRPO) is computationally expensive because it requires multiple on-policy samples per step, and its gradients vanish when all samples for a prompt yield identical rewards.

Why it matters:

High computational costs of on-policy sampling limit the scalability of RL for Large Language Models
When an LLM produces samples with uniform rewards (all correct or all incorrect), GRPO estimates zero advantage, providing no learning signal to improve the model
Relying solely on current policy samples leads to data inefficiency and potential overfitting to limited recent experiences

Concrete Example: If the current policy generates 8 outputs for a math problem and all are incorrect (reward 0), every reward equals the group mean, so the relative advantage of each output is 0. The model receives no gradient to correct its behavior, and the compute spent on generation is effectively wasted.
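The zero-advantage case can be reproduced with a minimal sketch of group-relative advantage estimation. It assumes the common formulation in which each reward is centered on the group mean and divided by the group standard deviation; the function name, the use of the population standard deviation, and the small `eps` stabilizer are illustrative assumptions, not details taken from the paper.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Center each reward on the group mean and scale by the group std.

    eps is an assumed stabilizer that avoids division by zero when all
    rewards in the group are identical.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# A group with both correct and incorrect outputs yields a usable signal.
varied = group_relative_advantages([1, 0, 0, 1, 0, 0, 0, 0])

# Uniform groups collapse to all-zero advantages, so no gradient flows.
all_wrong = group_relative_advantages([0] * 8)
all_right = group_relative_advantages([1] * 8)
```

In the varied group, correct outputs receive positive advantages and incorrect ones negative; in both uniform groups every advantage is exactly zero, which is the situation RePO's replayed samples are meant to mitigate.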

Key Novelty

Replay-Enhanced Policy Optimization (RePO)

Integrates an off-policy update term into the GRPO objective, allowing the model to learn from previously generated samples stored in a replay buffer
Employs diverse replay strategies (e.g., maximizing reward, variance, or recency) to select the most effective past samples for current optimization
Uses a 'Split' advantage estimation strategy that calculates advantages separately for on-policy and off-policy batches to prevent interference
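A replay buffer with pluggable retrieval strategies could look roughly like the sketch below. The `Sample` fields, the `ReplayBuffer` class, and the strategy names `"recency"` and `"reward"` are hypothetical and chosen for illustration; the variance-based strategy is omitted, and the actual data structures in the RePO codebase may differ.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    prompt: str
    output: str
    reward: float
    logprob: float  # log-probability under the policy that generated it
    step: int       # training step at which it was generated

class ReplayBuffer:
    """Stores past samples per prompt and retrieves k of them by strategy."""

    def __init__(self):
        self._by_prompt = {}

    def add(self, sample):
        self._by_prompt.setdefault(sample.prompt, []).append(sample)

    def retrieve(self, prompt, k, strategy="recency"):
        pool = self._by_prompt.get(prompt, [])
        if strategy == "recency":
            # Prefer samples produced by the most recent policies.
            ranked = sorted(pool, key=lambda s: s.step, reverse=True)
        elif strategy == "reward":
            # Prefer the highest-reward samples seen so far.
            ranked = sorted(pool, key=lambda s: s.reward, reverse=True)
        else:
            raise ValueError(f"unknown strategy: {strategy}")
        return ranked[:k]

buffer = ReplayBuffer()
for step, reward in enumerate([1.0, 0.0, 1.0, 0.0]):
    buffer.add(Sample("q1", f"answer-{step}", reward, -5.0, step))

recent = buffer.retrieve("q1", k=2, strategy="recency")
best = buffer.retrieve("q1", k=2, strategy="reward")
```

Storing the generating policy's log-probability alongside each output is what later allows an importance ratio to be computed for the off-policy term.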

Architecture

Overview of RePO optimization process involving on-policy and off-policy updates

Evaluation Highlights

+18.4-point absolute gain in average accuracy on math benchmarks for Qwen2.5-Math-1.5B compared to GRPO
+4.1-point absolute gain in average accuracy for Qwen3-1.7B compared to GRPO across seven mathematical reasoning benchmarks
Increases effective optimization steps by 48% while only increasing computational cost by 15% on Qwen3-1.7B (normalized against GRPO baseline)

Breakthrough Assessment

7/10

Significant performance gains and efficiency improvements over the state-of-the-art GRPO method for math reasoning. The approach is a logical extension of RL principles (replay buffers) to the specific GRPO setting.

⚙️ Technical Details

Problem Definition

Setting: Reinforcement Learning for optimizing LLM reasoning policies

Inputs: Prompt q (e.g., a math problem)

Outputs: Token sequence o (reasoning steps and final answer)

Pipeline Flow

Prompt Input
LLM Generation (Policy)
Output Evaluation

System Modules

Policy Model

Generate reasoning chains and answers for given prompts

Model or implementation: Qwen2.5-Math or Qwen3 family

Modeling

Base Model: Qwen2.5-Math-1.5B, Qwen2.5-Math-7B, Qwen3-1.7B

Training Method: Replay-Enhanced Policy Optimization (RePO)

Objective Functions:

Purpose: Optimize policy using current samples relative to group average.

Formally: Standard GRPO loss L_on.

Purpose: Optimize policy using past samples with importance sampling.

Formally: L_off using importance ratio r = π(o|q)/π_off(o|q) clipped by ε.

Purpose: Combined objective.

Formally: L = L_on + L_off.
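The three objectives above can be sketched as follows, assuming a PPO-style clipped surrogate for both terms. This is a simplified illustration rather than the paper's implementation: it treats each output as a single unit (implementations typically work token by token), it omits any KL penalty, and the value `eps=0.2` and all function names are assumptions.

```python
import math

def clipped_surrogate(logp_new, logp_behavior, advantage, eps=0.2):
    """PPO-style clipped term for one sample; eps=0.2 is an assumed value."""
    ratio = math.exp(logp_new - logp_behavior)  # r = pi(o|q) / pi_behavior(o|q)
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)

def batch_loss(samples, eps=0.2):
    """Negative mean surrogate over (logp_new, logp_behavior, advantage) triples."""
    terms = [clipped_surrogate(n, b, a, eps) for n, b, a in samples]
    return -sum(terms) / len(terms)

def repo_loss(on_policy, off_policy, eps=0.2):
    """Combined loss L = L_on + L_off."""
    return batch_loss(on_policy, eps) + batch_loss(off_policy, eps)

# On-policy samples: the behavior policy is the one that just generated them.
on_batch = [(-1.0, -1.0, 1.0), (-2.0, -2.0, -1.0)]
# Off-policy sample: the behavior log-prob comes from the replay buffer.
off_batch = [(math.log(0.5), math.log(0.25), 1.0)]

total = repo_loss(on_batch, off_batch)
```

In the off-policy example the raw ratio is about 2, so clipping caps its contribution at 1 + eps; this is what keeps stale samples from dominating the update.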

Training Data:

Subset of 1024 examples randomly sampled from DeepMath dataset

Key Hyperparameters:

temperature: 0.2
top_p: 0.95
on_policy_samples: 8
off_policy_samples: 8

Compute: Increases computational cost by ~15% relative to GRPO for equivalent sample settings

Comparison to Prior Work

vs. GRPO: RePO adds off-policy updates from a replay buffer, whereas GRPO is strictly on-policy.
vs. PPO: RePO (like GRPO) does not require a separate value model network, reducing memory overhead [not cited in paper]
vs. Dr. GRPO: RePO is a framework that can be applied on top of Dr. GRPO to further improve performance (shown in Table 2).

Limitations

Optimal replay strategy (Recency vs. Reward-oriented) is model-dependent and requires tuning
Increases computational cost by 15% compared to standard GRPO
Evaluation limited to math and general reasoning benchmarks; not tested on creative writing or coding specifically

Reproducibility

Code: https://github.com/SihengLi99/RePO

Training uses a 1024-example subset of DeepMath. The replay strategy is chosen per model type: Recency for base models, Reward-oriented for instruct models.

📊 Experiments & Results

Evaluation Setup

Mathematical and general reasoning tasks using open-source LLMs

Benchmarks:

GSM8K (Grade school math)
MATH-500 (Challenging math problems)
OlympiadBench (Olympiad-level math)
MMLU-Pro (General reasoning)

Metrics:

Pass@1 accuracy
Avg@32 (for AIME24, AIME25, AMC)
Statistical methodology: Not explicitly reported in the paper
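As a rough illustration of the two metrics, the sketch below computes both from boolean correctness flags. It assumes Avg@32 denotes accuracy averaged over 32 sampled answers per problem, which is the usual reading of the name but is not spelled out in this summary; the function names are hypothetical.

```python
def pass_at_1(first_answer_correct):
    """Percentage of problems whose first generated answer is correct."""
    return 100.0 * sum(first_answer_correct) / len(first_answer_correct)

def avg_at_k(correct_per_problem):
    """Mean per-problem accuracy (in percent) over k sampled answers each."""
    per_problem = [sum(flags) / len(flags) for flags in correct_per_problem]
    return 100.0 * sum(per_problem) / len(per_problem)

# Four problems, one answer each: three correct.
p1 = pass_at_1([True, False, True, True])

# Two problems with 32 samples each: half correct on one, none on the other.
a32 = avg_at_k([[True] * 16 + [False] * 16, [False] * 32])
```

Averaging over many samples reduces the variance of the score on small benchmarks, which is a common reason for reporting it on competition sets such as AIME and AMC.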

Key Results

Computational efficiency analysis showing RePO provides more optimization steps per unit of compute compared to GRPO (normalized comparison).
Benchmark	Metric	GRPO (baseline)	RePO	Δ
Analytical Study (Qwen3-1.7B)	Computational Cost (Relative)	100	115	+15
Analytical Study (Qwen3-1.7B)	Effective Optimization Steps (Relative)	100	148	+48
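Taken together, the two normalized rows imply roughly 29% more effective optimization steps per unit of compute. This figure is simple arithmetic on the values reported above, not a number stated in the paper summary:

```latex
\frac{\text{steps}_{\text{RePO}} / \text{cost}_{\text{RePO}}}
     {\text{steps}_{\text{GRPO}} / \text{cost}_{\text{GRPO}}}
= \frac{148 / 115}{100 / 100}
\approx 1.29
```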

Main Takeaways

RePO consistently outperforms GRPO across all tested models (Qwen2.5/Qwen3) and benchmarks, with gains up to 18.4 points on Qwen2.5-Math-1.5B.
The 'Split' strategy (estimating on-policy and off-policy advantages separately) significantly outperforms the 'Mixed' strategy, suggesting that separating these distributions reduces interference.
Replay strategies are model-dependent: 'Recency-based' works best for base models to prevent drift, while 'Reward-oriented' works best for instruct models to reinforce desirable behaviors.
RePO generalizes well to other algorithms, providing additive gains when applied on top of Dr. GRPO.
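The difference between the 'Split' and 'Mixed' strategies in the takeaways above can be made concrete with a small sketch. It assumes 'Mixed' means pooling on-policy and off-policy rewards into a single group before normalizing, and it uses mean/std normalization with an assumed `eps` stabilizer; both are illustrative assumptions and the function names are hypothetical.

```python
from statistics import mean, pstdev

def normalize(rewards, eps=1e-6):
    """Group-relative advantages: (reward - mean) / (std + eps)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def split_advantages(on_rewards, off_rewards):
    """Normalize each batch against its own statistics."""
    return normalize(on_rewards), normalize(off_rewards)

def mixed_advantages(on_rewards, off_rewards):
    """Normalize both batches against pooled statistics (assumed meaning)."""
    pooled = normalize(on_rewards + off_rewards)
    return pooled[:len(on_rewards)], pooled[len(on_rewards):]

on_rewards = [0.0, 0.0, 0.0, 1.0]   # current policy: mostly incorrect
off_rewards = [1.0, 1.0, 0.0, 1.0]  # replayed samples: mostly correct

split_on, split_off = split_advantages(on_rewards, off_rewards)
mixed_on, mixed_off = mixed_advantages(on_rewards, off_rewards)
```

Under the pooled normalization, the advantage of the single correct on-policy output drops from about 1.73 to about 1.0 purely because of which samples were replayed. This coupling between the two batches is one plausible reading of the "interference" that the Split strategy avoids.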

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning fundamentals (Policy Gradient, Importance Sampling)
Large Language Model training
Proximal Policy Optimization (PPO)

Key Terms

GRPO: Group Relative Policy Optimization—an RL algorithm that estimates advantages by comparing a group of outputs for the same prompt against the group average, avoiding a separate value model

Replay Buffer: A storage mechanism that saves past experiences (prompts, outputs, probabilities) to be reused for off-policy training

On-policy: Learning updates computed using data generated by the current version of the model policy

Off-policy: Learning updates computed using data generated by previous versions of the model policy (retrieved from a buffer)

Importance Sampling: A technique to estimate properties of a target distribution using samples from a different distribution, reweighting them by the ratio of their probabilities

KL divergence: Kullback-Leibler divergence—a statistical distance measuring how one probability distribution differs from a reference distribution

Pass@1: An evaluation metric measuring the percentage of problems where the model's first generated answer is correct
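The importance sampling idea defined above can be illustrated with a toy estimate over a two-outcome distribution. The distributions, sample count, and function name are made up for illustration and are unrelated to the paper's experiments.

```python
import random

def importance_sampling_estimate(samples, target, behavior, f):
    """Estimate E_target[f(x)] from samples drawn from the behavior distribution."""
    weights = [target[x] / behavior[x] for x in samples]
    return sum(w * f(x) for w, x in zip(weights, samples)) / len(samples)

target = {0: 0.2, 1: 0.8}    # distribution we care about (e.g. the current policy)
behavior = {0: 0.5, 1: 0.5}  # distribution the samples came from (e.g. an older policy)

rng = random.Random(0)  # fixed seed so the sketch is reproducible
samples = rng.choices(list(behavior), weights=list(behavior.values()), k=20000)

estimate = importance_sampling_estimate(samples, target, behavior, lambda x: x)
exact = sum(p * x for x, p in target.items())
```

Reweighting each sample by the target-to-behavior probability ratio brings the estimate close to the exact expectation of 0.8 even though only about half of the drawn samples equal 1. The same ratio is what the off-policy term uses to correct for samples generated by earlier policies.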