Data-Efficient RLVR via Off-Policy Influence Guidance

📝 Paper Summary

Reinforcement Learning with Verifiable Rewards (RLVR) Data Selection

CROPI accelerates reasoning model training by selecting influential data using an efficient off-policy gradient estimator that avoids costly online sampling.

Core Problem

Applying theoretically grounded influence functions to RLVR is impractical because estimating the gradient of an evolving policy requires computationally prohibitive new rollouts for every data point.

Why it matters:

Current data selection methods rely on heuristics (e.g., difficulty) that lack theoretical guarantees and fail to adapt to the model's changing needs during training
Standard influence estimation requires computing gradients on current policy samples; doing this 'online' for LLMs is too slow and expensive due to inference latency
High-dimensional gradients in large language models create massive storage and computation bottlenecks for influence-based selection

Concrete Example: In a standard RLVR loop, deciding whether a math problem is useful for the current policy requires generating multiple solutions (rollouts) to estimate its gradient. Repeating this for a 50k-example dataset at every training step is computationally prohibitive.
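To put rough numbers on the example above, here is a back-of-the-envelope count. Only the 50k dataset size comes from the summary; the rollouts per prompt and the number of training steps are illustrative assumptions.

```python
# Back-of-the-envelope rollout count for naive on-policy influence scoring.
# Only dataset_size comes from the text; the other values are assumptions.
dataset_size = 50_000        # "50k dataset" from the example
rollouts_per_prompt = 8      # assumed K; not reported in the paper
training_steps = 100         # assumed number of policy updates

rollouts_per_step = dataset_size * rollouts_per_prompt
total_rollouts = rollouts_per_step * training_steps

print(f"{rollouts_per_step:,} rollouts per step")
print(f"{total_rollouts:,} rollouts over {training_steps} steps")
```

Under these assumptions every policy update would need hundreds of thousands of fresh generations just for scoring, before any training happens.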

Key Novelty

Curriculum RL with Off-Policy Influence Guidance (CROPI)

Estimates the 'influence' of data points on the current policy using trajectories collected from an *old* behavior policy (off-policy), eliminating the need for real-time rollouts
Compresses massive gradient vectors using a sparse random projection technique that randomly drops dimensions before projection, efficiently preserving influence scores (inner products) with less noise
Iteratively selects a small subset of data that maximizes predicted influence on a validation set, creating a dynamic curriculum that evolves with the model

Architecture

The CROPI framework workflow: (a) Offline trajectory collection, (b) Sparse random projection of gradients, (c) Off-policy influence estimation, and (d) The curriculum loop.

Evaluation Highlights

Achieves 2.66x step-level acceleration on Qwen2.5-1.5B compared to full-dataset training while using only 10% of the data per stage
Consistently outperforms heuristic baselines (Learnability, Pass Rate) and global influence methods across GSM8K and MATH benchmarks
Demonstrates robust generalization to 'untargeted' tasks (test sets not used for validation data selection)

Breakthrough Assessment

8/10

Successfully adapts influence functions (a rigorous but expensive tool) to the RLVR setting by removing the primary bottleneck, rollout cost. The reported 2.66x step-level speedup on Qwen2.5-1.5B, obtained with only 10% of the data per stage, is substantial.

⚙️ Technical Details

Problem Definition

Setting: Markov Decision Process (MDP) for reasoning where rewards are outcome-based (1 for correct answer, 0 otherwise)

Inputs: Reasoning problem prompts (state s)

Outputs: Chain-of-thought solution and final answer (action sequence a)

Pipeline Flow

Offline Data Collection (Generate trajectories with initial policy)
Off-Policy Influence Estimation (Compute gradients using old trajectories)
Data Selection (Select top data points based on influence on validation set)
Policy Optimization (Train on selected subset)
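The four steps above can be sketched as a loop. This is a minimal skeleton under stated assumptions, not the authors' implementation: `curriculum_loop`, its callback arguments, and the toy stand-ins are all hypothetical names introduced for illustration.

```python
from typing import Callable, List, Sequence

def curriculum_loop(
    prompts: Sequence[str],
    collect_offline: Callable[[Sequence[str]], dict],
    score_influence: Callable[[dict, object], List[float]],
    train_on: Callable[[object, List[str]], object],
    policy: object,
    num_stages: int,
    selection_ratio: float = 0.1,
) -> object:
    """Collect trajectories once, then score, select, and train at each stage."""
    offline = collect_offline(prompts)                   # step 1: done once
    k = max(1, round(len(prompts) * selection_ratio))
    for _ in range(num_stages):
        scores = score_influence(offline, policy)        # step 2: reuse old rollouts
        ranked = sorted(range(len(prompts)), key=lambda i: scores[i], reverse=True)
        subset = [prompts[i] for i in ranked[:k]]        # step 3: keep top-k prompts
        policy = train_on(policy, subset)                # step 4: e.g. a GRPO stage
    return policy

# Toy stand-ins so the sketch runs end to end (no real rollouts or training).
toy_prompts = [f"p{i}" for i in range(20)]
selected_per_stage = []

def toy_collect(prompts):
    return {p: [] for p in prompts}

def toy_score(offline, policy):
    # Scores shift with the policy "version", so the selection changes per stage.
    return [float((i + policy) % 20) for i in range(len(offline))]

def toy_train(policy, subset):
    selected_per_stage.append(subset)
    return policy + 1

final_policy = curriculum_loop(
    toy_prompts, toy_collect, toy_score, toy_train, policy=0, num_stages=3
)
print(selected_per_stage)
```

The point of the structure is that trajectory collection sits outside the loop, while scoring and selection are repeated as the policy changes.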

System Modules

Offline Trajectory Collector

Generate K trajectories for every training prompt using the initial behavior policy

Model or implementation: Initial Policy (beta)

POPI Estimator

Estimate the gradient of the *current* policy using importance sampling on offline trajectories, then compress and compute influence

Model or implementation: Current Policy checkpoint
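The importance-sampling idea behind this module can be illustrated on a toy problem. The sketch below assumes a single-step softmax policy over three actions and a plain importance-weighted score-function estimator; the paper's exact estimator (for example any clipping or advantage normalization) is not given in this summary, and all function names are hypothetical.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def off_policy_gradient(theta, beta_probs, samples, rewards):
    """Estimate dJ/dtheta for the current policy from actions sampled under beta."""
    pi = softmax(theta)
    grad = [0.0] * len(theta)
    for a in samples:
        w = pi[a] / beta_probs[a]              # importance ratio pi_theta / beta
        for j in range(len(theta)):
            # d log pi(a) / d theta_j for a softmax policy is 1[j == a] - pi[j]
            dlogp = (1.0 if j == a else 0.0) - pi[j]
            grad[j] += w * rewards[a] * dlogp
    return [g / len(samples) for g in grad]

def exact_gradient(theta, rewards):
    """Closed-form gradient of the expected reward, used only as a reference."""
    pi = softmax(theta)
    expected = sum(p * r for p, r in zip(pi, rewards))
    return [pi[j] * (rewards[j] - expected) for j in range(len(theta))]

rng = random.Random(0)
theta = [0.5, -0.2, 0.1]                 # current policy logits (toy values)
beta_probs = softmax([0.0, 0.0, 0.0])    # old behavior policy: uniform
rewards = [1.0, 0.0, 0.0]                # outcome reward: only action 0 is correct
samples = rng.choices(range(3), weights=beta_probs, k=20000)

estimate = off_policy_gradient(theta, beta_probs, samples, rewards)
exact = exact_gradient(theta, rewards)
print([round(g, 3) for g in estimate])
print([round(g, 3) for g in exact])
```

No new samples are drawn from the current policy: the old samples are reweighted, which is what lets influence scoring skip fresh rollouts.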

Policy Trainer

Update the policy using GRPO on the selected subset of data

Model or implementation: Qwen2.5 / DeepSeek-R1-Distill

Novel Architectural Elements

Integration of an off-policy gradient estimator into the data selection loop to decouple influence calculation from real-time generation
Use of sparse random projection specifically for compressing LLM gradients to enable feasible inner-product computation
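A minimal sketch of the compression idea follows, assuming one plausible reading of "drop dimensions, then project": keep a random subset of coordinates and apply a random sign (Rademacher) projection to them. The paper's actual construction may differ; `make_sparse_projector` and its parameters are hypothetical.

```python
import random

def make_sparse_projector(dim, keep, out_dim, seed=0):
    """Return a function that keeps `keep` random coordinates out of `dim`
    and applies a random +/-1 projection down to `out_dim` dimensions."""
    rng = random.Random(seed)
    kept = rng.sample(range(dim), keep)                    # dimension dropping
    signs = [[rng.choice((-1.0, 1.0)) for _ in range(keep)]
             for _ in range(out_dim)]
    # Scaling chosen so projected inner products are unbiased estimates.
    scale = (dim / keep) ** 0.5 / out_dim ** 0.5

    def project(vec):
        sub = [vec[i] for i in kept]
        return [scale * sum(s * v for s, v in zip(row, sub)) for row in signs]

    return project

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

rng = random.Random(1)
g_train = [rng.gauss(0, 1) for _ in range(1000)]         # stand-in "gradient"
g_val = [x + 0.1 * rng.gauss(0, 1) for x in g_train]     # correlated stand-in

project = make_sparse_projector(dim=1000, keep=500, out_dim=256)
exact = dot(g_train, g_val)
approx = dot(project(g_train), project(g_val))
print(f"exact={exact:.1f} approx={approx:.1f}")
```

The same projector must be applied to every gradient being compared, since the inner product is only preserved when both vectors share the kept coordinates and the sign matrix.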

Modeling

Base Model: Qwen2.5-1.5B-Instruct, Qwen2.5-7B-Instruct, DeepSeek-R1-Distill-Qwen-1.5B

Training Method: Curriculum RL with GRPO (Group Relative Policy Optimization)

Objective Functions:

Purpose: Optimize policy to maximize reward.

Formally: GRPO objective (mean of importance-weighted advantages)

Purpose: Select data.

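Formally: Maximize off-policy influence score Inf_beta(pi_theta; s, s') = <g_beta(s), g_beta(s')>, where g_beta(s) is the gradient of the current policy pi_theta at prompt s estimated from trajectories of the behavior policy beta, s is a training prompt, and s' is a validation prompt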
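As a small illustration of the selection objective, the sketch below scores each training gradient by its inner product with validation gradients and keeps the top fraction. Averaging over the validation set is an assumption made here for simplicity, and the gradient vectors are made-up toy values.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def influence_scores(train_grads, val_grads):
    """Mean inner product of each training gradient with the validation gradients."""
    return [sum(dot(g, v) for v in val_grads) / len(val_grads) for g in train_grads]

def select_top(scores, ratio):
    """Indices of the highest-scoring items, keeping a `ratio` fraction (at least one)."""
    k = max(1, round(len(scores) * ratio))
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

# Toy 2-dimensional "gradients"; real ones would be the compressed vectors.
train_grads = [
    [1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.5, 0.5], [2.0, -1.0],
    [0.0, 0.0], [0.1, 0.1], [-0.5, 2.0], [0.3, -0.3], [1.5, 1.0],
]
val_grads = [[1.0, 0.0], [1.0, 0.2]]

scores = influence_scores(train_grads, val_grads)
chosen = select_top(scores, ratio=0.1)
print(chosen)
```

A training prompt whose gradient points in the same direction as the validation gradients scores high; one pointing the opposite way scores negative and is never selected.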

Key Hyperparameters:

selection_ratio_alpha: 0.1
number_of_offline_trajectories_K: Not explicitly reported in the paper (implied standard GRPO setting, likely 4-8)
validation_set_size: max 100 examples

Compute: Not explicitly reported in the paper

Comparison to Prior Work

vs. Learnability/Pass Rate: CROPI uses gradient-based influence rather than scalar heuristics, providing direction-aware selection
vs. Standard Influence: CROPI uses *off-policy* gradients and sparse projection, making it computationally feasible for LLM RL
vs. LESS [not cited in paper]: LESS uses influence for SFT data selection; CROPI adapts this to RL by handling the trajectory generation bottleneck

Limitations

Relies on the assumption that the initial behavior policy stays close enough to the current policy for importance sampling to hold (KL constraint); the off-policy approximation degrades as the policy drifts away from it
Requires a high-quality validation set (though small), which may not be available in every domain

Reproducibility

No replication artifacts mentioned in the paper (no GitHub link found). Dataset construction details (GSM8K, MATH mixtures) are described. Hyperparameters like selection ratio are provided.

📊 Experiments & Results

Evaluation Setup

Mathematical reasoning tasks using Open-Source LLMs

Benchmarks:

GSM8K (Grade school math)
MATH (Challenging math problems)
Gaokao2023EN (College entrance exam math)
OlympiadBench (Math olympiad problems)
AMC23/AIME24 (Math competitions)

Metrics:

Pass@1 (Accuracy)
Step-level acceleration (speedup)
Statistical methodology: Not explicitly reported in the paper
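The two metrics can be made concrete with a short sketch. Pass@1 is taken here as the fraction of problems answered correctly with one attempt each. The summary does not define step-level acceleration, so the code assumes it is the ratio of training steps the baseline needs to reach a target score to the steps the method needs; the curves are hypothetical numbers, not results from the paper.

```python
def pass_at_1(correct_flags):
    """Fraction of problems whose single sampled answer is correct."""
    return sum(correct_flags) / len(correct_flags)

def steps_to_reach(curve, target):
    """First 1-based step at which the score reaches the target, else None."""
    for step, score in enumerate(curve, start=1):
        if score >= target:
            return step
    return None

def step_level_speedup(baseline_curve, method_curve, target):
    """Assumed definition: baseline steps to target divided by method steps to target."""
    b = steps_to_reach(baseline_curve, target)
    m = steps_to_reach(method_curve, target)
    if b is None or m is None:
        return None
    return b / m

# Hypothetical accuracy-per-step curves, for illustration only.
baseline_curve = [0.40, 0.45, 0.50, 0.55, 0.60, 0.62]
method_curve = [0.45, 0.55, 0.60, 0.63]

accuracy = pass_at_1([True, False, True, True])
speedup = step_level_speedup(baseline_curve, method_curve, target=0.60)
print(accuracy, speedup)
```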

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Average across Targeted Tasks	Step-level Speedup	1.00x	2.66x	+1.66x
Targeted Tasks	Pass Rate	Not reported in the paper	Not reported in the paper	Not reported in the paper

CROPI demonstrates superior sample efficiency by achieving better performance with only 10% of the data.

Experiment Figures

Training curves comparing CROPI (using 10% data) vs Full Data vs Baselines on the 1.5B model.

Main Takeaways

Off-policy influence estimation is a viable proxy for on-policy influence, enabling gradient-based data selection in RL without massive compute overhead.
Sparse random projection effectively compresses gradients for influence computation without degrading selection quality, solving the storage bottleneck.
Curriculum-based selection (re-evaluating influence at stages) is crucial because data utility shifts as the policy improves.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (Policy Gradients, Importance Sampling)
Influence Functions (for data attribution)
Large Language Models (reasoning pipelines)

Key Terms

RLVR: Reinforcement Learning with Verifiable Rewards—training models on tasks where correctness can be automatically checked (e.g., math, code)

Influence Function: A measure of how much a specific training data point contributes to the model's performance on a validation set (usually via gradient inner products)

Off-Policy: Learning or estimating values for a target policy using data generated by a different (behavior) policy

GRPO: Group Relative Policy Optimization—an RL algorithm that estimates advantages by normalizing scores within a group of outputs for the same prompt, avoiding a separate critic model
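The group normalization in the GRPO entry can be sketched in a few lines. This assumes the common formulation (subtract the group mean, divide by the group standard deviation plus a small epsilon); implementations differ on details such as population versus sample standard deviation, and `group_relative_advantages` is a hypothetical helper name.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group of sampled outputs."""
    mu = mean(rewards)
    sigma = pstdev(rewards)          # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled answers to the same prompt, with binary outcome rewards.
mixed = group_relative_advantages([1, 0, 0, 1])
all_correct = group_relative_advantages([1, 1, 1])
print(mixed)
print(all_correct)
```

When every output in a group gets the same reward, all advantages are zero and the prompt contributes no learning signal.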

Rollout: The process of generating a complete sequence of tokens (a trajectory) from the policy to estimate rewards

Sparse Random Projection: A dimensionality reduction technique where high-dimensional vectors are projected into a lower-dimensional space using a sparse random matrix, approximately preserving distances and inner products with high probability

CROPI: Curriculum RL with Off-Policy Influence Guidance—the proposed framework

POPI: Practical Off-Policy Influence estimation—the specific metric used to score data utility