CoBA-RL: Capability-Oriented Budget Allocation for Reinforcement Learning in LLMs

📝 Paper Summary

Reinforcement Learning with Verifiable Rewards (RLVR) · LLM Post-training

CoBA-RL optimizes reinforcement learning efficiency by dynamically allocating rollout budgets to tasks based on the model's real-time failure rate, prioritizing consolidation of easy tasks before exploring harder ones.

Core Problem

Standard RL frameworks like GRPO use uniform rollout budgets for all prompts regardless of difficulty, while static adaptive methods fail to account for the model's changing capabilities during training.

Why it matters:

Uniform allocation wastes computational resources on samples that are either too easy (already mastered) or too hard (impossible), reducing training efficiency.
Static difficulty metrics (like historical pass rates) incorrectly assume a task's training value is constant, ignoring that 'difficulty' is relative to the model's current skill level.
Effective reasoning requires balancing exploitation (mastering known tasks) and exploration (finding solutions for hard tasks), which shifts as the model learns.

Concrete Example: In a math training batch, a uniform strategy generates 16 solutions for both a simple '2+2' query and a complex Olympiad problem. The model wastes resources generating 16 correct '2+2' answers (zero marginal gain) while failing to solve the Olympiad problem because 16 attempts were insufficient to find a correct path.
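The "zero marginal gain" in the example above can be made concrete with a small sketch. It assumes the common GRPO formulation in which each rollout's advantage is its reward standardized against the group mean and standard deviation; the function name and the `eps` term are illustrative and not taken from the paper.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Standardize each reward against its own group (assumed GRPO-style form).

    eps is an illustrative guard against division by zero when all rewards match.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Binary verifiable rewards for three hypothetical prompts, 16 rollouts each.
easy = [1.0] * 16                 # '2+2': every rollout is correct
hard = [0.0] * 16                 # Olympiad problem: every rollout fails
mixed = [1.0] * 4 + [0.0] * 12    # a prompt the model solves some of the time

easy_adv = group_relative_advantages(easy)
hard_adv = group_relative_advantages(hard)
mixed_adv = group_relative_advantages(mixed)
```

Under this assumption, both the all-correct and the all-wrong group produce advantages of exactly zero, so neither contributes a learning signal, while the mixed group produces positive advantages for correct rollouts and negative ones for incorrect rollouts.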

Key Novelty

Capability-Oriented Budget Allocation (CoBA-RL)

Quantifies the 'training value' of each task using a Beta distribution that changes shape based on the model's real-time global failure rate.
Implements an 'Exploit -> Explore' strategy: early training prioritizes easy tasks to consolidate basics, while later training shifts resources to harder tasks for exploration.
Uses a heap-based greedy algorithm to efficiently solve the budget allocation problem, maximizing total batch value under a fixed compute budget.

Architecture

The CoBA-RL training workflow, detailing how global failure rates inform the value function which then guides budget allocation.

Evaluation Highlights

+5.62 percentage-point avg@16 accuracy improvement on AIME25 (12.71 to 18.33) with Qwen2.5-7B-Instruct compared to the GRPO baseline.
Achieves higher accuracy (45.52%) with half the budget (2048 rollouts) than GRPO achieves with full budget (4096 rollouts, 42.78%), demonstrating superior data efficiency.
Outperforms static heuristic strategies (Linear Step Decay) and static value functions across multiple math benchmarks.

Breakthrough Assessment

8/10

Significant efficiency gains and strong empirical results on challenging math benchmarks. The dynamic 'Exploit -> Explore' mechanism challenges the conventional uniform or static-difficulty paradigms in RLVR.

⚙️ Technical Details

Problem Definition

Setting: Reinforcement Learning with Verifiable Rewards (RLVR) using Group Relative Policy Optimization (GRPO)

Inputs: Batch of training tasks (prompts) X

Outputs: A rollout count B_i for each of the M tasks in the batch, B = {B_1, ..., B_M}, chosen to maximize aggregate training value

Modeling

Base Model: Qwen2.5-7B-Instruct, Qwen2.5-7B-Base, Qwen3-1.7B-Base, Qwen3-4B-Base

Training Method: Group Relative Policy Optimization (GRPO) with dynamic budget allocation

Objective Functions:

Purpose: Maximize aggregate training value of the batch under a total budget constraint.

Formally: Maximize sum_i V(B_i, p_i) subject to sum_i B_i = B_total.

Purpose: Define training value based on capability and diminishing returns.

Formally: V(B_i, p_i) = Beta(p_i; alpha_t, beta_t) * (1 - exp(-B_i / tau)).

Purpose: Adjust preference density (Beta distribution) based on model capability.

Formally: alpha_t, beta_t are linear mappings of the smoothed, sigmoid-transformed Global Failure Rate.
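A minimal sketch of the value function follows, under several assumptions that the summary does not confirm: p_i is read as task i's estimated pass rate, the sigmoid is centred at a failure rate of 0.5, and the linear map is chosen so that a high failure rate shifts Beta mass toward easy tasks (inferred from the 'Exploit -> Explore' description). The values of `kappa`, `alpha_min`, and `tau` are placeholders, and the failure rate passed in is assumed to be already smoothed.

```python
import math

def beta_pdf(p, alpha, beta):
    """Density of Beta(alpha, beta) at p, computed in log space."""
    p = min(max(p, 1e-6), 1.0 - 1e-6)  # clamp: the density can diverge at 0 or 1
    log_norm = math.lgamma(alpha + beta) - math.lgamma(alpha) - math.lgamma(beta)
    return math.exp(log_norm + (alpha - 1) * math.log(p) + (beta - 1) * math.log(1 - p))

def capability_params(failure_rate, gamma=10.0, kappa=6.0, alpha_min=1.0):
    """ASSUMED mapping from the (already smoothed) global failure rate to (alpha_t, beta_t).

    Only gamma = 10 and alpha + beta = kappa come from the summary; the sigmoid
    centre (0.5), kappa = 6 and alpha_min = 1 are illustrative placeholders.
    """
    s = 1.0 / (1.0 + math.exp(-gamma * (failure_rate - 0.5)))
    alpha = alpha_min + (kappa - 2.0 * alpha_min) * s  # linear in s
    return alpha, kappa - alpha

def training_value(budget, p, alpha, beta, tau=8.0):
    """V(B_i, p_i) = Beta(p_i; alpha_t, beta_t) * (1 - exp(-B_i / tau)); tau is a placeholder."""
    return beta_pdf(p, alpha, beta) * (1.0 - math.exp(-budget / tau))

# Early training (high failure rate) versus late training (low failure rate).
early = capability_params(0.9)
late = capability_params(0.1)
```

With these placeholder settings, an easy task (pass rate 0.9) has the higher value early in training and a hard task (pass rate 0.2) has the higher value late in training, while the saturation term makes each additional rollout worth less than the previous one.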

Training Data:

DAPO-Math-17K dataset

Key Hyperparameters:

rollouts_per_group_G: 16
sigmoid_scaling_gamma: 10
total_budget_B_total: Varies (e.g., 2048, 4096, 8192)
beta_sum_kappa: Constant (alpha + beta = kappa)

Compute: The heap-based allocation algorithm runs in 0.124 seconds for a batch of 512 tasks with a total budget of 8192 rollouts, versus 115.05 seconds for the dynamic programming (DP) baseline. Training uses standard GPUs; the exact count is not reported.
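The heap-based greedy allocation can be sketched as follows. This is an illustration rather than the authors' implementation: it assumes each task's value has the form w_i * (1 - exp(-B_i / tau)), where the weight w_i stands in for the Beta density term, and it hands out one rollout at a time without per-task minimum or maximum caps. The function name and the default `tau` are hypothetical.

```python
import heapq
import math

def allocate_budget(weights, total_budget, tau=8.0):
    """Give each rollout to the task with the largest marginal gain.

    weights[i] is an assumed stand-in for Beta(p_i; alpha_t, beta_t).
    """
    if not weights:
        return []

    def gain(w, b):
        # Marginal value of moving task from b to b + 1 rollouts.
        return w * (math.exp(-b / tau) - math.exp(-(b + 1) / tau))

    budgets = [0] * len(weights)
    # heapq is a min-heap, so gains are negated to pop the largest first.
    heap = [(-gain(w, 0), i) for i, w in enumerate(weights)]
    heapq.heapify(heap)
    for _ in range(total_budget):
        _, i = heapq.heappop(heap)
        budgets[i] += 1
        heapq.heappush(heap, (-gain(weights[i], budgets[i]), i))
    return budgets

skewed = allocate_budget([0.5, 2.0, 1.0], total_budget=24)
uniform = allocate_budget([1.0] * 4, total_budget=64)
```

Each of the B_total steps costs one heap pop and one push, so the loop runs in O(B_total log M) time for M tasks. Because the assumed value function is concave in B_i, taking the largest marginal gain at every step is optimal for this separable problem; with equal weights the sketch falls back to a uniform split.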

Comparison to Prior Work

vs. GRPO: CoBA-RL uses dynamic, non-uniform budgets vs. GRPO's fixed uniform budget.
vs. Knapsack-RL: CoBA-RL adapts the value function based on real-time global capability vs. Knapsack-RL's static value assumption.
vs. RFT (Rejection Sampling Fine-Tuning) [not cited in paper]: CoBA-RL is an RL method optimizing rollouts during training, whereas RFT selects static samples for supervised fine-tuning.
vs. PPO [not cited in paper]: CoBA-RL builds on GRPO (a PPO variant) but specifically optimizes the data collection/rollout phase rather than the update rule itself.

Limitations

Relies on verifiable rewards (binary outcome), limiting applicability to open-ended generation tasks without clear pass/fail criteria.
Requires monitoring global failure rates, which might introduce slight overhead (though negligible per reported runtime).
Performance gains depend on the 'Exploit -> Explore' hypothesis holding true for the specific domain.

Reproducibility

Code: https://github.com/Within-yao/CoBA-RL

Benchmark datasets (AIME, AMC, MATH500, Olympiad) are standard, and hyperparameters such as G=16 and gamma=10 are specified.

📊 Experiments & Results

Evaluation Setup

Mathematical reasoning tasks with binary verifiable rewards.

Benchmarks:

AIME24 (Math Competition)
AIME25 (Math Competition)
AMC23 (Math Competition)
MATH500 (Math Problem Solving)
OLYMPIAD Bench (Olympiad-level Math)

Metrics:

avg@16 accuracy (expected pass rate)
Statistical methodology: Not explicitly reported in the paper
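As a rough illustration of the metric, the sketch below assumes avg@16 means the per-problem fraction of correct answers over 16 sampled responses, averaged over all problems in a benchmark and reported as a percentage. The summary does not spell out the paper's exact definition, and the function name is hypothetical.

```python
def avg_at_k(correct_by_problem):
    """Mean over problems of the fraction of correct samples, as a percentage.

    correct_by_problem: one list of 0/1 outcomes per problem (assumed k samples each).
    """
    per_problem = [sum(samples) / len(samples) for samples in correct_by_problem]
    return 100.0 * sum(per_problem) / len(per_problem)

# Three hypothetical problems with 16 sampled responses each.
outcomes = [
    [1] * 16,             # always solved
    [0] * 16,             # never solved
    [1] * 4 + [0] * 12,   # solved in 4 of 16 samples
]
score = avg_at_k(outcomes)
```

Under this reading, the metric rewards how reliably a problem is solved rather than whether it is solved at least once.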

Key Results

Main comparison on Qwen2.5-7B-Instruct shows CoBA-RL consistently outperforming GRPO and Knapsack-RL baselines.

Benchmark	Metric	Baseline	This Paper	Δ
Average (5 benchmarks)	avg@16 accuracy	42.24	46.78	+4.54
AIME25	avg@16 accuracy	12.71	18.33	+5.62
OLYMPIAD Bench	avg@16 accuracy	41.33	43.11	+1.78
AMC23	avg@16 accuracy	Not reported in the paper	Not reported in the paper	+6.72

Computational efficiency analysis comparing the proposed Heap-Based Greedy strategy against a Dynamic Programming baseline.

Benchmark	Metric	Baseline	This Paper	Δ
Runtime Simulation	Execution Time (seconds)	115.05	0.124	-114.926

Experiment Figures

Training curves (Avg@16 accuracy) on the Olympiad benchmark for different models.

Performance comparison under varying total budget constraints (2048 to 8192).

Main Takeaways

Dynamic budget allocation significantly improves data efficiency; CoBA-RL achieves better results with half the budget compared to uniform GRPO.
The 'Exploit -> Explore' strategy (consolidating easy tasks first) empirically outperforms 'Explore -> Exploit' and static strategies.
Gains are consistent across different model sizes (1.7B to 7B) and types (Base vs. Instruct), suggesting scalability.
The heap-based allocation algorithm is computationally negligible, making it suitable for online RL training loops.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (RL)
Group Relative Policy Optimization (GRPO)
Beta Distribution

Key Terms

RLVR: Reinforcement Learning with Verifiable Rewards—training LLMs on tasks with clear success criteria (e.g., math problems) using RL.

GRPO: Group Relative Policy Optimization—an RL algorithm that estimates advantages by comparing multiple outputs for the same prompt against their group average, eliminating the need for a value network.

Rollout Budget: The number of response trajectories generated for a specific prompt during an RL training step.

Global Failure Rate: The proportion of tasks in a batch that the model fails to solve, used as a proxy for the model's current capability.

Exploit -> Explore: A training strategy where the model first focuses on mastering tasks it is already good at (exploitation) before shifting resources to difficult, uncertain tasks (exploration).

Heap-Based Greedy Strategy: An algorithmic approach using a priority queue to iteratively assign resources to the item providing the highest immediate marginal gain.