Agent-RLVR: Training Software Engineering Agents via Guidance and Environment Rewards

📝 Paper Summary

Software Engineering Agents, Reinforcement Learning for Agents

Agent-RLVR enables effective reinforcement learning for software engineering agents by incorporating teacher guidance (hints, plans, environment feedback) during training to overcome sparse reward landscapes in complex, multi-step tasks.

Core Problem

Standard RLVR fails in agentic settings because multi-step reasoning tasks have extremely sparse rewards; the probability of an agent independently discovering a correct trajectory is too low to provide a learning signal.

Why it matters:

Frontier LLMs struggle with high failure rates in complex, multi-turn environments like software engineering
Interacting with real execution environments is computationally expensive, making inefficient exploration prohibitive
Existing RLVR success in math/coding (single-turn) does not translate to agentic workflows requiring navigation and sequential decision-making

Concrete Example: In a software engineering task requiring navigation through a large repo to fix a bug, a standard agent might fail to locate the relevant file 100% of the time, receiving zero reward and learning nothing. Agent-RLVR provides a 'hint' (guidance) pointing to the correct file, allowing the agent to proceed, succeed, and generate a positive trajectory for RL optimization.
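
To make the sparsity argument concrete, here is a small illustrative calculation. It assumes, purely for illustration, that a trajectory consists of sequential decisions that each succeed independently with the same probability; neither this independence assumption nor the numbers used come from the paper, and the function names are hypothetical.

```python
def trajectory_success_prob(p_step: float, steps: int) -> float:
    """Probability that all `steps` independent decisions succeed (illustrative model)."""
    return p_step ** steps


def prob_any_success(p_traj: float, rollouts: int) -> float:
    """Probability that at least one of `rollouts` sampled trajectories succeeds."""
    return 1.0 - (1.0 - p_traj) ** rollouts


if __name__ == "__main__":
    # Assumed values: 20 steps, each 80% likely to go right.
    p = trajectory_success_prob(p_step=0.8, steps=20)
    print(f"per-trajectory success: {p:.4f}")
    print(f"chance of any success in 8 rollouts: {prob_any_success(p, 8):.4f}")
```

Under this toy model, even a fairly reliable per-step policy rarely completes a long trajectory, which is the regime where guidance is meant to help.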

Key Novelty

Agent Guidance-Augmented RLVR

Injects 'guidance' (plans, file pointers, error corrections) during the training phase to actively steer agents toward successful trajectories when they would otherwise fail
Uses successful guided trajectories to update the policy via Direct Preference Optimization (DPO), allowing the model to learn from successes it couldn't originally achieve on its own
Curates a dataset of 817 environments complete with problem statements and expert guidance specifically for this training loop

Architecture

The Agent-RLVR training loop illustrating the interaction between the agent, the environment, and the guidance mechanism.

Evaluation Highlights

Improves Qwen-2.5-72B-Instruct Pass@1 from 9.4% to 22.4% on SWE-Bench Verified (main result)
Adding guidance raises Pass@32 from 34.2% to 38.4%, indicating that guidance helps the agent explore and discover more solutions
Training a reward model on the RLVR data further boosts Pass@1 to 27.8% (using Best-of-32 ranking)
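
The Best-of-32 ranking mentioned above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `score` is a hypothetical stand-in for the trained reward model, and candidates are treated as opaque objects such as patch strings.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def best_of_n(candidates: Sequence[T], score: Callable[[T], float]) -> T:
    """Return the candidate with the highest score.

    `score` is a placeholder for a reward model; ties go to the earliest candidate.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=score)
```

With N = 32, the agent would sample 32 patches for one issue and submit only the one that `score` ranks highest, so the reported Pass@1 depends on the ranker's quality as well as on the policy.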

Breakthrough Assessment

8/10

Adapts RLVR, previously applied mostly to single-turn math and coding problems, to complex multi-step agentic tasks. The jump from 9.4% to 22.4% Pass@1 with a relatively small dataset (817 environments) is substantial and demonstrates high data efficiency.

⚙️ Technical Details

Problem Definition

Setting: Software Engineering (SWE) tasks where an agent must fix a GitHub issue in a real repository

Inputs: Natural language issue description and a snapshot of a codebase

Outputs: A code patch (diff) that passes unit tests

Pipeline Flow

Trajectory Generation (Agent attempts task)
Validation (Run unit tests in Docker)
Guidance Injection (If fail, re-attempt with hints)
Policy Update (DPO training on success/fail pairs)
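
The first three pipeline stages can be sketched as a small data-collection routine. This is a simplified sketch under stated assumptions: the `Agent`, `Validator`, and `Guide` callables are hypothetical stand-ins for the agent scaffold, the Docker test run, and the guidance generator, and only one unguided and one guided attempt are made per task, whereas the paper samples many trajectories.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Trajectory:
    task_id: str
    patch: str
    guided: bool   # True if the attempt was made with a hint
    passed: bool   # True if the unit tests passed


# Hypothetical interfaces (not from the paper):
Agent = Callable[[str, Optional[str]], str]   # (task, hint or None) -> patch
Validator = Callable[[str, str], bool]        # (task, patch) -> tests pass?
Guide = Callable[[str, str], str]             # (task, failed patch) -> hint


def collect(task: str, agent: Agent, validate: Validator, guide: Guide) -> List[Trajectory]:
    """Attempt a task, validate it, and re-attempt with guidance on failure."""
    patch = agent(task, None)
    trajectories = [Trajectory(task, patch, guided=False, passed=validate(task, patch))]
    if not trajectories[0].passed:
        hint = guide(task, patch)
        retry = agent(task, hint)
        trajectories.append(Trajectory(task, retry, guided=True, passed=validate(task, retry)))
    return trajectories
```

The returned trajectories would then feed the final stage, where passing and failing attempts are paired for the DPO update.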

System Modules

Agent Scaffold

Manages agent loop: localization (finding files) and repair (writing patches)

Model or implementation: Qwen-2.5-72B-Instruct

Environment/Validator

Executes code and runs unit tests to provide verifiable reward

Model or implementation: Docker container with specific repo environment

Guidance Generator

Generates hints (plans, location pointers, error feedback) for failed trajectories

Model or implementation: claude-3-7-sonnet-20250219 (External LLM)

Novel Architectural Elements

Guidance-augmented training loop: integrates an external teacher (guidance) that actively repairs failed trajectories during data collection, so that the RL buffer is populated with successful examples

Modeling

Base Model: Qwen-2.5-72B-Instruct

Training Method: Direct Preference Optimization (DPO)

Objective Functions:

Purpose: Optimize policy to prefer winning trajectories (passing tests) over losing ones.

Formally: L_DPO(π_θ; π_ref) = -E_{(x, y_w, y_l) ~ D} [log σ(β * log(π_θ(y_w|x)/π_ref(y_w|x)) - β * log(π_θ(y_l|x)/π_ref(y_l|x)))]
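
For a single preference pair, the objective above reduces to a few lines of arithmetic. The sketch below assumes the four sequence log-probabilities have already been computed; the default `beta=0.1` is an assumed placeholder, since the summary does not report the value used.

```python
import math


def dpo_loss(logp_w: float, logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1) -> float:
    """Per-pair DPO loss: -log sigmoid(beta * (winner log-ratio - loser log-ratio)).

    logp_*     : log pi_theta(y|x) for the winning (w) and losing (l) trajectory
    ref_logp_* : log pi_ref(y|x) for the same trajectories
    beta       : assumed value, not taken from the paper
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch for numerical stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy equals the reference, the margin is zero and the loss is log 2; the loss falls as the policy raises the winner's likelihood relative to the loser's.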

Training Data:

817 training environments (593 from SWE-Gym, 219 self-collected)
8186 total trajectories generated via sampling
Positive trajectories sourced from successful unguided runs AND successful guided runs
Pairs constructed by matching correct trajectory with incorrect (non-guided) trajectory
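
The pairing rule described above can be sketched as follows. The record layout (dicts with `task_id`, `passed`, and `guided` keys) is hypothetical, and pairing every winner with every eligible loser from the same task is an assumption; the summary does not specify how many pairs are formed per task.

```python
from itertools import product
from typing import Dict, List, Tuple


def build_pairs(trajectories: List[dict]) -> List[Tuple[dict, dict]]:
    """Build (winner, loser) preference pairs within each task.

    Winners: any passing trajectory, guided or unguided.
    Losers:  failing trajectories produced without guidance.
    """
    by_task: Dict[str, List[dict]] = {}
    for traj in trajectories:
        by_task.setdefault(traj["task_id"], []).append(traj)

    pairs: List[Tuple[dict, dict]] = []
    for group in by_task.values():
        winners = [t for t in group if t["passed"]]
        losers = [t for t in group if not t["passed"] and not t["guided"]]
        pairs.extend(product(winners, losers))
    return pairs
```

Tasks with no passing trajectory, or with no unguided failure, contribute no pairs under this scheme.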

Key Hyperparameters:

learning_rate: 1e-6 (DPO), 1e-5 (SFT)
epochs: 1 (DPO), 5 (SFT)
sequence_length: 8k

Compute: 4 nodes with 8 H100 GPUs each (32 H100s total) for 10 hours (72B model)

Comparison to Prior Work

vs. Agentless: Agent-RLVR adds a learning loop (RLVR) to update the model weights, whereas Agentless is a fixed pipeline.
vs. Standard RLVR/STaR: Agent-RLVR introduces guidance (teacher-provided hints) to overcome the sparsity of rewards in complex agent settings, whereas STaR relies on the model solving problems independently.
vs. Llama-3-70B-Instruct (SWE-Bench Verified baseline): Agent-RLVR outperforms open-weight baselines on the Verified subset. The paper does not explicitly cite this model as a baseline; the comparison is implied by leaderboard context only.

Limitations

High computational cost for training environments (requires executing Docker containers for every reward signal)
Relies on a powerful external model (Claude 3.7) for generating high-quality guidance
Scaffold simplifications (patch selection was removed) might limit the performance ceiling compared to the full Agentless pipeline
No statistical significance tests reported for the improvements

Reproducibility

Not provided: Code repository URL is not present. Dataset is described but not linked. Prompt templates for guidance are described conceptually but not provided verbatim. Uses closed-source model (Claude 3.7 Sonnet) for generating guidance data.

📊 Experiments & Results

Evaluation Setup

Software Engineering agents fixing bugs in Python repositories

Benchmarks:

SWE-Bench Verified (real-world software engineering, bug fixing)

Metrics:

Pass@1
Pass@32
Best@1
Statistical methodology: Not explicitly reported in the paper
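
Pass@k values such as those listed above are commonly computed with the unbiased estimator 1 - C(n-c, k) / C(n, k), where n samples are drawn per task and c of them are correct. The sketch below implements that standard estimator; whether the paper uses this exact estimator is an assumption, since the summary does not say.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task.

    n: number of samples drawn, c: number of correct samples, k: budget (k <= n).
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect samples exist.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

The benchmark-level figure is the mean of this per-task value over all tasks.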

Key Results

Main results demonstrating the effectiveness of Agent-RLVR compared to the base model and a strong SFT baseline on SWE-Bench Verified.
Benchmark	Metric	Baseline (%)	This Paper (%)	Δ (points)
SWE-Bench Verified	Pass@1	9.4 (base model)	22.4	+13.0
SWE-Bench Verified	Pass@1	17.2 (SFT baseline)	22.4	+5.2

Results showing the added value of using a Reward Model trained on the RLVR data.
Benchmark	Metric	Baseline (%)	This Paper (%)	Δ (points)
SWE-Bench Verified	Pass@1	22.4	27.8	+5.4

Impact of guidance on generation diversity and success rates during data collection.
Benchmark	Metric	Baseline (%)	This Paper (%)	Δ (points)
SWE-Bench Verified	Pass@32	34.2	38.4	+4.2

Experiment Figures

Pass rates (Pass@k) for base model vs. guidance-augmented model, showing guidance helps discover more solutions.

Main Takeaways

Agent-RLVR significantly improves performance on agentic tasks where rewards are sparse, more than doubling the base model's Pass@1.
Guidance is critical: it serves as a pedagogical tool that helps the agent traverse difficult solution spaces it couldn't solve alone, effectively 'densifying' the reward landscape.
The data generated via Agent-RLVR is dual-purpose: effective for policy training (DPO) and highly effective for training reward models (RM) for test-time reranking.
Improvements are achieved with high data efficiency (only 817 training environments used).

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (RL) concepts (policy, reward, trajectories)
Direct Preference Optimization (DPO)
Software Engineering workflows (unit tests, repositories, patches)
LLM Agent architectures (scaffolding, tool use)

Key Terms

RLVR: Reinforcement Learning from Verifiable Rewards—training models using objective success signals (like passing tests) rather than human preference labels

Pass@k: The probability that at least one correct solution is generated when k samples are produced

SWE-Bench Verified: A benchmark for evaluating software engineering agents, consisting of real-world GitHub issues and verified unit tests

DPO: Direct Preference Optimization—an algorithm that optimizes a language model to prefer winning responses over losing ones without explicitly training a separate reward model

scaffold: The fixed code structure or logic flow that manages the agent's interaction with the environment (e.g., parsing outputs, executing tools)

trajectories: The sequence of actions and observations generated by an agent while attempting to solve a task

instruct-tuning: Supervised fine-tuning (SFT) of a model on instruction-response pairs to teach it to follow directions

Best-of-N: An inference strategy where N solutions are generated and the best one is selected (often by a reward model)

unit tests: Automated code tests that verify if a specific part of the software works as expected; used here as the ground-truth reward signal

SFT: Supervised Fine-Tuning—training a model on labeled examples

KL divergence: A statistical measure of how one probability distribution differs from a second, reference probability distribution; used as a penalty to prevent the model from deviating too much from its initial state
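
As a small illustration of the KL divergence entry, the sketch below computes KL(p || q) for two discrete distributions over the same outcomes. It is a generic textbook computation, not code from the paper; the function name is illustrative.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) in nats for discrete distributions on the same support."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue          # terms with p_i = 0 contribute nothing
        if qi == 0.0:
            return math.inf   # p puts mass where q has none
        total += pi * math.log(pi / qi)
    return total
```

The value is zero when the two distributions match and grows as p shifts mass away from q, which is why it works as a penalty for drifting from a reference policy.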