SWE-Fuse: Empowering Software Agents via Issue-free Trajectory Learning and Entropy-aware RLVR Training

📝 Paper Summary

Software Engineering Agents · Automated Bug Repair · Reinforcement Learning for Reasoning

SWE-Fuse improves automated bug fixing by training agents on 'issue-free' data to force debugging via test cases and applying entropy-adaptive reinforcement learning to balance exploration and stability.

Core Problem

Real-world software issue descriptions are often noisy, misleading, or misaligned with the actual fix, causing agents to hallucinate solutions instead of debugging the code.

Why it matters:

Misaligned descriptions in datasets like SWE-bench mislead agents (e.g., description complains about warnings, but the fix involves image encoding logic)
High-quality Issue-PR pairs are scarce (e.g., 30% of SWE-smith samples have empty problem statements)
Standard training often allows models to overfit to text descriptions rather than learning robust step-by-step debugging reasoning

Concrete Example: An issue description reports a 'TypeError' in `warnings_handler`, but the actual ground-truth patch fixes TIFF image saving logic in `TiffImagePlugin.py`. An agent relying on the text would waste time investigating the warnings module, whereas an agent trained to debug test failures would locate the actual image encoding error.

Key Novelty

Issue-Free Trajectory Learning & Entropy-Aware RLVR

Removes issue descriptions from a subset of training data, forcing the model to solve problems solely by running tests and analyzing execution feedback (debugging) rather than reading the prompt
Uses entropy-aware RLVR (Reinforcement Learning with Verifiable Reward) that dynamically adjusts the PPO clipping range: looser clipping when model entropy (uncertainty) is high to encourage exploration, and tighter clipping when entropy is low to ensure stability
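
The entropy-to-clipping relationship described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the linear interpolation, the function names, and the default values `eps_min=0.1` and `eps_max=0.3` are all assumptions chosen for the example. The summary only states that the clipping range loosens as entropy rises and tightens as it falls.

```python
import math

def normalized_entropy(probs):
    """Shannon entropy of a distribution divided by log(K), so it lies in [0, 1]."""
    k = len(probs)
    if k < 2:
        return 0.0
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(k)

def adaptive_epsilon(h_norm, eps_min=0.1, eps_max=0.3):
    """Hypothetical linear map from normalized entropy to a clipping radius.

    High entropy (uncertain policy) -> radius near eps_max (looser clipping).
    Low entropy (confident policy)  -> radius near eps_min (tighter clipping).
    """
    h = min(max(h_norm, 0.0), 1.0)  # guard against values outside [0, 1]
    return eps_min + (eps_max - eps_min) * h

uniform = [0.25, 0.25, 0.25, 0.25]   # maximally uncertain over 4 options
peaked = [0.97, 0.01, 0.01, 0.01]    # nearly deterministic

eps_uniform = adaptive_epsilon(normalized_entropy(uniform))
eps_peaked = adaptive_epsilon(normalized_entropy(peaked))
```

Under these assumed bounds, the uniform distribution receives the widest radius and the peaked one a radius close to `eps_min`. Any monotone map bounded by the two limits would serve the same purpose.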

Architecture

Overview of the SWE-Fuse framework, illustrating the data construction, issue-free SFT, and entropy-aware RLVR phases.

Evaluation Highlights

Achieves 60.2% solve rate on SWE-bench Verified with SWE-Fuse-Qwen3-32B, setting a new state-of-the-art for open-source 32B models
Achieves 43.0% solve rate with the smaller SWE-Fuse-Qwen3-8B model
Test-time scaling (TTS@8) further boosts performance to 65.2% (32B model) and 49.8% (8B model)

Breakthrough Assessment

8/10

Strong performance on SWE-bench Verified (60.2% for the 32B model) suggests that the 'issue-free' training strategy effectively addresses the data quality bottleneck in software engineering agents.

⚙️ Technical Details

Problem Definition

Setting: Sequential decision-making process where an agent interacts with a software repository to resolve GitHub issues

Inputs: Initial state containing issue description (optional) and repository sandbox

Outputs: A sequence of reasoning steps and bash commands, culminating in a `submit` action with a code patch

Pipeline Flow

Reasoning Generation (Trace)
Action Selection (Tool Call)
Environment Execution (Sandbox)
Observation Feedback
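
The four pipeline stages form a loop that repeats until the agent submits or runs out of turns. The sketch below illustrates that loop under stated assumptions: `run_episode`, `scripted_policy`, and `fake_sandbox` are hypothetical names, the policy is a fixed script rather than an LLM, and the sandbox returns stub text instead of executing anything in Docker. Only the `submit` action and the 100-turn cap come from the summary.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, str, str]  # (thought, command, observation)

def run_episode(
    policy: Callable[[List[Step]], Tuple[str, str]],
    sandbox: Callable[[str], str],
    max_turns: int = 100,
) -> Tuple[List[Step], bool]:
    """Run the reason -> act -> execute -> observe loop until `submit` or the turn cap."""
    history: List[Step] = []
    for _ in range(max_turns):
        thought, command = policy(history)      # reasoning trace + tool call
        if command.strip() == "submit":
            history.append((thought, command, ""))
            return history, True
        observation = sandbox(command)           # environment execution
        history.append((thought, command, observation))  # observation feedback
    return history, False                        # turn budget exhausted

def scripted_policy(history: List[Step]) -> Tuple[str, str]:
    """Stand-in for the LLM policy: replays a fixed two-step script."""
    script = [
        ("Run the tests to see what fails.", "pytest -q"),
        ("The fix is in place.", "submit"),
    ]
    return script[min(len(history), len(script) - 1)]

def fake_sandbox(command: str) -> str:
    """Stand-in for the Docker sandbox: returns stub text, executes nothing."""
    return f"(stub output for: {command})"

history, submitted = run_episode(scripted_policy, fake_sandbox)
```

In a real setup the policy callable would wrap a model call and the sandbox callable would run the command inside the repository container.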

System Modules

Agent Policy

Generate reasoning thoughts and bash commands based on history

Model or implementation: Qwen3-8B or Qwen3-32B

Sandbox Environment

Execute bash commands and return results

Model or implementation: Docker-based environment (Mini-SWE-Agent-Plus scaffold)

Novel Architectural Elements

Entropy-driven clipping mechanism in the PPO loss function (modifying the optimization objective rather than the inference architecture)

Modeling

Base Model: Qwen3-8B and Qwen3-32B

Training Method: Issue-free-driven SFT followed by Entropy-aware RLVR

Objective Functions:

Purpose: Maximize likelihood of expert trajectories (SFT).

Formally: Standard autoregressive negative log-likelihood (cross-entropy) loss.

Purpose: Optimize policy via RL while managing stability based on uncertainty.

Formally: PPO clipped objective where the clipping radius epsilon is a function of the normalized entropy H_norm.
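
Written out, the two objectives take roughly the following form. This is a reconstruction from the verbal description, using the standard SFT and PPO notation. The summary does not give the functional form of the clipping radius or say whether entropy is measured per token or per sequence, so the block only records the bound that the radius stays between epsilon_min and epsilon_max.

```latex
\begin{align}
\mathcal{L}_{\mathrm{SFT}}(\theta)
  &= -\sum_{t=1}^{T} \log \pi_\theta\!\left(y_t \mid y_{<t},\, x\right) \\
\mathcal{J}_{\mathrm{RL}}(\theta)
  &= \mathbb{E}_t\!\left[ \min\!\Big( r_t(\theta)\,\hat{A}_t,\;
     \operatorname{clip}\!\big(r_t(\theta),\, 1-\epsilon(H_{\mathrm{norm}}),\,
     1+\epsilon(H_{\mathrm{norm}})\big)\,\hat{A}_t \Big) \right] \\
r_t(\theta)
  &= \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)},
  \qquad
  \epsilon_{\min} \le \epsilon(H_{\mathrm{norm}}) \le \epsilon_{\max}
\end{align}
```

Here x is the prompt, y_t the expert trajectory tokens, and \hat{A}_t the advantage estimate. The radius is non-decreasing in H_norm, which matches the stated behavior of looser clipping at high entropy.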

Adaptation: Full fine-tuning (implied, not explicitly restricted)

Training Data:

Repo Collection: 33,274 issues from SWE-smith
Trajectory Generation: Gemini 3 used as teacher to generate 14k correct trajectories
Filtering: Removed 'git log'/'git show' commands to prevent metadata exploitation
Issue-free split: Subset of data has issue descriptions removed (i = empty)
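
The filtering and issue-free steps can be sketched as a small data-preparation pass. Several details here are assumptions made for illustration: the `Trajectory` fields, the choice to drop a whole trajectory when it contains `git log` or `git show` (the summary does not say whether such trajectories are dropped or only the commands stripped), and the `issue_free_fraction` value, which the summary does not report.

```python
import random
import re
from dataclasses import dataclass, replace
from typing import List

# Matches the two history-inspection commands named in the filtering step.
FORBIDDEN = re.compile(r"\bgit\s+(log|show)\b")

@dataclass(frozen=True)
class Trajectory:
    issue: str            # natural-language issue description ("" if absent)
    commands: List[str]   # bash commands issued by the teacher
    resolved: bool        # whether the final patch passed the tests

def uses_git_history(traj: Trajectory) -> bool:
    return any(FORBIDDEN.search(cmd) for cmd in traj.commands)

def build_sft_set(trajs, issue_free_fraction=0.5, seed=0):
    """Keep correct, leak-free trajectories, then blank the issue on a random subset."""
    kept = [t for t in trajs if t.resolved and not uses_git_history(t)]
    rng = random.Random(seed)
    n_blank = int(len(kept) * issue_free_fraction)
    blank_idx = set(rng.sample(range(len(kept)), n_blank))
    return [replace(t, issue="") if i in blank_idx else t
            for i, t in enumerate(kept)]

raw = [
    Trajectory("TypeError in handler", ["pytest -q", "sed -i s/a/b/ f.py"], True),
    Trajectory("Crash on save", ["git log --oneline", "pytest -q"], True),
    Trajectory("Wrong output", ["pytest -q"], False),
    Trajectory("Bad encoding", ["python repro.py", "pytest -q"], True),
]
sft_set = build_sft_set(raw, issue_free_fraction=0.5, seed=0)
```

With this toy input, two trajectories survive filtering and one of them has its issue text blanked.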

Key Hyperparameters:

interaction_turns_max: 100
rl_baseline: RLOO (REINFORCE Leave-One-Out)
clipping_radius: Adaptive (epsilon_min to epsilon_max based on entropy)
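
The RLOO baseline listed above is simple to compute: each sample's baseline is the mean reward of the other samples in its group. The sketch below assumes binary rewards (1 if the patch passes the tests, 0 otherwise) and a group of rollouts for the same task; the function name and the example rewards are illustrative.

```python
def rloo_advantages(rewards):
    """Leave-one-out advantages: reward minus the mean reward of the other samples."""
    k = len(rewards)
    if k < 2:
        raise ValueError("RLOO needs at least two samples per group")
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]

# Four rollouts for one task: two solved it, two did not.
advantages = rloo_advantages([1.0, 0.0, 0.0, 1.0])
```

Successful rollouts receive a positive advantage and failed ones a negative advantage, and the advantages sum to zero within the group, so no learned value network is needed.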

Comparison to Prior Work

vs. SWE-agent: SWE-Fuse uses 'issue-free' training to reduce reliance on descriptions, whereas SWE-agent relies on standard issue-context
vs. Standard RLVR/PPO: SWE-Fuse uses entropy-aware dynamic clipping instead of fixed clipping radius to handle exploration-exploitation balance
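
The contrast between a fixed and a widened clipping radius can be seen on a single token. The sketch below implements the standard PPO clipped surrogate term; the ratio, advantage, and the two radii (0.2 and 0.3) are illustrative values, not settings reported in the paper.

```python
def clipped_surrogate(ratio, advantage, eps):
    """Per-token PPO surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped_ratio * advantage)

# Same token, same advantage, two clipping radii.
tight = clipped_surrogate(ratio=1.25, advantage=1.0, eps=0.2)  # ratio exceeds 1 + eps
loose = clipped_surrogate(ratio=1.25, advantage=1.0, eps=0.3)  # ratio inside the range
```

With the tighter radius the term is capped at `1 + eps`, so its gradient with respect to the ratio is zero and the token stops contributing to the update. With the wider radius the same token is still inside the range and keeps pushing the policy, which is the extra room for exploration that a high-entropy state would be granted.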

Limitations

Requires executable environments (Docker) for all training samples, which is computationally heavy
Effectiveness depends on the quality of test cases available in the repository (if tests are poor, issue-free learning fails)
High-quality issue-PR pairs are difficult to acquire at scale

Reproducibility

The code repository link in the paper appears to be a placeholder ('https://github.com/codefuse-ai/xxx'). Base Docker images from SWE-smith are used. A trajectory dataset of 14k samples is introduced, but its availability is tied to the repository.

📊 Experiments & Results

Evaluation Setup

End-to-end software issue resolution in verified GitHub repositories

Benchmarks:

SWE-bench Verified (Automated Bug Repair)

Metrics:

Solve Rate (% of issues where generated patch passes tests)
Statistical methodology: Not explicitly reported in the paper

Key Results

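Internal scaling experiments demonstrate the benefit of Test-Time Scaling (TTS) on top of the SWE-Fuse trained models. Baseline is the trained model without TTS; This Paper is the same model with TTS@8.

Benchmark	Model	Metric	Baseline	This Paper	Δ
SWE-bench Verified	SWE-Fuse-Qwen3-8B	Solve Rate (%)	43.0	49.8	+6.8
SWE-bench Verified	SWE-Fuse-Qwen3-32B	Solve Rate (%)	60.2	65.2	+5.0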

Experiment Figures

A case study illustrating the 'Misalignment' problem in SWE-bench.

Main Takeaways

SWE-Fuse establishes a new state-of-the-art for open-source 32B models with a 60.2% solve rate on SWE-bench Verified.
The issue-free training strategy effectively mitigates the impact of noisy or misleading issue descriptions, allowing the model to learn robust debugging.
Entropy-aware RLVR stabilizes training by adjusting the trust region based on model uncertainty, preventing over-penalization of exploration.
Test-time scaling (TTS@8) provides consistent gains of +5.0 to +6.8 percentage points across model sizes, suggesting the model produces diverse correct solutions.
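
The summary does not state how the TTS@8 candidates are ranked. As one possible selection rule, the sketch below picks the patch that the most candidates agree on after whitespace normalization. The function names and the toy patches are hypothetical, and a test-execution or learned-verifier selector would be an equally plausible reading.

```python
from collections import Counter

def patch_key(patch):
    """Comparison key: drop blank lines and trailing whitespace, keep leading spaces."""
    return "\n".join(ln.rstrip() for ln in patch.splitlines() if ln.strip())

def select_by_vote(patches):
    """Return the first patch whose normalized form is most common, plus its vote count."""
    if not patches:
        raise ValueError("need at least one candidate patch")
    keys = [patch_key(p) for p in patches]
    winner_key, votes = Counter(keys).most_common(1)[0]
    return patches[keys.index(winner_key)], votes

candidates = ["-a\n+b\n", "-a\n+b", "-a\n+c\n"]  # three toy candidate patches
best, votes = select_by_vote(candidates)
```

Voting only helps when several independent samples converge on the same fix, which is consistent with the takeaway that gains from TTS reflect diverse but frequently correct samples.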

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (PPO)
LLM Agent Frameworks (ReAct)
Software Engineering workflows (Git, CI/CD)

Key Terms

RLVR: Reinforcement Learning with Verifiable Reward—using objective outcomes (like passing unit tests) as the reward signal for RL training

Issue-free: A training setup where the natural language problem description is removed, forcing the agent to identify the bug using only the provided codebase and failing test cases

Entropy-aware clipping: Modifying the PPO trust region size based on the model's predictive uncertainty (entropy); high entropy allows larger updates (exploration), low entropy enforces smaller updates (stability)

ReAct: Reason+Act—a paradigm where the model alternates between generating a thought (reasoning trace) and executing an action (tool call)

SWE-bench Verified: A subset of the SWE-bench dataset containing real-world GitHub issues and pull requests, filtered for quality and reproducibility

RLOO: REINFORCE Leave-One-Out, a variance-reduction baseline for policy gradient methods where the baseline for a sample is the mean reward of the other samples drawn for the same prompt

TTS: Test-Time Scaling—generating multiple candidate solutions at inference time and selecting the best one (often via voting or test execution)

PPO: Proximal Policy Optimization, an RL algorithm that clips the policy probability ratio so each update stays close to the previous policy (an approximate 'trust region'), preventing destructively large or unstable updates

SFT: Supervised Fine-Tuning—training the model on expert demonstration trajectories before applying RL