SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution

📝 Paper Summary

Software Engineering (SE) for LLMs · Reasoning via Reinforcement Learning (RL)

SWE-RL improves LLM reasoning by applying reinforcement learning with rule-based rewards to open-source software evolution data (pull requests). It reaches 41.0% on SWE-bench Verified, the best result among open models under 100B parameters, without distillation from proprietary models.

Core Problem

Current LLM approaches to software engineering rely heavily on proprietary models (e.g., GPT-4) or supervised fine-tuning (SFT). They do not self-improve their reasoning through RL, largely because execution-based rewards are costly to obtain.

Why it matters:

Proprietary models limit accessibility and transparency in software engineering research
Existing open models struggle with complex real-world issue solving compared to closed models
Standard RL approaches (as used for math and competitive coding) rely on execution feedback, which is costly or unavailable for partial repository contexts

Concrete Example: When given a GitHub issue, a standard Llama-3 model often fails to locate the bug or generates incorrect patch formats. In contrast, SWE-RL, trained on PR histories, learns to 'think' longer about fault locations and produces valid patches by maximizing similarity to ground-truth developer edits.

Key Novelty

SWE-RL (Reinforcement Learning on Software Evolution)

Train an LLM using RL (GRPO) directly on historical Pull Request (PR) data, treating the developer's final patch as the ground truth oracle
Use a lightweight rule-based reward (sequence similarity between predicted and oracle patch) instead of expensive execution feedback or binary correctness
Include full file contents in the prompt to implicitly force the model to perform fault localization and reasoning before editing
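The rule-based reward described above can be sketched in a few lines with Python's `difflib`. This is a minimal illustration, not the authors' implementation: the function name `patch_similarity_reward` is hypothetical, and the fixed `-1.0` penalty for unusable output is an assumption, since this summary does not state how malformed responses are scored.

```python
import difflib


def patch_similarity_reward(predicted_patch, oracle_patch):
    """Score a predicted patch by its textual similarity to the oracle patch."""
    # Assumption: empty or missing output receives a fixed penalty.
    # The penalty value is illustrative; the summary does not specify it.
    if predicted_patch is None or not predicted_patch.strip():
        return -1.0
    # ratio() returns a float in [0, 1]; 1.0 means the strings are identical.
    matcher = difflib.SequenceMatcher(None, predicted_patch, oracle_patch)
    return matcher.ratio()


oracle = "-    return a - b\n+    return a + b\n"
exact = patch_similarity_reward(oracle, oracle)
near = patch_similarity_reward("-    return a - b\n+    return b + a\n", oracle)
bad = patch_similarity_reward("", oracle)
print(exact, round(near, 3), bad)
```

Because the score is a plain string comparison, it needs no test execution or repository checkout, which is what makes it cheap enough to compute for every rollout.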

Architecture

Overview of the SWE-RL framework: from PR data collection to RL training loop and final inference.

Evaluation Highlights

Achieves 41.0% pass@1 on SWE-bench Verified, the best performance among open models <100B parameters and comparable to GPT-4o
Demonstrates generalization: +6.3 points on HumanEval+ and +4.2 points on MBPP+ over the base model, despite training only on issue solving
Surpasses supervised fine-tuning (SFT) baselines on both in-domain (SWE-bench) and out-of-domain tasks (MATH, CRUXEval)

Breakthrough Assessment

9/10

First successful application of RL solely on software evolution data to reach state-of-the-art results among open models under 100B parameters on SWE-bench Verified. Shows that RL on SE data generalizes to math and code reasoning, a major finding parallel to DeepSeek-R1 but obtained with open data.

⚙️ Technical Details

Problem Definition

Setting: Given an issue description and code context (file contents), generate a patch that resolves the issue

Inputs: Issue description I, Code Context C (full content of relevant files)

Outputs: Patch P (unified diff format)
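For readers unfamiliar with the output format, the snippet below shows what a unified diff looks like by generating one with the standard library. The file name `calc.py` and the buggy function are made up for illustration; they do not come from the paper.

```python
import difflib

# Hypothetical file contents before and after a one-line bug fix.
before = ["def add(a, b):\n", "    return a - b\n"]
after = ["def add(a, b):\n", "    return a + b\n"]

# unified_diff yields header lines, a hunk marker, and the changed lines.
patch = "".join(
    difflib.unified_diff(before, after, fromfile="a/calc.py", tofile="b/calc.py")
)
print(patch)
```

Lines starting with `-` are removed, lines starting with `+` are added, and unprefixed lines are unchanged context.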

Pipeline Flow

Localization (Map issue to files)
Repair (Generate patch via LLM)

System Modules

Agentless Mini (Localization)

Identify relevant files to be edited based on the issue description

Model or implementation: Rules/Heuristics (Simplified from Agentless)

Repair Policy

Reason about the bug and generate a unified diff patch

Model or implementation: Llama3-SWE-RL-70B

Modeling

Base Model: Llama-3.3-70B-Instruct

Training Method: Group Relative Policy Optimization (GRPO)

Objective Functions:

Purpose: Maximize expected reward while staying close to reference policy.

Formally: GRPO objective with KL divergence penalty
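For reference, GRPO is commonly written in the following form. This is the general textbook-style statement, given here as an aid; the exact variant and coefficient values used in the paper are not stated in this summary. Here $q$ is the prompt (issue plus code context), $o_1, \dots, o_G$ are the $G$ outputs sampled for that prompt, $r_i$ is the reward of output $o_i$, and $\epsilon$ and $\beta$ are the clipping range and KL coefficient.

```latex
\[
\mathcal{J}(\theta) =
\mathbb{E}_{q,\ \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\text{old}}}(\cdot \mid q)}
\left[
  \frac{1}{G} \sum_{i=1}^{G}
  \left(
    \min\!\left( \rho_i A_i,\ \operatorname{clip}(\rho_i,\, 1-\epsilon,\, 1+\epsilon)\, A_i \right)
    - \beta\, D_{\mathrm{KL}}\!\left( \pi_\theta \,\|\, \pi_{\text{ref}} \right)
  \right)
\right]
\]
\[
\rho_i = \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\text{old}}}(o_i \mid q)},
\qquad
A_i = \frac{r_i - \operatorname{mean}(r_1, \dots, r_G)}{\operatorname{std}(r_1, \dots, r_G)}
\]
```

The first term pushes probability toward outputs that score above their group's average, and the KL term keeps the policy close to the reference model.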

Training Data:

273k high-quality PR seeds extracted from GitHub
Filtered for issues describing bugs and changes involving programming files
Includes issue description, code context, and oracle patch

Key Hyperparameters:

context_window: 16k
global_batch_size: 512
learning_rate: not explicitly reported in the paper (presumably a standard configuration)
steps: 1600
group_size: 16

Compute: 512 NVIDIA H100 GPUs for approx. 32 hours
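A few derived quantities follow from the figures above by simple arithmetic. The sketch assumes that the global batch size counts individual rollouts rather than distinct prompts; the summary does not say which, so treat the per-step prompt count as an inference.

```python
# Values as listed in this summary.
global_batch_size = 512
group_size = 16
steps = 1600
num_gpus = 512
hours = 32

# Assumption: the global batch counts rollouts, so each step covers
# global_batch_size / group_size distinct prompts.
prompts_per_step = global_batch_size // group_size   # 32
total_rollouts = global_batch_size * steps           # 819200
gpu_hours = num_gpus * hours                         # 16384

print(prompts_per_step, total_rollouts, gpu_hours)
```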

Comparison to Prior Work

vs. DeepSeek-R1: Uses non-execution rule-based rewards (patch similarity) instead of execution feedback
vs. Agentless: Adds RL training to the underlying model specifically for repair, simplifying the pipeline
vs. SWE-Gym: Does not require execution environment during training; trains on static PR data
vs. OpenDevin [not cited in paper]: Focuses on enhancing the model's intrinsic reasoning via RL rather than building complex agentic loops

Limitations

Reward function (string similarity) may not capture semantic equivalence of code
Localization in Agentless Mini is simplified and might miss context compared to full multi-step retrieval
High computational cost for training (512 H100s)
Pipeline-based approach lacks interactive feedback loop during inference

Reproducibility

Code: https://github.com/facebookresearch/SWE-RL

Code and models are publicly available. Dataset curation details are provided in the paper's appendix. The method uses standard open models (Llama-3) and standard libraries (difflib).

📊 Experiments & Results

Evaluation Setup

Solve real-world GitHub issues given issue descriptions and codebase

Benchmarks:

SWE-bench Verified (Real-world Software Issue Resolving)
HumanEval+ (Function-level Code Generation)
MBPP+ (Function-level Code Generation)
CRUXEval (Code Reasoning)
MATH (Mathematical Reasoning)

Metrics:

Pass@1 (Solve Rate)
Format Accuracy

Key Results

Performance on real-world software engineering benchmark (SWE-bench Verified).
Benchmark	Metric	Baseline (%)	This Paper (%)	Δ (points)
SWE-bench Verified	Pass@1	24.6	41.0	+16.4
SWE-bench Verified	Pass@1	42.0	41.0	-1.0
SWE-bench Verified	Pass@1	35.2	41.0	+5.8

Generalization to out-of-domain reasoning tasks (Coding & Math).
Benchmark	Metric	Baseline (%)	This Paper (%)	Δ (points)
HumanEval+	Pass@1	73.2	79.5	+6.3
MATH	Accuracy	63.2	66.4	+3.2
CRUXEval	Accuracy	83.5	86.5	+3.0

Experiment Figures

Example of an 'Aha Moment' where the model self-corrects during reasoning.

Comparison of Continuous (similarity) vs Discrete (exact match) rewards during training.

Main Takeaways

RL on software evolution data induces generalized reasoning capabilities, improving performance on math and general coding tasks.
Continuous rewards (similarity score) are more effective than discrete rewards (exact match) for training on diverse real-world patches.
Scaling inference compute (more samples/tests) continues to improve performance, with significant gains up to 160 samples.
The model exhibits 'aha moments' (self-reflection and long-horizon reasoning) during training, similar to DeepSeek-R1, but using only static code data.
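The takeaway about continuous versus discrete rewards can be made concrete with a toy comparison. The patches and function names below are invented for illustration. The point is that an exact-match reward scores a near-miss the same as an unrelated edit, while a similarity score separates them.

```python
import difflib


def discrete_reward(pred: str, oracle: str) -> float:
    # Exact match: full credit only for a character-identical patch.
    return 1.0 if pred == oracle else 0.0


def continuous_reward(pred: str, oracle: str) -> float:
    # Similarity score in [0, 1]: partial credit for partially correct patches.
    return difflib.SequenceMatcher(None, pred, oracle).ratio()


oracle = "+    if user is None:\n+        return None\n"
candidates = {
    "exact": oracle,
    "near_miss": "+    if user is None:\n+        return\n",
    "unrelated": "+    print('hello')\n",
}

scores = {
    name: (discrete_reward(pred, oracle), continuous_reward(pred, oracle))
    for name, pred in candidates.items()
}
for name, (disc, cont) in scores.items():
    print(f"{name}: discrete={disc}, continuous={cont:.3f}")
```

With the discrete reward, a group of rollouts that are all slightly wrong carries no learning signal, since every reward is zero. The continuous score still ranks them.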

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (RL) fundamentals
Software Engineering concepts (PRs, patches, issues)
Large Language Model (LLM) fine-tuning

Key Terms

GRPO: Group Relative Policy Optimization—an RL algorithm that normalizes rewards within a group of outputs for the same input to stabilize training
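The within-group normalization mentioned in this definition can be sketched as follows. The function name, the use of the population standard deviation, and the small `eps` guard are assumptions for illustration; implementations differ on these details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards of outputs sampled for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population standard deviation
    # eps avoids division by zero when every output gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four rollouts of one prompt.
rewards = [0.9, 0.4, 0.4, -1.0]
advantages = group_relative_advantages(rewards)
print([round(a, 3) for a in advantages])
```

Outputs that beat the group average get a positive advantage and the rest get a negative one, so no separately trained value model is needed to provide a baseline.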

Oracle Patch: The actual code changes merged by developers in a Pull Request, used as the ground truth for training

SWE-bench: A benchmark for evaluating LLMs on real-world GitHub issues

Pass@1: The percentage of problems where the model's top-ranked solution passes all test cases

SFT: Supervised Fine-Tuning—training a model on labeled examples using standard likelihood maximization

SequenceMatcher: A class in Python's difflib module whose ratio() method returns a similarity score in the range [0, 1] for two sequences (here, code strings)

Aha moment: A point during training where the model exhibits emergent reasoning behaviors (like self-reflection) not explicitly programmed