SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?

📝 Paper Summary

Software Engineering Agents, Code Generation Benchmarks, Agentic Evaluation

SWE-Bench Pro is a contamination-resistant benchmark of 1,865 complex, human-verified software engineering tasks sourced from copyleft and private commercial repositories, revealing that current agents solve fewer than 45% of problems.

Core Problem

Existing coding benchmarks like SWE-Bench suffer from data contamination (public repos are in training data) and lack industrial complexity, often featuring trivial one-line fixes that do not reflect enterprise engineering challenges.

Why it matters:

Contamination allows models to memorize solutions rather than generalize, inflating performance metrics
Current benchmarks like SWE-Bench Verified contain many trivial problems (161/500 require 1-2 lines) that fail to test long-horizon reasoning needed for real work
Enterprise software engineering requires multi-file edits and handling ambiguity, which current academic benchmarks fail to simulate adequately

Concrete Example: In SWE-Bench Verified, a task might only require a single-line change. In contrast, SWE-Bench Pro tasks average 107.4 lines of code changes across 4.1 files, often requiring the agent to navigate complex B2B logic or UI state management that simple retrieval cannot solve.
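To make the size statistic concrete, the sketch below counts touched files and changed lines in a unified diff. It is an illustration only, not the paper's measurement script: the name `patch_stats` and the counting convention (added plus removed lines, file headers excluded) are assumptions.

```python
import re
from typing import NamedTuple

# Matches hunk headers such as "@@ -12,7 +12,9 @@"; a missing count means 1.
HUNK_RE = re.compile(r"^@@ -\d+(?:,(\d+))? \+\d+(?:,(\d+))? @@")


class PatchStats(NamedTuple):
    files_changed: int
    lines_changed: int


def patch_stats(diff_text: str) -> PatchStats:
    """Count touched files and added/removed lines in a unified diff.

    Assumption: "lines changed" = added + removed lines. The exact
    convention behind the 107.4-line average is not stated in this summary.
    """
    files = set()
    changed = 0
    old_path = ""
    old_left = new_left = 0  # lines still expected in the current hunk
    for line in diff_text.splitlines():
        if old_left > 0 or new_left > 0:
            # Inside a hunk: classify by the first character.
            if line.startswith("+"):
                changed += 1
                new_left -= 1
            elif line.startswith("-"):
                changed += 1
                old_left -= 1
            elif not line.startswith("\\"):  # skip "\ No newline at end of file"
                old_left -= 1  # context line counts on both sides
                new_left -= 1
            continue
        match = HUNK_RE.match(line)
        if match:
            old_left = int(match.group(1) or 1)
            new_left = int(match.group(2) or 1)
        elif line.startswith("--- "):
            old_path = line[4:].split("\t")[0].strip()
        elif line.startswith("+++ "):
            new_path = line[4:].split("\t")[0].strip()
            # A deleted file has "+++ /dev/null"; fall back to the old path.
            files.add(new_path if new_path != "/dev/null" else old_path)
    return PatchStats(len(files), changed)


# Hypothetical two-file patch used only to exercise the function.
EXAMPLE_DIFF = """--- a/app.py
+++ b/app.py
@@ -1,3 +1,3 @@
 import os
-x = 1
+x = 2
 print(x)
--- a/util.py
+++ b/util.py
@@ -0,0 +1,2 @@
+def f():
+    return 1
"""

stats = patch_stats(EXAMPLE_DIFF)
```

Tracking the hunk line counts, rather than only checking line prefixes, keeps a removed line whose content begins with `-- ` from being mistaken for a file header.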

Key Novelty

Contamination-Resistant, Enterprise-Grade Benchmark Construction

Constructs a dataset using only strong copyleft (GPL) repositories and private commercial codebases purchased from startups to prevent training data leakage
Implements a human-in-the-loop augmentation process where experts rewrite issue descriptions, add requirements, and verify unit tests to ensure resolvability without ambiguity
Focuses exclusively on long-horizon tasks requiring substantial edits (average 100+ lines), rejecting trivial fixes to stress-test agent planning capabilities

Evaluation Highlights

State-of-the-art coding models achieve less than 45% Pass@1 on SWE-Bench Pro, indicating a significant capability gap for enterprise tasks
The benchmark includes 1,865 total problems, with a 'Commercial' subset of 276 problems from private startup repositories to strictly test generalization
Reference solutions involve substantial complexity, averaging 107.4 lines of code changes across 4.1 files per task

Breakthrough Assessment

9/10

Addresses the critical 'contamination' crisis in coding benchmarks by using private/GPL data and significantly raises the difficulty ceiling to match real industrial work. Likely to become the new standard for serious agent evaluation.

⚙️ Technical Details

Problem Definition

Setting: Repository-level issue resolution. Given a codebase and a natural language task description (issue + requirements), the agent must generate a patch that passes a hidden test suite.

Inputs: Entire codebase (files), Task Description (Problem Statement, Requirements, Interface definitions)

Outputs: A unified diff (patch) file applying changes to the codebase
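The input and output shapes described above can be sketched as follows. This is a minimal illustration under stated assumptions: the class `TaskInstance`, its field names, and the helper `make_patch` are hypothetical and do not reflect the benchmark's actual schema. Only `difflib.unified_diff` is a real standard-library call.

```python
import difflib
from dataclasses import dataclass, field
from typing import List


@dataclass
class TaskInstance:
    """Hypothetical container for the inputs listed above."""
    repo_path: str                      # location of the checked-out codebase
    problem_statement: str              # the (human-rewritten) issue text
    requirements: List[str] = field(default_factory=list)
    interface: str = ""                 # expected signatures, if any

    def prompt(self) -> str:
        """Join the task description parts into one text block."""
        parts = [self.problem_statement]
        if self.requirements:
            parts.append("Requirements:\n" + "\n".join(f"- {r}" for r in self.requirements))
        if self.interface:
            parts.append("Interface:\n" + self.interface)
        return "\n\n".join(parts)


def make_patch(path: str, before: str, after: str) -> str:
    """Render a single-file edit as a unified diff (the output format)."""
    diff = difflib.unified_diff(
        before.splitlines(keepends=True),
        after.splitlines(keepends=True),
        fromfile=f"a/{path}",
        tofile=f"b/{path}",
    )
    return "".join(diff)


# Hypothetical example: a one-line bug fix in calc.py.
task = TaskInstance(
    repo_path="/workspace/repo",
    problem_statement="add() subtracts instead of adding.",
    requirements=["add(a, b) must return a + b"],
)
patch = make_patch(
    "calc.py",
    "def add(a, b):\n    return a - b\n",
    "def add(a, b):\n    return a + b\n",
)
```

A real submission would usually span several files; concatenating the per-file diffs yields one patch file.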

Pipeline Flow

Task Input (Issue + Requirements)
Agent Scaffold (SWE-Agent)
Environment Execution (Docker)
Evaluation (Test Suite)

System Modules

Task Description Provider

Provides human-augmented context including problem statement, explicit requirements, and interface expectations

Model or implementation: Human-curated

Agent Scaffold

Interacts with the codebase to generate a patch

Model or implementation: Various (e.g., GPT-4o, Claude 3.5 Sonnet) via SWE-Agent scaffold

Evaluation Harness

Runs the generated patch against the test suite in a containerized environment

Model or implementation: Docker-based execution environment
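A plausible resolution check for the harness, using the fail2pass and pass2pass test sets defined under Key Terms, is sketched below. It assumes an instance counts as resolved only when every test in both sets passes after the patch is applied, and that a test missing from the results counts as a failure. The function name and the result format (test name mapped to a boolean) are hypothetical.

```python
from typing import Dict, Iterable


def is_resolved(
    results_after: Dict[str, bool],
    fail2pass: Iterable[str],
    pass2pass: Iterable[str],
) -> bool:
    """Return True if all required tests pass after applying the patch.

    results_after maps a test name to True (passed) or False (failed).
    Tests absent from the mapping (e.g. not collected) count as failures.
    """
    required = list(fail2pass) + list(pass2pass)
    return all(results_after.get(name, False) for name in required)


# Hypothetical results: the bug-fix test now passes, but a regression appeared.
example = is_resolved(
    {"test_fix": True, "test_existing": False},
    fail2pass=["test_fix"],
    pass2pass=["test_existing"],
)
```

Requiring both sets is what separates a genuine fix from a patch that makes the target test pass by breaking unrelated behavior.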

Novel Architectural Elements

Data Curation Architecture: A three-tier dataset structure (Public/Copyleft, Held-Out/Copyleft, Commercial/Private) specifically designed to defeat contamination
Human-in-the-loop Augmentation Pipeline: Systematic injection of 'Requirements' and 'Interface' specifications into task descriptions to decouple 'ambiguity resolution' from 'coding ability'

Modeling

Base Model: Various models evaluated (e.g., GPT-4o, Claude 3.5 Sonnet)

Training Method: Evaluation only (Paper focuses on benchmark creation)

Benchmark Data (used for evaluation only; no training is performed):

1,865 total problems
Public Set: 731 instances (GPL repos)
Commercial Set: 276 instances (Private startup repos)
Held-Out Set: 858 instances (GPL repos, kept private)
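As a quick consistency check, the three split sizes above sum to the stated total, and their shares can be computed directly. The dictionary keys below are illustrative labels, not identifiers from the dataset.

```python
# Split sizes as reported in this summary.
SPLITS = {"public": 731, "commercial": 276, "held_out": 858}

total = sum(SPLITS.values())

# Share of each split as a percentage of all problems.
shares = {name: 100.0 * count / total for name, count in SPLITS.items()}

for name, share in shares.items():
    print(f"{name}: {SPLITS[name]} problems ({share:.1f}%)")
```

Only the public split is released, so roughly three in five problems remain private.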

Key Hyperparameters:

max_turns: 50
scaffold: SWE-Agent
temperature: Not reported in the paper
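The `max_turns` budget can be pictured as a bounded interaction loop. The sketch below is not SWE-Agent's actual API: `run_episode` and the `step` callable (a stand-in for one model call plus tool execution) are hypothetical, and only the limit of 50 comes from the settings above.

```python
from typing import Callable, Optional


def run_episode(
    step: Callable[[int], Optional[str]],
    max_turns: int = 50,
) -> Optional[str]:
    """Call step(turn) until it submits a patch or the turn budget runs out.

    step returns a patch string when the agent submits, otherwise None.
    Returns None if no patch was submitted within max_turns turns.
    """
    for turn in range(max_turns):
        patch = step(turn)
        if patch is not None:
            return patch
    return None


# Hypothetical agent that explores for three turns, then submits.
def toy_step(turn: int) -> Optional[str]:
    return "--- a/f.py\n+++ b/f.py\n" if turn == 3 else None


submitted = run_episode(toy_step)
```

An episode that exhausts the budget without submitting is scored as unresolved under this sketch, which is one reason long-horizon tasks are expensive to evaluate.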

Compute: Open-weight models were hosted on a single node with 8 NVIDIA H100 GPUs

Comparison to Prior Work

vs. SWE-bench: SWE-Bench Pro uses GPL/private repos to avoid contamination and ensures tasks are long-horizon (100+ lines vs often trivial edits)
vs. SWE-bench Verified: SWE-Bench Pro tasks are significantly harder (multi-file, complex logic) compared to the many 1-2 line fixes in Verified
vs. LiveBench [not cited in paper]: Similar goal of avoiding contamination, but SWE-Bench Pro focuses specifically on repo-level engineering rather than general Q&A/puzzles

Limitations

Evaluation costs are high due to the long-horizon nature of tasks (requiring many turns)
Strict requirement for human verification limits the scale of the dataset compared to purely scraped benchmarks
Focus on 'resolvability' by adding explicit requirements might reduce the realism of handling vague stakeholder requests
Results on Commercial and Held-Out sets are reported but cannot be independently verified by the community

Reproducibility

Code: https://github.com/swe-bench/swe-bench

Public set (731 problems) is released on HuggingFace. Docker environments for reproduction are provided. Commercial and Held-Out sets are not released to maintain contamination resistance. Evaluation uses the standard SWE-Agent scaffold.

📊 Experiments & Results

Evaluation Setup

Agent-based software issue resolution using the SWE-Agent scaffold

Benchmarks:

SWE-Bench Pro (Repository-level issue resolution) [New]

Metrics:

Pass@1 (Percentage of issues resolved)
Statistical methodology: Not explicitly reported in the paper
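With one attempt per problem, Pass@1 reduces to the fraction of resolved instances. A minimal sketch, with a hypothetical function name and input format (one boolean per problem):

```python
from typing import Sequence


def pass_at_1(resolved: Sequence[bool]) -> float:
    """Percentage of problems whose single submitted patch passed all tests."""
    if not resolved:
        raise ValueError("need at least one problem")
    return 100.0 * sum(resolved) / len(resolved)


# Hypothetical outcomes for four problems, two of them resolved.
score = pass_at_1([True, False, False, True])
```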

Key Results

Evaluation of leading models on the new benchmark shows that even the best models stay below 45% Pass@1, confirming the task difficulty.

Benchmark	Metric	Baseline	This Paper	Δ
SWE-Bench Pro	Pass@1	Not reported in the paper	<45.0	Not applicable

Main Takeaways

Current state-of-the-art agents achieve <45% Pass@1, confirming SWE-Bench Pro is significantly harder than previous iterations
Agentless scaffold performs poorly on SWE-Bench Pro due to the multi-file nature of edits, necessitating full agentic loops like SWE-Agent
The benchmark effectively mitigates contamination through the use of private and strong-copyleft repositories

📚 Prerequisite Knowledge

Prerequisites

Understanding of Git workflows (commits, diffs, PRs)
Familiarity with Unit Testing (fail2pass vs pass2pass)
Knowledge of LLM agent scaffolds (like SWE-Agent)

Key Terms

fail2pass: Tests that fail before the fix is applied and must pass after the fix (verifying the bug is resolved)

pass2pass: Tests that pass before and after the fix (regression tests ensuring no existing functionality is broken)

copyleft: A licensing scheme (e.g., GPL) requiring derivative works to be released under the same license terms; used here to select repos that are less likely to appear in proprietary commercial training sets

GPL: GNU General Public License, a strong copyleft license used to select repositories for the public and held-out sets

scaffold: The software framework wrapping an LLM that allows it to interact with tools, file systems, and environments (e.g., SWE-Agent)

Pass@1: The percentage of problems where the model's single generated solution successfully passes all tests

held-out set: A portion of the dataset kept private and never released to prevent future models from training on it

agent trajectory: The sequence of actions (commands, reads, edits) an agent takes while attempting to solve a task