โ† Back to Paper List

SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?

Xiang Deng, Jeff Da, Edwin Pan, Yan He, Charles Ide, Kanak Garg, Niklas Lauffer, Andrew Park, Nitin Pasari, Chetan Rane, Karmini Sampath, Maya Krishnan, Srivatsa Kundurthy, Sean M. Hendryx, Zifan Wang, Chen Bo Calvin Zhang, Noah Jacobson, Bing Liu, Brad Kenstler
Scale AI
arXiv.org (2025)
Benchmark Agent

๐Ÿ“ Paper Summary

Software Engineering Agents Code Generation Benchmarks Agentic Evaluation
SWE-Bench Pro is a contamination-resistant benchmark of 1,865 complex, human-verified software engineering tasks sourced from copyleft and private commercial repositories, revealing that current agents solve fewer than 45% of problems.
Core Problem
Existing coding benchmarks like SWE-Bench suffer from data contamination (public repos are in training data) and lack industrial complexity, often featuring trivial one-line fixes that do not reflect enterprise engineering challenges.
Why it matters:
  • Contamination allows models to memorize solutions rather than generalize, inflating performance metrics
  • Current benchmarks like SWE-Bench Verified contain many trivial problems (161 of 500, about 32%, require changes of only 1-2 lines) that fail to test the long-horizon reasoning needed for real work
  • Enterprise software engineering requires multi-file edits and handling ambiguity, which current academic benchmarks fail to simulate adequately
Concrete Example: In SWE-Bench Verified, a task might only require a single-line change. In contrast, SWE-Bench Pro tasks average 107.4 lines of code changes across 4.1 files, often requiring the agent to navigate complex B2B logic or UI state management that simple retrieval cannot solve.
Key Novelty
Contamination-Resistant, Enterprise-Grade Benchmark Construction
  • Constructs a dataset using only strong copyleft (GPL) repositories and private commercial codebases purchased from startups to prevent training data leakage
  • Implements a human-in-the-loop augmentation process where experts rewrite issue descriptions, add requirements, and verify unit tests to ensure resolvability without ambiguity
  • Focuses exclusively on long-horizon tasks requiring substantial edits (average 100+ lines), rejecting trivial fixes to stress-test agent planning capabilities
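The selection criteria above can be illustrated with a minimal sketch. It is not the authors' pipeline: the `Candidate` fields, the license identifiers in `COPYLEFT_LICENSES`, and the `min_lines` threshold are all hypothetical, chosen only to show how a source filter and a size filter might be combined.

```python
from dataclasses import dataclass

# Assumption: illustrative license identifiers, not the paper's actual list.
COPYLEFT_LICENSES = {"GPL-2.0", "GPL-3.0"}


@dataclass(frozen=True)
class Candidate:
    """Hypothetical record for one candidate task."""
    repo: str
    license: str                 # license identifier, or "proprietary"
    is_private_commercial: bool  # True for a privately acquired codebase
    lines_changed: int           # size of the reference patch
    files_changed: int


def is_eligible(c: Candidate, min_lines: int = 10) -> bool:
    """Keep tasks from contamination-resistant sources with non-trivial patches.

    min_lines is an assumed cutoff used only for illustration.
    """
    contamination_resistant = (
        c.is_private_commercial or c.license in COPYLEFT_LICENSES
    )
    return contamination_resistant and c.lines_changed >= min_lines


candidates = [
    Candidate("gpl-project", "GPL-3.0", False, 120, 4),
    Candidate("mit-project", "MIT", False, 200, 6),          # permissive license
    Candidate("startup-app", "proprietary", True, 95, 3),
    Candidate("gpl-typo-fix", "GPL-2.0", False, 1, 1),       # trivial one-line fix
]
eligible = [c.repo for c in candidates if is_eligible(c)]
print(eligible)
```

In this sketch the permissively licensed repository is dropped on the source criterion and the one-line fix on the size criterion, leaving the GPL project and the private startup codebase.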
Evaluation Highlights
  • State-of-the-art coding models achieve less than 45% Pass@1 on SWE-Bench Pro, indicating a significant capability gap for enterprise tasks
  • The benchmark includes 1,865 total problems, with a 'Commercial' subset of 276 problems from private startup repositories to strictly test generalization
  • Reference solutions involve substantial complexity, averaging 107.4 lines of code changes across 4.1 files per task
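For reference, Pass@1 with one attempt per task reduces to the fraction of tasks whose tests pass on that attempt. The sketch below assumes a simple mapping from task ID to a resolved flag; the task IDs and outcomes are made up for illustration and are not results from the paper. The subset share at the end uses only the problem counts quoted above.

```python
from typing import Mapping


def pass_at_1(resolved: Mapping[str, bool]) -> float:
    """Fraction of tasks resolved on a single attempt per task."""
    if not resolved:
        raise ValueError("no results to score")
    return sum(resolved.values()) / len(resolved)


# Hypothetical outcomes: True means the task's tests passed.
results = {"task-a": True, "task-b": False, "task-c": False, "task-d": True}
score = pass_at_1(results)
print(f"Pass@1 = {score:.1%}")

# Share of the benchmark drawn from private commercial repositories.
commercial_share = 276 / 1865
print(f"Commercial subset = {commercial_share:.1%}")
```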
Breakthrough Assessment
9/10
Addresses the critical 'contamination' crisis in coding benchmarks by using private/GPL data and significantly raises the difficulty ceiling to match real industrial work. Likely to become the new standard for serious agent evaluation.