
Where Do AI Coding Agents Fail? An Empirical Study of Failed Agentic Pull Requests in GitHub

Ramtin Ehsani, Sakshi Pathak, Shriya Rawal, Abdullah Al Mujahid, Mia Mohammad Imran, Preetha Chatterjee
Drexel University, Missouri University of Science and Technology
arXiv (2026)
Agent Benchmark

📝 Paper Summary

Agentic software engineering, AI coding agents
A large-scale empirical study of 33,000 agent-authored GitHub pull requests reveals that agents struggle most with performance and bug-fix tasks, often failing due to reviewer abandonment, duplicate submissions, and CI build failures.
Core Problem
While AI coding agents increasingly act as autonomous contributors, it is unknown why many of their pull requests (PRs) fail to be merged in real-world software projects.
Why it matters:
  • Agentic contributions are rapidly increasing, but blindly deploying them wastes maintainer time if PRs are consistently rejected.
  • Prior benchmarks evaluate agents in isolation (e.g., generating code snippets), missing the complex socio-technical factors of real workflows like CI pipelines, code reviews, and project coordination.
Concrete Example: An agent submits a PR titled 'testing DO NOT MERGE' or re-implements a feature already covered by an existing PR. The maintainer either closes it with 'Superseded by PR #715' or ignores it entirely (reviewer abandonment); in both cases the agent's work and the maintainer's attention are wasted.
Key Novelty
Taxonomy of Agentic PR Failures
  • Analyzes 33,000 real-world PRs from five agents (Copilot, Codex, Devin, Cursor, Claude Code) to quantify merge rates across task types.
  • Develops a hierarchical taxonomy of rejection reasons (Reviewer, Pull Request, Code, Agentic levels) based on qualitative analysis of 600 rejected PRs.
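One way to picture the hierarchical taxonomy is as a mapping from rejection patterns to the four levels named above. The sketch below is illustrative only: the level names come from the summary, but the placement of each pattern under a level, the `PATTERN_LEVEL` table, and the `tally_by_level` helper are assumptions made for this example, not the authors' coding scheme.

```python
from collections import Counter
from enum import Enum


class Level(Enum):
    """The four top-level categories named in the summary."""
    REVIEWER = "Reviewer"
    PULL_REQUEST = "Pull Request"
    CODE = "Code"
    AGENTIC = "Agentic"


# ASSUMPTION: which level each pattern belongs to is a guess for illustration;
# the summary lists the levels and the patterns but not this mapping.
PATTERN_LEVEL = {
    "reviewer abandonment": Level.REVIEWER,
    "duplicate PR": Level.PULL_REQUEST,
    "CI/test failure": Level.CODE,
}


def tally_by_level(labels):
    """Count manually assigned rejection labels per taxonomy level."""
    return Counter(PATTERN_LEVEL[label] for label in labels)


# Hypothetical labels for four rejected PRs (not data from the study).
labels = [
    "reviewer abandonment",
    "duplicate PR",
    "reviewer abandonment",
    "CI/test failure",
]
counts = tally_by_level(labels)
```

Keeping the pattern-to-level table separate from the labels means a reviewer can relabel a PR or move a pattern to another level without touching the counting logic.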
Evaluation Highlights
  • OpenAI Codex achieves the highest merge rate (82.59%), while GitHub Copilot has the lowest (43.04%) among the agents studied.
  • Tasks related to documentation (84% merge rate) and CI (79%) are most successful; performance (55%) and bug-fix (64%) tasks are least successful.
  • Reviewer abandonment is the most frequent rejection pattern (38%), followed by duplicate PRs (23%) and CI/test failures (17%).
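The merge rates above are per-group percentages: merged PRs divided by all PRs in a group, where the group is an agent or a task type. A minimal sketch of that calculation follows. The record layout (`agent`, `task`, `merged` fields), the `merge_rates` helper, and the toy records are hypothetical and do not reflect the study's dataset or tooling.

```python
from collections import defaultdict


def merge_rates(prs, key):
    """Return {group: percent of PRs merged}, grouping records by `key`."""
    totals = defaultdict(int)
    merged = defaultdict(int)
    for pr in prs:
        group = pr[key]
        totals[group] += 1
        if pr["merged"]:
            merged[group] += 1
    return {group: 100.0 * merged[group] / totals[group] for group in totals}


# Hypothetical toy records, NOT data from the study.
sample_prs = [
    {"agent": "agent_a", "task": "docs", "merged": True},
    {"agent": "agent_a", "task": "fix", "merged": False},
    {"agent": "agent_b", "task": "docs", "merged": True},
    {"agent": "agent_b", "task": "perf", "merged": False},
]

by_agent = merge_rates(sample_prs, "agent")
by_task = merge_rates(sample_prs, "task")
```

Grouping by `"agent"` gives the kind of per-agent comparison reported in the first bullet, and grouping by `"task"` gives the per-task comparison in the second.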
Breakthrough Assessment
7/10
Provides the first large-scale empirical grounding for how autonomous agents perform in the wild, moving beyond synthetic benchmarks to reveal critical socio-technical failure modes.