Where Do AI Coding Agents Fail? An Empirical Study of Failed Agentic Pull Requests in GitHub

📝 Paper Summary

Agentic software engineering, AI coding agents

A large-scale empirical study of 33,000 agent-authored GitHub pull requests reveals that agents struggle most with performance and bug-fix tasks, often failing due to reviewer abandonment, duplicate submissions, and CI build failures.

Core Problem

AI coding agents increasingly act as autonomous contributors, yet it remains unclear why many of their pull requests (PRs) fail to be merged in real-world software projects.

Why it matters:

Agentic contributions are rapidly increasing, but blindly deploying them wastes maintainer time if PRs are consistently rejected.
Prior benchmarks evaluate agents in isolation (e.g., generating code snippets), missing the complex socio-technical factors of real workflows like CI pipelines, code reviews, and project coordination.

Concrete Example: An agent submits a PR titled 'testing DO NOT MERGE' or re-implements a feature already covered by an existing PR. The maintainer closes it with 'Superseded by PR #715' or ignores it entirely (reviewer abandonment), wasting resources.

Key Novelty

Taxonomy of Agentic PR Failures

Analyzes 33k real-world PRs from five agents (Copilot, Codex, Devin, Cursor, Claude Code) to quantify merge rates across task types.
Develops a hierarchical taxonomy of rejection reasons (Reviewer, Pull Request, Code, Agentic levels) based on qualitative analysis of 600 rejected PRs.

Evaluation Highlights

OpenAI Codex achieves the highest merge rate (82.59%), while GitHub Copilot has the lowest (43.04%) among the agents studied.
Tasks related to documentation (84% merge rate) and CI (79%) are most successful; performance (55%) and bug-fix (64%) tasks are least successful.
Reviewer abandonment is the most frequent rejection pattern (38%), followed by duplicate PRs (23%) and CI/test failures (17%).

Breakthrough Assessment

7/10

Provides the first large-scale empirical grounding for how autonomous agents perform in the wild, moving beyond synthetic benchmarks to reveal critical socio-technical failure modes.

⚙️ Technical Details

Problem Definition

Setting: Empirical analysis of agent-authored Pull Requests (PRs) in open-source GitHub repositories.

Inputs: Dataset of 33,596 PRs submitted by 5 AI agents (OpenAI Codex, GitHub Copilot, Devin, Cursor, Claude Code).

Outputs: Quantitative characterization of merge rates and qualitative taxonomy of rejection reasons.

Pipeline Flow

Data Collection (AIDev-pop dataset)
Quantitative Analysis (Merge rates, Code changes, CI outcomes)
Qualitative Analysis (Manual labeling of rejection reasons)

System Modules

Data Collection

Filter and select 33k agent-authored PRs from GitHub projects with >100 stars

Quantitative Metrics Extraction

Compute statistics on task types, code size, and process dynamics
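A minimal sketch of how merge rates per agent or per task type could be computed from PR records. The record fields (`agent`, `task`, `merged`) and the toy data are assumptions made for illustration; they are not the AIDev-pop schema or the study's data.

```python
from collections import defaultdict


def merge_rates(prs, key):
    """Return {group: merge rate in percent} for PR records grouped by `key`."""
    totals = defaultdict(int)
    merged = defaultdict(int)
    for pr in prs:
        group = pr[key]
        totals[group] += 1
        if pr["merged"]:
            merged[group] += 1
    return {group: 100.0 * merged[group] / totals[group] for group in totals}


# Hypothetical toy records, not data from the study.
toy_prs = [
    {"agent": "agent_a", "task": "docs", "merged": True},
    {"agent": "agent_a", "task": "fix", "merged": False},
    {"agent": "agent_b", "task": "docs", "merged": True},
    {"agent": "agent_b", "task": "docs", "merged": True},
]

rates_by_agent = merge_rates(toy_prs, "agent")
rates_by_task = merge_rates(toy_prs, "task")
```

Grouping by a single key keeps the sketch short; an agent-by-task breakdown like the heatmap would group on the pair `(pr["agent"], pr["task"])` instead.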

Qualitative Taxonomy Construction

Manually label rejection reasons for a subset of PRs

Novel Architectural Elements

Hierarchical taxonomy of agentic-PR rejection patterns distinguishing between socio-technical failures (abandonment, misalignment) and technical failures (build breaks, incorrect logic)

Comparison to Prior Work

vs. SWE-bench: Evaluates agents in the wild (live GitHub PRs) rather than a controlled sandbox, capturing social failures like reviewer abandonment and policy violations
vs. Human PR studies: Identifies unique failure modes for agents, such as 'hallucinated' dependencies or lack of response to instructions, which differ from typical human rejection reasons

Limitations

Relies on the AIDev-pop dataset's identification of agent-authored PRs, which may miss some agents or include false positives.
Qualitative analysis is limited to 600 PRs, a small fraction of the total 33k.
Does not interview maintainers directly to confirm the reasons behind 'reviewer abandonment'; the pattern is inferred from a lack of activity on the PR.

Reproducibility

The replication package is publicly available (referenced in conclusion). The study uses the existing AIDev-pop dataset. Specific scripts for metric extraction are part of the replication package.

📊 Experiments & Results

Evaluation Setup

Retrospective empirical analysis of GitHub Pull Requests.

Benchmarks:

AIDev-pop dataset subset (Real-world Pull Request submission)

Metrics:

Merge Rate (%)
Rejection Reason prevalence
Cliff's delta (effect size for code/process metrics)
Odds Ratios (logistic regression)
Statistical methodology: Cliff's delta for effect size; Logistic regression for predictive modeling; Cohen's kappa for inter-rater reliability.
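For the inter-rater reliability step, Cohen's kappa compares the observed agreement between two labelers with the agreement expected by chance. The sketch below is a generic implementation of the standard formula, not the authors' script; the label names and toy annotations are hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty item list")
    n = len(labels_a)
    # Fraction of items on which the two raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical annotations for four rejected PRs.
rater_1 = ["abandoned", "duplicate", "ci_failure", "abandoned"]
rater_2 = ["abandoned", "duplicate", "abandoned", "abandoned"]
kappa = cohens_kappa(rater_1, rater_2)
```

A kappa of 1 means perfect agreement and 0 means agreement no better than chance.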

Key Results

Merge rates vary significantly by agent type and task category.

Comparison	Metric	Lowest	Highest	Δ
By agent (GitHub Copilot vs. OpenAI Codex)	Merge Rate (%)	43.04	82.59	+39.55
By task type (performance vs. documentation)	Merge Rate (%)	55	84	+29

Quantitative differences between merged and not-merged PRs show that failures are associated with larger changes and CI breaks.

Comparison	Metric	Value (0 = no difference)
Merged vs. Not-Merged	Cliff's delta (LOC Changes)	0.17
Merged vs. Not-Merged	Cliff's delta (CI Failures)	0.24

Qualitative analysis of 600 rejected PRs reveals the primary reasons for failure.

Rejection Reason	Prevalence (%)
Reviewer abandonment	38
Duplicate PRs	23
CI/test failures	17

Experiment Figures

Heatmap of merge rates across 5 agents and 11 task types.

Kernel density plots comparing Merged vs. Not-Merged PRs for #LOC, #Files, and #CI Failures.

Main Takeaways

Agents struggle with complex logic: 'Performance' and 'Bug-fix' tasks have the lowest merge rates, while rote tasks like 'Documentation' and 'CI' updates succeed most often.
Socio-technical misalignment is a major blocker: 38% of the analyzed rejections are due to reviewer abandonment (no engagement), which may indicate that agents fail to signal value or trustworthiness.
Coordination failure: 23% of rejections are duplicates, indicating agents lack awareness of existing PRs or ongoing work.
Logistic regression confirms that larger code changes and CI failures strongly predict rejection; each failed CI check reduces merge odds by ~15%.
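To make the last takeaway concrete: a ~15% reduction in merge odds per failed CI check corresponds to an odds ratio of roughly 0.85. The sketch below converts that into merge probabilities under two assumptions that are not from the paper: a hypothetical baseline merge probability, and the usual logistic-model reading that the effect compounds multiplicatively across failed checks.

```python
def to_odds(probability):
    """Convert a probability in (0, 1) to odds."""
    return probability / (1.0 - probability)


def to_probability(odds):
    """Convert odds back to a probability."""
    return odds / (1.0 + odds)


def merge_probability(baseline_probability, failed_checks, odds_ratio=0.85):
    """Merge probability after `failed_checks` failed CI checks.

    `odds_ratio=0.85` encodes the reported ~15% drop in odds per failed check;
    `baseline_probability` is a hypothetical starting point.
    """
    adjusted_odds = to_odds(baseline_probability) * odds_ratio ** failed_checks
    return to_probability(adjusted_odds)


# Hypothetical baseline of 0.5 with zero to three failed checks.
for k in range(4):
    print(k, round(merge_probability(0.5, k), 3))
```

Note that the 15% applies to odds, not to probability: the drop in merge probability per failed check depends on the baseline.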

📚 Prerequisite Knowledge

Prerequisites

Understanding of Git/GitHub workflows (Pull Requests, Merging, CI/CD)
Familiarity with software maintenance task types (refactoring, docs, chore, etc.)

Key Terms

PR: Pull Request—a proposal to merge new code changes into a software repository

CI: Continuous Integration—automated systems that run tests and build checks whenever code is submitted

reviewer abandonment: A failure pattern where a PR receives no meaningful engagement or feedback from human maintainers before being closed

agentic workflow: A development process where AI agents autonomously plan, write, and submit code changes rather than just suggesting completions

Cliff's delta: A non-parametric effect size measure used to quantify the magnitude of difference between two distributions (e.g., merged vs. not-merged PRs)
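Cliff's delta can be computed directly from its definition: the share of pairs where a value from the first sample exceeds one from the second, minus the share where it is smaller. The sketch below is a generic implementation of that definition, and the sample values are hypothetical line counts, not data from the study.

```python
def cliffs_delta(xs, ys):
    """Cliff's delta: P(x > y) - P(x < y) over all pairs (x, y)."""
    if not xs or not ys:
        raise ValueError("both samples must be non-empty")
    greater = sum(1 for x in xs for y in ys if x > y)
    less = sum(1 for x in xs for y in ys if x < y)
    return (greater - less) / (len(xs) * len(ys))


# Hypothetical changed-line counts for not-merged and merged PRs.
not_merged_loc = [120, 300, 45, 800]
merged_loc = [30, 60, 45, 200]
delta = cliffs_delta(not_merged_loc, merged_loc)
```

The value ranges from -1 to 1, with 0 meaning the two distributions overlap completely. The pairwise loop is quadratic in sample size, which is fine for illustration but slow for tens of thousands of PRs.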