Agentic Software Issue Resolution with Large Language Models: A Survey

📝 Paper Summary

Agentic AI, Automated Software Engineering

This survey systematically reviews 126 studies on agentic software issue resolution, proposing a taxonomy across benchmarks, techniques, and empirical studies while highlighting the paradigm shift toward reinforcement learning.

Core Problem

Existing surveys treat the field piecemeal, as either automated program repair (APR) or code generation. As a result, they fail to capture the holistic, multi-step nature of issue resolution and the recent transition from prompt engineering to reinforcement learning (RL)-based training.

Why it matters:

Issue resolution encompasses diverse activities (optimization, feature addition) beyond just bug fixing, which traditional APR taxonomies overlook
Real-world resolution requires long-horizon reasoning and feedback-driven decision making, demanding agentic capabilities rather than single-step generation
A paradigm shift is under way in which researchers move from prompting general-purpose LLMs to training domain-specific models via reinforcement learning, and prior surveys miss this shift

Concrete Example: Traditional APR surveys assume that bug-triggering (failing) test cases are available, for example to drive coverage-based fault localization. In contrast, modern agentic issue resolution often starts from only a natural-language issue description, so the agent must autonomously locate relevant files, generate reproduction tests, and iterate on fixes without pre-existing trigger tests.
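Because no trigger test is given, the agent's reproduction test has to earn its keep: it should fail on the original code (showing the issue is reproduced) and pass once the candidate fix is applied. A minimal sketch of that check, assuming a toy issue ("`average()` crashes on an empty list") and hypothetical function names that do not come from the survey:

```python
# Toy issue (hypothetical): "average() crashes on an empty list instead of returning 0.0".

def average_original(xs):
    """Code as it exists in the repository before any fix."""
    return sum(xs) / len(xs)  # raises ZeroDivisionError when xs is empty

def average_patched(xs):
    """Candidate fix proposed by the agent."""
    return sum(xs) / len(xs) if xs else 0.0

def reproduction_test(average) -> bool:
    """Encode the reported scenario: True means the expected behaviour holds."""
    try:
        return average([]) == 0.0
    except ZeroDivisionError:
        return False

def reproduces_and_validates(test, original, patched) -> bool:
    # The test must fail on the original code (the issue is reproduced)
    # and pass on the patched code (the fix is validated).
    return not test(original) and test(patched)

print(reproduces_and_validates(reproduction_test, average_original, average_patched))  # True
```

A test that passes on both versions would be useless for validation, which is why both halves of the condition are checked.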

Key Novelty

Systematic Survey of Agentic Issue Resolution

Establishes the first comprehensive taxonomy specifically for LLM-based agentic issue resolution, covering three dimensions: benchmarks, techniques (workflow phases), and empirical studies
Identifies and formalizes the 'paradigm shift' in the field: the transition from scaffold-based prompt engineering to training-based methods leveraging reinforcement learning (RL) on LLMs
Integrates diverse software maintenance activities (bug fixing, feature addition, optimization) under a unified task definition, distinct from narrower APR or code generation scopes

Architecture

The typical framework/workflow of the automated issue resolution task as synthesized from the literature
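The figure itself is not reproduced here. As a rough illustration only, the sketch below chains the phases this summary mentions (locating relevant files, proposing a fix, validating it, and iterating on feedback) into a staged pipeline. All names (`localize`, `propose_patch`, `validate`, `resolve`) are hypothetical; the keyword-overlap localizer and the hard-coded candidate patches stand in for the retrieval and LLM components that real systems use.

```python
import re

def tokens(text):
    """Lower-cased identifier-like words in a piece of text."""
    return set(re.findall(r"[a-z_]+", text.lower()))

def localize(issue, repo, top_k=1):
    """Naive lexical localization: rank files by words shared with the issue."""
    issue_words = tokens(issue)
    ranked = sorted(repo, key=lambda path: len(issue_words & tokens(repo[path])), reverse=True)
    return ranked[:top_k]

def resolve(issue, repo, propose_patch, validate, max_attempts=3):
    """Localize once, then alternate patching and validation, feeding errors back."""
    files = localize(issue, repo)
    feedback = None
    for _ in range(max_attempts):
        patch = propose_patch(issue, files, feedback)
        ok, feedback = validate(patch)
        if ok:
            return patch
    return None  # give up after max_attempts

# Toy repository: file path -> contents.
repo = {
    "utils/stats.py": "def average(xs):\n    return sum(xs) / len(xs)\n",
    "cli/main.py": "import argparse\n",
}
issue = "average() raises ZeroDivisionError on an empty list"

# Hypothetical stand-in for an LLM: yields pre-written candidates in order.
candidates = iter([
    "def average(xs):\n    return sum(xs) / (len(xs) + 1)\n",  # handles [] but breaks other inputs
    "def average(xs):\n    return sum(xs) / len(xs) if xs else 0.0\n",
])

def propose_patch(issue, files, feedback):
    # A real agent would condition on `feedback`; this toy ignores it.
    return {files[0]: next(candidates)}

def validate(patch):
    """Hypothetical stand-in for running tests: exec the patched file and check cases."""
    namespace = {}
    for source in patch.values():
        exec(source, namespace)
    for xs, expected in [([], 0.0), ([1, 2, 3], 2.0)]:
        try:
            got = namespace["average"](xs)
        except Exception as exc:
            return False, f"average({xs}) raised {exc!r}"
        if got != expected:
            return False, f"average({xs}) returned {got}, expected {expected}"
    return True, None

result = resolve(issue, repo, propose_patch, validate)
print(localize(issue, repo))  # ['utils/stats.py']
print(result is not None)     # True
```

The first candidate fails validation with a concrete message, and the loop moves on to the second; fully autonomous agents replace this fixed loop with model-chosen actions.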

Evaluation Highlights

Systematic review of 126 recent studies, filtered from an initial pool of 385 papers
Analysis showing that 62.7% of papers in this rapidly evolving field are currently preprints (arXiv) rather than peer-reviewed publications
Bibliometric evidence that AI venues (26.9% of papers) currently outpace Software Engineering venues (10.4%) in publishing research on this task

Breakthrough Assessment

9/10

Essential and timely survey for a rapidly exploding field. It provides the first structured roadmap and taxonomy for agentic issue resolution, clearly distinguishing it from related fields like APR.

⚙️ Technical Details

Problem Definition

Setting: Automated Software Issue Resolution within real-world repositories

Inputs: Natural language issue description provided by users and the target code repository

Outputs: A valid code patch that resolves the issue (verified by passing tests)
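SWE-bench, for example, makes "verified by passing tests" concrete with two test lists per instance: tests that must flip from failing to passing (FAIL_TO_PASS) and tests that must keep passing (PASS_TO_PASS). The sketch below assumes that convention; the `IssueTask` class and `is_resolved` helper are hypothetical, not an API from the survey or from SWE-bench.

```python
from dataclasses import dataclass, field

@dataclass
class IssueTask:
    issue: str  # natural-language issue description (input)
    repo: str   # repository identifier (input); hypothetical field
    fail_to_pass: list[str] = field(default_factory=list)  # must go from failing to passing
    pass_to_pass: list[str] = field(default_factory=list)  # must keep passing (no regressions)

def is_resolved(task: IssueTask, results: dict[str, bool]) -> bool:
    """True only if every required test passed after applying the patch.
    Tests missing from `results` are treated as failures."""
    required = task.fail_to_pass + task.pass_to_pass
    return all(results.get(name, False) for name in required)

task = IssueTask(
    issue="average() crashes on an empty list",
    repo="example/stats",
    fail_to_pass=["test_average_empty"],
    pass_to_pass=["test_average_basic"],
)
print(is_resolved(task, {"test_average_empty": True, "test_average_basic": True}))   # True
print(is_resolved(task, {"test_average_empty": True, "test_average_basic": False}))  # False (regression)
print(is_resolved(task, {"test_average_empty": True}))                               # False (missing result)
```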

Comparison to Prior Work

vs. Liu et al. (2024c): This paper focuses specifically on the *issue resolution* task (benchmarks/metrics) rather than general SE agents, and includes newer RL-based training methods
vs. Yang et al. (2025a): This paper covers non-bug-fixing issues (features, optimization) and workflow modeling specific to issue resolution (e.g., lack of trigger tests), which APR surveys exclude
vs. Tao et al. (2025a): This paper treats issue resolution as a holistic agentic task requiring planning and environment interaction, rather than just a retrieval-augmented generation problem

Limitations

The survey relies on a keyword-based search strategy that may miss papers using non-standard terminology
The field is moving extremely fast (62.7% preprints), meaning the 'state-of-the-art' analysis may become outdated quickly
The quality assessment of included papers involves subjective judgment, although it follows defined criteria

Reproducibility

Code: https://github.com/ZhonghaoJiang/Awesome-Issue-Solving

The survey provides a curated list of all 126 reviewed papers and related resources at https://github.com/ZhonghaoJiang/Awesome-Issue-Solving.

📊 Experiments & Results

Evaluation Setup

Systematic Literature Review (SLR) following Kitchenham et al. guidelines

Metrics:

Number of relevant papers
Publication venue distribution
Research topic distribution (Benchmark vs. Technique vs. Empirical)
Statistical methodology: Descriptive statistics of the literature pool
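Descriptive statistics of this kind amount to counting papers per category and converting counts to percentages of the pool. A minimal sketch, using made-up placeholder records rather than the survey's actual data; the `distribution` helper is hypothetical:

```python
from collections import Counter

# Placeholder records for illustration only (not the survey's data).
papers = [
    {"id": "p1", "venue": "arXiv", "topic": "Technique"},
    {"id": "p2", "venue": "AI", "topic": "Benchmark"},
    {"id": "p3", "venue": "arXiv", "topic": "Technique"},
    {"id": "p4", "venue": "SE", "topic": "Empirical"},
    {"id": "p5", "venue": "arXiv", "topic": "Technique"},
]

def distribution(records, key):
    """Percentage of records per value of `key`, rounded to one decimal."""
    counts = Counter(r[key] for r in records)
    total = len(records)
    return {value: round(100 * n / total, 1) for value, n in counts.items()}

print(distribution(papers, "venue"))  # {'arXiv': 60.0, 'AI': 20.0, 'SE': 20.0}
print(distribution(papers, "topic"))  # {'Technique': 60.0, 'Benchmark': 20.0, 'Empirical': 20.0}
```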

Key Results

Measure	Reported Value	Comparison Point	Δ
Papers included after filtering	126	385 (initial pool)	-259
Preprint (arXiv) share	62.7%	37.3% (published in AI or SE venues)	+25.4 points
AI-venue share	26.9%	10.4% (SE venues)	+16.5 points
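The differences above are plain arithmetic on the reported figures. The check below recomputes them and also shows that the preprint, AI-venue, and SE-venue shares sum to 100%, suggesting that every reviewed paper falls into exactly one of the three groups:

```python
# Figures as reported in this summary (shares are percentages of the reviewed pool).
initial_pool, included = 385, 126
preprint_share, ai_share, se_share = 62.7, 26.9, 10.4

print(initial_pool - included)                            # 259 papers not included
print(round(preprint_share - (100 - preprint_share), 1))  # 25.4 percentage points
print(round(ai_share - se_share, 1))                      # 16.5 percentage points
print(round(preprint_share + ai_share + se_share, 1))     # 100.0
```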

Experiment Figures

Cumulative number of papers published over time (Oct 2023 - Oct 2025) and their venue distribution

Main Takeaways

Explosive growth in the field, observed from May 2024 onward, underscores the urgency of this survey
A distinct paradigm shift is underway: research is moving from 'Prompt Engineering' (using general models with scaffolds) to 'Agent Training' (using RL to specialize models for issue resolution)
Current taxonomies (APR, Code Gen) are insufficient for Issue Resolution because they ignore the complexity of environment interaction (e.g., generating reproduction tests) and diverse maintenance types
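Training-based approaches need a reward signal from the environment. One common choice, assumed here rather than taken from the survey, is a binary outcome reward computed from the repository's tests, optionally normalized within a group of rollouts for the same issue (a GRPO-style group-relative advantage). Test names and rollout results below are hypothetical.

```python
import statistics

def outcome_reward(results, required):
    """1.0 if every required test passes after the agent's final patch, else 0.0.
    An empty `required` list yields 0.0 to avoid rewarding vacuous success."""
    return 1.0 if required and all(results.get(t, False) for t in required) else 0.0

def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within a group of rollouts: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four hypothetical rollouts on the same issue, with per-test outcomes.
rollouts = [
    {"test_empty": True, "test_basic": True},
    {"test_empty": True, "test_basic": False},
    {"test_empty": False, "test_basic": True},
    {"test_empty": True, "test_basic": True},
]
required = ["test_empty", "test_basic"]
rewards = [outcome_reward(r, required) for r in rollouts]
print(rewards)                                            # [1.0, 0.0, 0.0, 1.0]
print([round(a, 3) for a in group_advantages(rewards)])  # [1.0, -1.0, -1.0, 1.0]
```

Group normalization lets successful rollouts be reinforced relative to failed ones on the same issue, without training a separate value model.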

📚 Prerequisite Knowledge

Prerequisites

Understanding of Large Language Models (LLMs) and their application in code generation
Familiarity with software maintenance workflows (bug fixing, testing, version control)
Basic knowledge of reinforcement learning concepts (used in recent agent training)

Key Terms

Issue Resolution: The task of understanding, locating, and resolving issues (bugs, features, optimizations) in code repositories based on natural language descriptions

Agentic Systems: Autonomous, goal-driven AI architectures that interpret objectives, plan multi-step tasks, and adapt behavior based on environmental feedback (e.g., compiler errors)

Scaffolds: External structured control frameworks that orchestrate task workflows, coordinate reasoning, and invoke tools to guide the LLM

APR: Automated Program Repair, a traditional field focused on fixing bugs that typically assumes failing test cases already exist

SWE-bench: A widely adopted benchmark for evaluating LLMs on real-world software engineering issues collected from GitHub repositories

Reinforcement Learning (RL): A training method where models learn to make sequences of decisions by receiving feedback (rewards) from the environment

Agentic Pipeline: A system decomposing issue resolution into a sequence of controllable, staged steps (less autonomous than full agents)

Reproduction Test: A test case generated to simulate the reported issue scenario, used to verify the bug and validate the fix

Snowballing: A literature search strategy involving iteratively inspecting the references (backward) and citations (forward) of collected papers
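Snowballing is essentially a breadth-first traversal of the citation graph, alternating backward steps (references) and forward steps (citing papers) until no new relevant papers turn up. A minimal sketch over a hypothetical citation graph, where the `is_relevant` predicate stands in for manual screening against inclusion criteria; all names are illustrative:

```python
from collections import deque

def invert(references):
    """Build a 'cited by' map from a 'references' map."""
    citations = {paper: [] for paper in references}
    for paper, refs in references.items():
        for ref in refs:
            citations.setdefault(ref, []).append(paper)
    return citations

def snowball(seeds, references, is_relevant):
    """Iterate backward (references) and forward (citations) snowballing."""
    citations = invert(references)
    included = set(seeds)
    seen = set(seeds)
    queue = deque(seeds)
    while queue:
        paper = queue.popleft()
        backward = references.get(paper, [])
        forward = citations.get(paper, [])
        for candidate in backward + forward:
            if candidate in seen:
                continue
            seen.add(candidate)
            if is_relevant(candidate):  # screening step
                included.add(candidate)
                queue.append(candidate)  # relevant papers are snowballed in turn
    return included

# Hypothetical graph: paper -> papers it references.
references = {"A": ["B", "C"], "B": ["D"], "C": [], "D": [], "E": ["A"]}
relevant = {"A", "B", "D", "E"}
print(sorted(snowball(["A"], references, lambda p: p in relevant)))  # ['A', 'B', 'D', 'E']
```

Irrelevant papers (here `C`) are screened out and not expanded further, which keeps the search focused.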