MASAI: Modular Architecture for Software-engineering AI Agents

Daman Arora, Atharv Sonwane, Nalin Wadhwa, Abhav Mehrotra, Saiteja Utpala, Ramakrishna Bairi, Aditya Kanade, Nagarajan Natarajan

Microsoft Research India

arXiv.org (2024)

Agent Reasoning Benchmark

📝 Paper Summary

Software Engineering Agents, Multi-Agent Systems

MASAI decomposes complex software engineering tasks into sub-problems handled by specialized sub-agents (reproducers, localizers, fixers) that employ different strategies to achieve state-of-the-art bug resolution.

Core Problem

Solving repository-scale software issues requires diverse skills (testing, localization, coding) that overwhelm single-agent architectures trying to maintain a single context window.

Why it matters:

Single strategies (like ReAct) struggle to generalize across the distinct phases of bug fixing (e.g., searching vs. patching).
Long reasoning trajectories in single agents inflate costs and fill context windows with irrelevant information, degrading performance.
Existing methods often fail to rigorously verify fixes because they lack dedicated sub-processes for reproducing issues via test cases.

Concrete Example: When fixing a bug in a large repo like Django, a single agent might get lost navigating hundreds of files. It might find the bug but fail to write a reproduction test, leading to a 'fix' that introduces new errors or doesn't actually solve the reported issue.

Key Novelty

Modular Strategy-Specific Sub-Agents

Instantiates distinct sub-agents for specific phases (Test Generation, Reproduction, Localization, Fixing, Ranking), each using the prompting strategy suited to its phase (ReAct or CoT).
Uses a 'lazy representation' for code retrieval to keep context concise, returning only signatures for files/classes and full bodies only for functions.
Decouples the 'Fixer' (which generates multiple patches) from the 'Ranker' (which validates them against a generated reproduction test).
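
The lazy representation described above can be illustrated with a minimal sketch. It uses Python's standard `ast` module rather than tree-sitter, and the helper names `lazy_view` and `function_body` are hypothetical, not taken from the MASAI code base. The idea shown: a file or class is summarized by signatures only, while a full body is returned only when a specific function is requested.

```python
import ast
from typing import Optional


def lazy_view(source: str) -> str:
    """Summarize a module as class headers and function signatures, without bodies."""
    tree = ast.parse(source)
    lines = []
    for node in tree.body:
        if isinstance(node, ast.ClassDef):
            lines.append(f"class {node.name}:")
            for item in node.body:
                if isinstance(item, (ast.FunctionDef, ast.AsyncFunctionDef)):
                    lines.append(f"    def {item.name}({ast.unparse(item.args)}): ...")
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            lines.append(f"def {node.name}({ast.unparse(node.args)}): ...")
    return "\n".join(lines)


def function_body(source: str, name: str) -> Optional[str]:
    """Return the full source of the first function or method called `name`."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)) and node.name == name:
            return ast.get_source_segment(source, node)
    return None


# Illustrative input (hypothetical code, not from any benchmark repository).
SAMPLE = '''
class Cart:
    def total(self, prices, tax=0.0):
        subtotal = sum(prices)
        return subtotal * (1 + tax)

def helper(x):
    return x + 1
'''

print(lazy_view(SAMPLE))               # signatures only
print(function_body(SAMPLE, "total"))  # full body, on request
```

The signature view keeps the agent's context short; the cost is an extra request whenever a body is actually needed.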

Architecture

Figure 2: The 5-stage pipeline of MASAI, illustrating the flow of information between sub-agents.

Evaluation Highlights

Achieves 28.33% resolution rate on SWE-bench Lite, the highest among reported methods at the time of publication.
Outperforms SWE-agent (18.00%) and AutoCodeRover (22.67%) on the same benchmark.
Demonstrates cost-efficiency, with an average per-issue cost of 1.96 USD.

Breakthrough Assessment

8/10

Significant improvement over SOTA on a very difficult benchmark (SWE-bench Lite). The modular design provides a clear blueprint for future engineering agents, though it relies heavily on the strength of the underlying model (GPT-4o).

⚙️ Technical Details

Problem Definition

Setting: Repository-level automated software issue resolution

Inputs: Issue description and a code repository

Outputs: A patch file that resolves the issue (verified by passing held-out tests)

Pipeline Flow

Group 1: Test Template Generator -> Issue Reproducer
Group 2: Edit Localizer
Group 3: Fixer -> Ranker
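
The three groups above can be sketched as a simple composition of callables. This is only an illustration under assumptions: the `Issue` type, the `run_pipeline` function, and every stage signature are hypothetical, and the stand-in stages below make no LLM calls. In particular, which inputs each stage receives (for example, that the localizer sees only the issue) is an assumption, not something the summary specifies.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Issue:
    description: str
    repo_path: str


def run_pipeline(
    issue: Issue,
    make_template: Callable[[Issue], str],
    reproduce: Callable[[Issue, str], Optional[str]],
    localize: Callable[[Issue], List[str]],
    fix: Callable[[Issue, List[str]], List[str]],
    rank: Callable[[Issue, List[str], Optional[str]], Optional[str]],
) -> Optional[str]:
    # Group 1: Test Template Generator -> Issue Reproducer
    template = make_template(issue)
    repro_test = reproduce(issue, template)
    # Group 2: Edit Localizer
    locations = localize(issue)
    # Group 3: Fixer -> Ranker
    candidates = fix(issue, locations)
    return rank(issue, candidates, repro_test)


# Usage with trivial stand-in stages.
demo_issue = Issue("add() subtracts instead of adding", "/tmp/repo")
chosen = run_pipeline(
    demo_issue,
    make_template=lambda i: "def test_placeholder(): pass",
    reproduce=lambda i, t: "def test_add(): assert add(1, 2) == 3",
    localize=lambda i: ["calc.py::add"],
    fix=lambda i, locs: [f"patch-{n} for {locs[0]}" for n in range(3)],
    rank=lambda i, patches, test: patches[0] if patches else None,
)
print(chosen)
```

Passing each stage as a callable mirrors the modular idea: any stage can be swapped for a different strategy without touching the others.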

System Modules

Test Template Generator (Testing)

Analyzes repo testing setup to create a blank template test and run command

Model or implementation: GPT-4o

Issue Reproducer (Testing)

Writes a specific test case that fails due to the bug (reproduction)

Model or implementation: GPT-4o

Edit Localizer

Navigates repo to find files/classes/functions needing edits

Model or implementation: GPT-4o

Fixer (Fixing)

Generates multiple candidate patches for identified locations

Model or implementation: GPT-4o

Ranker (Fixing)

Selects the best patch based on the reproduction test results

Model or implementation: GPT-4o
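
To make the Ranker's role concrete, here is a simplified, rule-based stand-in. The Ranker described above is GPT-4o-based; this sketch replaces the model with a plain pass/fail check so that only the selection logic is visible. The function `rank_patches` and the `passes_repro_test` callback are hypothetical names.

```python
from typing import Callable, List, Optional


def rank_patches(
    candidates: List[str],
    passes_repro_test: Optional[Callable[[str], bool]] = None,
) -> Optional[str]:
    """Pick the first candidate that makes the reproduction test pass."""
    if not candidates:
        return None
    if passes_repro_test is None:
        # No reproduction test was produced: fall back to generation order.
        return candidates[0]
    for patch in candidates:
        if passes_repro_test(patch):
            return patch
    # No candidate passes: still return something rather than nothing.
    return candidates[0]


# Usage with pre-computed (made-up) test outcomes per patch.
outcomes = {"patch-A": False, "patch-B": True, "patch-C": True}
best = rank_patches(list(outcomes), lambda p: outcomes[p])
print(best)  # patch-B
```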

Novel Architectural Elements

Strict separation of 'Localization' (ReAct) and 'Fixing' (CoT) into distinct agents
Dedicated 'Test Template Generator' agent to solve the specific sub-problem of understanding repository-specific test harnesses
Use of a 'Ranker' agent to validate multiple candidate patches against a generated reproduction test rather than relying on a single attempt

Modeling

Base Model: GPT-4o

Compute: Average cost per issue is 1.96 USD. Total experiment cost is estimated at well under 10,000 USD.
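
As a rough sanity check on the cost figures, the per-issue average can be scaled to a single pass over the benchmark. The assumption here is one run per issue over the 300 issues of SWE-bench Lite; the summary does not break down how the total experiment cost was accumulated.

```python
from decimal import Decimal

COST_PER_ISSUE_USD = Decimal("1.96")  # reported average per issue
NUM_ISSUES = 300                      # size of SWE-bench Lite

# Assumed: exactly one run per issue, no retries or ablations.
single_run_total = COST_PER_ISSUE_USD * NUM_ISSUES
print(single_run_total)  # 588.00
```

Under that assumption a full pass costs about 588 USD, which leaves room for many repeated runs within the stated bound.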

Comparison to Prior Work

vs. SWE-agent: MASAI uses modular sub-agents with differing strategies (CoT vs ReAct) rather than a single monolithic loop.
vs. AutoCodeRover: MASAI includes explicit 'Test Template Generator' and 'Issue Reproducer' agents to verify fixes via test execution, whereas ACR focuses primarily on search and retrieval tools.
vs. CodeR: MASAI achieves higher resolution (28.33% vs 16.33%) using a simpler composition flow without needing explicit inter-agent conversation protocols.

Limitations

Dependency on the quality of the 'Issue Reproducer'; if no reproduction test is created, the Ranker must rely solely on the issue description.
High reliance on the specific capabilities of GPT-4o; performance with weaker models is not explored.
The 'lazy representation' strategy might miss context if variable/function names are not semantically descriptive enough for the agent to request full bodies.

Reproducibility

Code: https://github.com/microsoft/MASAI

The code is publicly available. The system uses GPT-4o and relies on tree-sitter==0.21.1 for parsing.

📊 Experiments & Results

Evaluation Setup

Agents attempt to resolve GitHub issues in a sandboxed environment.

Benchmarks:

SWE-bench Lite (Automated Software Engineering / Bug Fixing)

Metrics:

Resolution rate (% of issues passing held-out tests)
Localization rate (% of issues where patch covers ground-truth files)
Application rate (% of patches that apply without syntax errors)
Statistical methodology: Not explicitly reported in the paper
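
The three rates above are simple proportions. The sketch below shows one way to compute them from per-issue outcomes; the `IssueResult` record and its field names are hypothetical, and the demo data is made up. It uses the number of issues as the denominator for all three rates, which is an assumption for the application rate.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class IssueResult:
    applied: bool    # the patch applied cleanly
    localized: bool  # the patch touched the ground-truth files
    resolved: bool   # the held-out tests passed


def percent(count: int, total: int) -> float:
    """Percentage rounded to two decimals; 0.0 for an empty set."""
    return round(100.0 * count / total, 2) if total else 0.0


def summarize(results: List[IssueResult]) -> Dict[str, float]:
    total = len(results)
    return {
        "application_rate": percent(sum(r.applied for r in results), total),
        "localization_rate": percent(sum(r.localized for r in results), total),
        "resolution_rate": percent(sum(r.resolved for r in results), total),
    }


demo = [
    IssueResult(True, True, True),
    IssueResult(True, True, False),
    IssueResult(True, False, False),
    IssueResult(False, False, False),
]
print(summarize(demo))
print(percent(85, 300))  # 28.33
```

On a 300-issue benchmark, 85 resolved issues is the count that rounds to the reported 28.33%; that count is inferred from the percentage, not quoted from the paper.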

Key Results

Benchmark	Metric	Baseline method	Baseline (%)	This Paper (%)	Δ (points)
SWE-bench Lite	Resolution rate	AutoCodeRover	22.67	28.33	+5.66
SWE-bench Lite	Resolution rate	SWE-agent	18.00	28.33	+10.33
SWE-bench Lite	Resolution rate	(not specified)	12.67	28.33	+15.66

Experiment Figures

Bar chart comparing Resolution Rate of MASAI against SOTA methods on SWE-bench Lite.

Main Takeaways

Modular architectures outperform monolithic ReAct loops (like SWE-agent) by allowing specialized strategies for different phases of bug fixing.
The generated reproduction test is crucial; it allows the Ranker to filter out incorrect patches, significantly boosting the resolution rate.
Using 'lazy representations' (signatures only) for file reading helps manage context window limits without sacrificing necessary information for localization.

📚 Prerequisite Knowledge

Prerequisites

Understanding of agentic patterns (ReAct, CoT)
Familiarity with software engineering workflows (test-driven development)
Knowledge of RAG and retrieval concepts

Key Terms

ReAct: Reasoning + Acting—a strategy where LLMs alternate between reasoning about a problem and executing actions (like running code) to solve it
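
A minimal sketch of the ReAct control flow follows. The LLM is replaced by a scripted `policy` function so the example runs offline; `react_loop`, `scripted_policy`, and the `search` tool are all hypothetical names used only for illustration.

```python
from typing import Callable, Dict, List, Optional, Tuple

History = List[Tuple[str, str]]


def react_loop(
    policy: Callable[[History], Tuple[str, str, str]],
    tools: Dict[str, Callable[[str], str]],
    task: str,
    max_steps: int = 5,
) -> Tuple[Optional[str], History]:
    """Alternate between a reasoning step and a tool call until 'finish'."""
    history: History = [("task", task)]
    for _ in range(max_steps):
        thought, action, arg = policy(history)  # reasoning step
        history.append(("thought", thought))
        if action == "finish":
            return arg, history
        observation = tools[action](arg)        # acting step
        history.append(("observation", observation))
    return None, history                        # step budget exhausted


def scripted_policy(history: History) -> Tuple[str, str, str]:
    # Stand-in for an LLM: search once, then finish with what was observed.
    observations = [text for kind, text in history if kind == "observation"]
    if not observations:
        return ("I should find where add is defined.", "search", "def add")
    return ("The search located the function.", "finish", observations[-1])


tools = {"search": lambda query: "calc.py:1"}
answer, trace = react_loop(scripted_policy, tools, "Locate the add function")
print(answer)  # calc.py:1
```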

CoT: Chain of Thought—prompting the LLM to generate intermediate reasoning steps before producing a final answer

SWE-bench Lite: A dataset of 300 real-world GitHub issues and pull requests from Python repositories used to benchmark software engineering agents

Lazy Representation: A retrieval strategy that initially returns only high-level signatures (class/function names) to save context, providing full code bodies only upon specific request

Diff: A file showing the differences between two versions of code (additions and deletions)
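
For readers unfamiliar with the format, a unified diff can be produced with Python's standard `difflib` module. The file name `calc.py` and its contents are made up for this example.

```python
import difflib

before = ["def add(a, b):\n", "    return a - b\n"]
after = ["def add(a, b):\n", "    return a + b\n"]

# Lines starting with '-' are deletions, lines starting with '+' are additions.
patch = "".join(
    difflib.unified_diff(before, after, fromfile="a/calc.py", tofile="b/calc.py")
)
print(patch)
```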

Tree-sitter: A parser generator tool and incremental parsing library used to build a concrete syntax tree for source files, enabling the agent to understand code structure

BM25: Best Matching 25—a ranking function used by search engines to estimate the relevance of documents to a given search query
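
A compact sketch of Okapi BM25 scoring is given below for orientation. It uses whitespace tokenization and the common defaults `k1=1.5` and `b=0.75`, both of which are assumptions made here; it is a textbook formulation and not code from MASAI or any baseline.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: str, docs: List[str], k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score each document against the query with Okapi BM25."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for d in tokenized for term in set(d))
    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            n_t = doc_freq[term]
            idf = math.log((n_docs - n_t + 0.5) / (n_t + 0.5) + 1)
            # Term frequency saturates via k1; b controls length normalization.
            denom = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores


docs = [
    "fix null pointer in parser",
    "update readme badges",
    "parser crashes on empty input",
]
scores = bm25_scores("parser empty", docs)
print(scores)
```

The third document matches both query terms and so scores highest; the second matches none and scores zero.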

RAG: Retrieval-Augmented Generation—optimizing LLM output by referencing an authoritative knowledge base outside its training data