Agentic Software Engineering: Foundational Pillars and a Research Roadmap

📝 Paper Summary

Agentic Software Engineering (SE 3.0)
Human-Agent Collaboration
Software Engineering Process Frameworks

The paper proposes Structured Agentic Software Engineering (SASE), a framework replacing ad-hoc prompting with dual workbenches (Human vs. Agent) and formal artifacts to bridge the gap between agent speed and engineering trust.

Core Problem

Autonomous coding agents generate code rapidly but fail to meet 'merge-ready' quality standards (hygiene, subtle regressions), creating a 'speed vs. trust' gap that overwhelms human reviewers.

Why it matters:

Current agents (e.g., Devin, Claude Code) are hyper-productive but unreliable, with high failure rates in broader CI checks despite passing unit tests
Ad-hoc conversational prompting fails to establish the robust processes needed for reproducibility, auditing, and scaling N-to-N human-agent collaboration
The industry lacks a standardized taxonomy for SE autonomy, confusing simple assistance (auto-complete) with true goal-oriented agency

Concrete Example: A coding agent might pass all unit tests for a 'caching layer' task yet introduce subtle behavioral regressions or style violations, forcing humans to painstakingly review massive volumes of generated code. SWE-Bench analyses illustrate the risk: 29.6% of plausible fixes were incorrect.

Key Novelty

Structured Agentic Software Engineering (SASE)

Proposes a 'Structured Duality' separating SE into two modalities: SE for Humans (strategic coaching via ACE) and SE for Agents (execution via AEE)
Replaces informal chat with structured, version-controlled artifacts (e.g., Merge-Readiness Packs, BriefingScripts) to manage the human-agent contract
Introduces a 6-level hierarchy for SE Autonomy, analogous to SAE driving levels, distinguishing between 'Task-Agentic' (Level 2) and 'Goal-Agentic' (Level 3) systems
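
The hierarchy can be pictured as an ordered enumeration. The following is a minimal sketch, not the paper's definition: only the labels this summary states (Level 2 as Task-Agentic, Level 3 as Goal-Agentic, Level 5 as General Domain Autonomy) are filled in, and the names `AutonomyLevel`, `KNOWN_LABELS`, `is_goal_delegation`, and `describe` are hypothetical. Treating Level 3 and above as "goal delegation" is an assumption drawn from the Task-Agentic vs. Goal-Agentic distinction.

```python
from enum import IntEnum


class AutonomyLevel(IntEnum):
    """Six ordered levels (0-5); member names are placeholders, not the paper's."""
    L0 = 0
    L1 = 1
    L2 = 2
    L3 = 3
    L4 = 4
    L5 = 5


# Only the labels stated in this summary; the other levels are left unnamed.
KNOWN_LABELS = {
    AutonomyLevel.L2: "Task-Agentic",
    AutonomyLevel.L3: "Goal-Agentic",
    AutonomyLevel.L5: "General Domain Autonomy",
}


def is_goal_delegation(level: AutonomyLevel) -> bool:
    """Assumption: systems at Level 3 or above are handed goals, not single tasks."""
    return level >= AutonomyLevel.L3


def describe(level: AutonomyLevel) -> str:
    """Render a level with its label, or a placeholder when none is known."""
    label = KNOWN_LABELS.get(level, "(label not given in this summary)")
    return f"Level {int(level)}: {label}"
```

Because `IntEnum` members compare as integers, the ordering of the levels comes for free, which is the property the analogy to driving-automation levels relies on.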

Architecture

The Structured Agentic Software Engineering (SASE) framework visualization, contrasting the two modalities (SE4H and SE4A)

Breakthrough Assessment

8/10

Provides a necessary, rigorous conceptual scaffold (SASE) and taxonomy (Levels 0-5) to move the field from ad-hoc demos to disciplined engineering, though the framework itself is theoretical.

⚙️ Technical Details

Problem Definition

Setting: Orchestrating collaborative teams of humans and autonomous agents to deliver trustworthy software at scale

Inputs: High-level intent (BriefingScript), Strategy (LoopScript), Best Practices (MentorScript)

Outputs: Merge-Ready Software, verified via Merge-Readiness Packs (MRPs)

Pipeline Flow

Human Coach (defines intent via BriefingScript in ACE)
Agent Team (executes workflow in AEE)
Feedback Loop (exchanges CRPs/MRPs and VCRs)
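
The flow above can be sketched as a small loop in which the agent either asks for help (CRP) or claims completion (MRP), and every human answer is recorded (VCR). This is an illustrative sketch under stated assumptions: the paper presents no implementation, so all class fields, the `run_loop` function, and the toy agent and coach are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass
class BriefingScript:
    goal: str  # high-level intent written by the human coach


@dataclass
class CRP:  # Consultation Request Pack: the agent asks for human input
    question: str


@dataclass
class MRP:  # Merge-Readiness Pack: the agent claims the work is ready
    summary: str
    evidence: list[str]


@dataclass
class VCR:  # Version Controlled Resolution: the human's recorded answer
    answer: str


AgentStep = Callable[[BriefingScript, list[VCR]], Union[CRP, MRP]]
CoachStep = Callable[[CRP], VCR]


def run_loop(brief: BriefingScript, agent: AgentStep, coach: CoachStep,
             max_rounds: int = 5) -> tuple[MRP, list[object]]:
    """Alternate agent and coach turns until the agent emits an MRP."""
    resolutions: list[VCR] = []
    audit_log: list[object] = [brief]  # every exchanged artifact, in order
    for _ in range(max_rounds):
        pack = agent(brief, resolutions)
        audit_log.append(pack)
        if isinstance(pack, MRP):
            return pack, audit_log
        vcr = coach(pack)  # the human resolves the consultation request
        resolutions.append(vcr)
        audit_log.append(vcr)
    raise RuntimeError("no MRP produced within max_rounds")


def toy_agent(brief: BriefingScript, resolutions: list[VCR]) -> Union[CRP, MRP]:
    """Hypothetical agent: asks one question, then declares the task ready."""
    if not resolutions:
        return CRP(question="Which eviction policy should the cache use?")
    return MRP(summary=f"{brief.goal} ({resolutions[-1].answer})",
               evidence=["unit tests passed"])


def toy_coach(crp: CRP) -> VCR:
    """Hypothetical coach: always answers with the same policy."""
    return VCR(answer="LRU")
```

Calling `run_loop(BriefingScript(goal="Add a caching layer"), toy_agent, toy_coach)` returns the final MRP together with an ordered log of every artifact exchanged, which is the traceability property the structured artifacts are meant to provide.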

System Modules

Agent Command Environment (ACE)

Command center for humans to specify intent, monitor observability signals, and review agent artifacts

Model or implementation: Human-centric UI/Dashboard

Agent Execution Environment (AEE)

Digital workbench for agents to execute tasks, run compilers/tests, and access tools without human UI overhead

Model or implementation: Agent-centric Sandbox/Environment

Novel Architectural Elements

Separation of concerns into two distinct environments (ACE for humans, AEE for agents) rather than a single shared IDE
Use of formal, machine-readable 'Pack' artifacts (MRP, CRP) as the primary interface between human and agent, replacing chat logs
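
To make "machine-readable" concrete, one possible reading is that a pack is serialized data that tooling can gate on before a human ever looks at it. The sketch below assumes a JSON encoding with a `checks` mapping; the field names, the check names in `REQUIRED_CHECKS`, and the `"passed"` status string are all hypothetical and do not come from the paper.

```python
import json

# Hypothetical check names, loosely based on the quality concerns named above
# (unit tests, broader CI checks, code hygiene).
REQUIRED_CHECKS = ("unit_tests", "ci_pipeline", "style_lint")


def missing_evidence(mrp_json: str) -> list[str]:
    """Return the required checks that the pack does not report as passed."""
    pack = json.loads(mrp_json)
    checks = pack.get("checks", {})
    return [name for name in REQUIRED_CHECKS if checks.get(name) != "passed"]


# A pack that passes unit tests but fails CI and omits the lint result.
example_pack = json.dumps({
    "task": "Add a caching layer",
    "checks": {"unit_tests": "passed", "ci_pipeline": "failed"},
})
```

Here `missing_evidence(example_pack)` reports the CI and lint checks as unmet, so the pack could be bounced back to the agent automatically instead of consuming reviewer time. A chat log offers no equivalent hook.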

Comparison to Prior Work

vs. Standard IDEs: SASE proposes moving agents out of the human IDE and into a dedicated environment (AEE) to allow massive parallelism without cognitive overload
vs. Chat-based Agents: SASE replaces 'chat' with structured, version-controlled artifacts (BriefingScript, MRP) for auditability
vs. PDAR: SASE expands the loop into a complete engineering methodology with specific roles and environment separation

Limitations

The framework is currently visionary and conceptual; no concrete implementation or empirical evaluation is presented
General Domain Autonomy (Level 5) is proposed in theory, but no existing system achieves it
Requires a fundamental shift in developer mindset from 'coding' to 'coaching' which may face adoption resistance

Reproducibility

This is a conceptual vision paper. No specific software implementation, model weights, or datasets are provided. The framework serves as a roadmap for future implementation.

📊 Experiments & Results

Evaluation Setup

This is a position paper proposing a roadmap. It analyzes existing industry trends and defines a taxonomy (Levels 0-5) rather than conducting empirical experiments.

Metrics: Not applicable (no empirical evaluation is reported)

Statistical methodology: Not applicable

Main Takeaways

The industry is transitioning from AI-Augmented SE (Level 2, e.g., Copilot) to Agentic SE (Level 3, e.g., Devin), necessitating a shift from 'task assistance' to 'goal delegation'
The current 'Speed vs. Trust' gap creates a bottleneck where human review of agent code negates productivity gains; structured artifacts (MRPs) are proposed to enforce quality before review
The definition of a '10x developer' is shifting from raw coding prowess to the ability to orchestrate fleets of agents (Agent Coaches)
True autonomy (Level 4/5) requires agents to specialize not just in tech stacks but in quality attributes (security, performance) across domains

📚 Prerequisite Knowledge

Prerequisites

Software Engineering (SE) lifecycle (CI/CD, Pull Requests)
Generative AI and Large Language Models
Autonomous Agents concepts (planning, tool use)

Key Terms

SASE: Structured Agentic Software Engineering—the proposed framework emphasizing structured artifacts and dual workbenches for human-agent collaboration

ACE: Agent Command Environment—a workbench optimized for human 'Agent Coaches' to strategize, orchestrate, and review agent work

AEE: Agent Execution Environment—a digital workbench optimized for agents to execute tasks, run tests, and perform massive parallel computation

MRP: Merge-Readiness Pack—a structured, agent-generated artifact presenting evidence-backed deliverables (code, test results) to prove a task is ready for merging

CRP: Consultation Request Pack—an agent-generated artifact formally requesting human expertise when the agent faces ambiguity or trade-offs

VCR: Version Controlled Resolution—an auditable human response to an agent's request, ensuring the collaboration loop is traceable

BriefingScript: A machine-readable mission plan authored by humans to define high-level intent for agents

SE 3.0: Agentic Software Engineering—the era where agents autonomously plan and execute goals (Goal-Agentic), succeeding AI-Augmented SE (SE 2.0)

Task-Agentic: Systems that execute a specific planned change (e.g., GitHub Copilot), corresponding to Level 2 of the paper's SE autonomy hierarchy (which is modeled on the SAE driving levels)

Goal-Agentic: Systems that plan and execute to achieve a high-level goal (e.g., Devin), corresponding to Level 3 of the paper's SE autonomy hierarchy