Measuring agents in production

📝 Paper Summary

Agent deployment | Production systems

A systematic study of 86 deployed agents reveals that practitioners prioritize simplicity and reliability over advanced autonomy, favoring static workflows and human evaluation over complex planning and automatic benchmarking.

Core Problem

Despite widespread excitement and research on complex autonomous agents, little is known about how successful agents are actually built and deployed in production, creating a gap between research directions and real-world engineering needs.

Why it matters:

Agent deployments often fail or underdeliver, yet successful patterns remain proprietary
Research focuses on latency reduction and complex reasoning (RL, dynamic planning), while industry may value different attributes
The field lacks shared data on the technical methods that enable reliability in real-world applications

Concrete Example: While researchers optimize for sub-second latency, 66% of production agents tolerate response times of minutes because they replace even slower human workflows (e.g., clinicians seeking insurance approval). Research emphasizes fully autonomous planning, but 68% of production agents are constrained to at most 10 steps before human intervention to ensure reliability.

Key Novelty

MAP (Measuring Agents in Production)

First large-scale empirical study collecting primary data from 20 in-depth interviews and 306 survey responses (86 deployed systems) across 26 domains
Characterizes the 'simplicity-first' engineering paradigm of production agents: static workflows, human-in-the-loop evaluation, and reliance on prompting frontier models rather than fine-tuning

Architecture

Prevalence of Model Types and Custom vs. Framework implementations (Visual summary of technical stack)

Evaluation Highlights

70% of deployed agents rely on prompting off-the-shelf models rather than weight tuning (SFT/RL)
68% of production agents execute at most 10 steps before human intervention, prioritizing control over long-horizon autonomy
74% of teams depend primarily on human-in-the-loop evaluation rather than automated benchmarks

Breakthrough Assessment

9/10

Provides rare, grounded evidence contradicting common research assumptions (e.g., latency sensitivity, need for RL). Essential reading for aligning agent research with reality.

⚙️ Technical Details

Problem Definition

Setting: Empirical analysis of production software systems integrating LLMs for multi-step tasks

Inputs: Survey data (306 respondents) and interview transcripts (20 case studies)

Outputs: Taxonomy of design decisions, challenges, and architectural patterns

Pipeline Flow

Data Collection (Interviews + Survey)
Data Filtering (Verified Production/Pilot status)
Analysis (Qualitative Coding + Quantitative Statistics)
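
The three stages above can be sketched as a short script. This is only an illustration of the flow: the `Response` record, its field names, and the sample rows are hypothetical and are not taken from the study's data or code.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Response:
    """One hypothetical survey response (field names are assumptions)."""
    respondent_id: int
    status: str      # e.g. "production", "pilot", "prototype"
    adaptation: str  # e.g. "prompting", "fine-tuning"


def filter_deployed(responses):
    """Stage 2: keep only systems reported as production or pilot."""
    return [r for r in responses if r.status in {"production", "pilot"}]


def prevalence(responses, field_name):
    """Stage 3 (quantitative part): share of responses per category, in percent."""
    total = len(responses)
    if total == 0:
        return {}
    counts = Counter(getattr(r, field_name) for r in responses)
    return {category: 100.0 * n / total for category, n in counts.items()}


# Stage 1 stand-in: made-up rows, not data from the paper.
collected = [
    Response(1, "production", "prompting"),
    Response(2, "prototype", "prompting"),
    Response(3, "pilot", "fine-tuning"),
    Response(4, "production", "prompting"),
    Response(5, "pilot", "prompting"),
]

deployed = filter_deployed(collected)
shares = prevalence(deployed, "adaptation")
print(len(deployed), shares)
```

Filtering before counting matters here: percentages such as "70% rely on prompting" are stated over deployed systems, not over all respondents.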

System Modules

Case Studies (Data Collection)

Deep dive into 20 specific production systems via semi-structured interviews

Model or implementation: Human Interviewers

Survey (Data Collection)

Broad quantitative data collection

Model or implementation: Qualtrics Survey

Novel Architectural Elements

First systematic taxonomy of production agent architectures (e.g., finding that 85% of case studies use custom implementations over frameworks like LangChain)
Identification of the 'simplicity-first' reliability pattern: constraining agent autonomy to at most 10 steps before human intervention and relying on static workflows
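
A step cap with human handoff, as described in this summary, might look roughly like the sketch below. The paper does not publish an implementation; `run_bounded`, `RunResult`, and the stand-in step functions are hypothetical names, and the default cap of 10 is an assumption chosen to mirror the reported step limit.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

MAX_STEPS = 10  # assumed cap, mirroring the reported step limit


@dataclass
class RunResult:
    output: Optional[str]
    steps_taken: int
    escalated: bool  # True when a human must take over
    trace: List[str] = field(default_factory=list)


def run_bounded(step_fn: Callable[[int, List[str]], Optional[str]],
                max_steps: int = MAX_STEPS) -> RunResult:
    """Call step_fn until it returns a final answer or the budget runs out.

    step_fn(step_index, trace) returns None to continue, or a string when done.
    If the budget is exhausted, the run is flagged for human review.
    """
    trace: List[str] = []
    for i in range(max_steps):
        answer = step_fn(i, trace)
        trace.append(f"step {i}")
        if answer is not None:
            return RunResult(answer, i + 1, escalated=False, trace=trace)
    return RunResult(None, max_steps, escalated=True, trace=trace)


# Hypothetical step functions standing in for model and tool calls.
def finishes_on_third_step(i, trace):
    return "done" if i == 2 else None


def never_finishes(i, trace):
    return None


ok = run_bounded(finishes_on_third_step)
stuck = run_bounded(never_finishes)
print(ok.steps_taken, ok.escalated)
print(stuck.steps_taken, stuck.escalated)
```

The design choice is that exhausting the budget is not treated as an error: it is a normal outcome that routes the task to a person.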

Modeling

Base Model: Anthropic Claude (Sonnet 4, Opus 4.1) and OpenAI GPT (o3) [Most common in case studies]

Training Method: Prompting (70% of cases) vs. Fine-tuning/RL (30% of cases)

Adaptation: Prompt Engineering (Manual or LLM-assisted)

Trainable Parameters: None (for the majority of production systems)

Compute: Not reported in the paper

Comparison to Prior Work

vs. LangChain Survey: MAP focuses strictly on technical engineering decisions (architectures, models, latency) of deployed systems vs. general sentiment
vs. Academic Surveys: MAP reveals that production systems largely forgo common research techniques (RL, dynamic planning) in favor of reliability via constraints
vs. Single-System Reports: MAP aggregates data across 26 domains (Finance, Healthcare, etc.) to find universal deployment patterns

Limitations

Geographic bias toward Americas and Europe in case studies
Participation bias: successful teams are more likely to share data than failed ones
Temporal bias: rapid field evolution means 'state-of-the-art' models changed during the 7-month study
Reliance on self-reported data from practitioners rather than direct instrumentation of their systems

Reproducibility

Code: https://github.com/swe-agent/map

Survey questions are provided in Appendix G. Full raw data is not public to protect participant anonymity, but aggregated data and methodology are detailed. The study analyzes proprietary systems, so the agents themselves are not reproducible.

📊 Experiments & Results

Evaluation Setup

Survey and Interview analysis of real-world practices

Metrics:

Prevalence of techniques (percentage of systems using X)
Latency requirements
Step counts
Success metrics (Productivity vs. Quality)

Statistical methodology: 95% confidence intervals are computed from 1,000 bootstrap samples drawn with replacement, for categorical comparisons
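
The bootstrap procedure mentioned above can be illustrated for a single proportion. The summary does not say which bootstrap interval the authors use, so the percentile method below is an assumption, and the 0/1 data are made up for the example rather than taken from the survey.

```python
import random


def bootstrap_ci(outcomes, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of 0/1 outcomes."""
    rng = random.Random(seed)
    n = len(outcomes)
    stats = []
    for _ in range(n_boot):
        sample = rng.choices(outcomes, k=n)  # resample with replacement
        stats.append(sum(sample) / n)
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Made-up data: 60 of 86 hypothetical systems use some technique.
outcomes = [1] * 60 + [0] * 26
point = sum(outcomes) / len(outcomes)
low, high = bootstrap_ci(outcomes)
print(f"{point:.3f} [{low:.3f}, {high:.3f}]")
```

With sample sizes in the tens, intervals of this kind are wide, which is worth keeping in mind when comparing the headline percentages.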

Key Results

Practitioners overwhelmingly favor simple, controllable techniques to ensure reliability in production.

Benchmark	Metric	Baseline (%)	This Paper (%)	Δ (percentage points)
Survey of Deployed Agents	Percentage using Off-the-Shelf Models	30	70	+40
Survey of Deployed Agents	Percentage executing ≤10 steps	32	68	+36
Survey of Deployed Agents	Percentage using Human Evaluation	25	74	+49
Survey of Deployed Agents	Percentage citing Productivity	Not reported in the paper	80	Not reported in the paper

Experiment Figures

Latency requirements for deployed agents

Number of steps executed before human intervention

Main Takeaways

Reliability is achieved through system-level design (constraints, human loops) rather than algorithmic advances (RL, automated planning)
Latency is not a primary bottleneck: 66% of agents allow minutes-scale response times because they replace even slower human work
Custom implementations dominate: 85% of case studies build in-house rather than using frameworks like LangChain, citing flexibility and security
Evaluation is a major gap: Teams rely on A/B testing and expert feedback because relevant benchmarks do not exist for their specific domains

📚 Prerequisite Knowledge

Prerequisites

Basic understanding of LLM agents (tools, planning, memory)
Familiarity with software engineering deployment concepts (latency, CI/CD, reliability)
Knowledge of evaluation methods (benchmarks vs. human review)

Key Terms

SFT: Supervised Fine-Tuning—training a model on labeled examples to specialize it for a task

RL: Reinforcement Learning—training agents to maximize a reward signal through trial and error

Agentic RAG: Retrieval-Augmented Generation where the retrieval process is actively managed by an agent (e.g., via tool calls) rather than a fixed pipeline

DSPy: A framework for programming language models that optimizes prompts automatically

LangChain: A popular open-source framework for building applications with LLMs

Closed-source models: Proprietary models like GPT-4 or Claude whose weights and training data are not public

Human-in-the-loop: Systems designed so that a human must review, approve, or interact with the agent's output at critical steps

System-level design: Improving reliability through architecture (guardrails, retries, constraints) rather than improving the core AI model itself
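
To make the last term concrete, here is a minimal sketch of a validate-and-retry wrapper that falls back to human escalation. It is not drawn from any system in the study: `call_with_guardrails`, `flaky_model`, and `looks_valid` are hypothetical names, and the model call is a stand-in function rather than a real API.

```python
from typing import Callable, Optional


def call_with_guardrails(generate: Callable[[str], str],
                         validate: Callable[[str], bool],
                         prompt: str,
                         max_retries: int = 2) -> Optional[str]:
    """Return the first validated output, or None to signal human escalation."""
    for _ in range(max_retries + 1):
        candidate = generate(prompt)
        if validate(candidate):
            return candidate
    return None


# Hypothetical stand-in for a model call: fails once, then succeeds.
attempts = {"n": 0}


def flaky_model(prompt: str) -> str:
    attempts["n"] += 1
    return "" if attempts["n"] == 1 else f"APPROVED: {prompt}"


def looks_valid(text: str) -> bool:
    return text.startswith("APPROVED:")


result = call_with_guardrails(flaky_model, looks_valid, "claim 123")
print(result, attempts["n"])
```

Nothing about the model changes in this pattern; reliability comes from the checks and the bounded retry budget around it.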