An Empirical Study of Testing Practices in Open Source AI Agent Frameworks and Agentic Applications

📝 Paper Summary

Quality Assurance for AI Agents · Software Testing · Agent Framework Analysis

This empirical study reveals that developers of AI agent frameworks prioritize testing deterministic infrastructure like tools over the non-deterministic Foundation Model components, leaving prompts and planning logic largely unverified.

Core Problem

Testing Foundation Model (FM)-based agents is difficult because their outputs are non-deterministic and hard to reproduce, yet little is known about how developers actually verify internal correctness beyond high-level benchmarks.

Why it matters:

Agents deployed in real-world scenarios face edge cases, infinite loops, and hallucinations that standard benchmarks (like AgentBench) fail to detect.
Without robust regression testing, model upgrades can silently degrade performance (e.g., prompt drift).
The rapid evolution of agent components (tools, memory, planning) creates a complex architecture where failure modes are poorly understood.

Concrete Example: A developer builds a storytelling agent. Initially, it works fine. Later, a silent update to the underlying FM alters how it interprets prompts, causing the stories to become incoherent. Because the developer only used high-level benchmarks and no specific unit tests for the prompt (Trigger component), this degradation goes undetected until user complaints arrive.
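A regression check for the scenario above could look like the following minimal sketch. It is illustrative rather than taken from the paper: `build_story_prompt`, `story_problems`, and the two stub models are hypothetical names. In practice the `generate` callable would wrap the real FM, and the checks would be tuned to the application.

```python
from typing import Callable

def build_story_prompt(character: str, max_sentences: int) -> str:
    """Render the prompt (the Trigger) sent to the FM. Hypothetical template."""
    return (f"Write a coherent story about {character} "
            f"in at most {max_sentences} sentences.")

def story_problems(story: str, character: str, max_sentences: int) -> list[str]:
    """Return the violated expectations; an empty list means the story passes."""
    problems = []
    normalized = story.replace("!", ".").replace("?", ".")
    sentences = [s for s in normalized.split(".") if s.strip()]
    if not sentences:
        problems.append("empty story")
    if len(sentences) > max_sentences:
        problems.append("too many sentences")
    if character.lower() not in story.lower():
        problems.append("character missing")
    return problems

def run_prompt_regression(generate: Callable[[str], str],
                          character: str = "Mira",
                          max_sentences: int = 3) -> list[str]:
    """Send the prompt through `generate` and check the properties we rely on."""
    prompt = build_story_prompt(character, max_sentences)
    return story_problems(generate(prompt), character, max_sentences)

# Stub models standing in for the FM before and after a silent update.
def healthy_model(prompt: str) -> str:
    return "Mira found a map. She followed it north. Mira reached the sea."

def drifted_model(prompt: str) -> str:
    return "Once upon a time. There was. A thing. It went. Somewhere else."

print(run_prompt_regression(healthy_model))
print(run_prompt_regression(drifted_model))
```

The checks assert properties of the output (length bound, presence of the requested character) instead of an exact string, so they tolerate ordinary variation while still flagging the kind of drift described above.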

Key Novelty

Canonical Mapping of Agent Testing Practices

Maps ad-hoc testing practices in open-source projects to a stable, canonical agent architecture (extended JaCaMo framework) to identify where testing effort is focused.
Identifies a 'Testing Inversion': Unlike traditional ML where the model is the focus, agent developers heavily test deterministic tools and parsers while neglecting the core FM-driven planning and prompting logic.
Catalogs specific adaptation strategies developers use to handle non-determinism, such as relaxing assertions (Membership Testing) rather than using strict equality.
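The difference between strict equality and a relaxed membership assertion can be sketched as follows. The sample answers are invented for illustration and do not come from the studied repositories.

```python
# Three phrasings an FM might plausibly return for the same question.
# These strings are made up for illustration.
ANSWERS = [
    "The capital of France is Paris.",
    "Paris",
    "It is Paris, of course.",
]

def passes_strict_equality(answer: str, expected: str = "Paris") -> bool:
    """Traditional assertion: the output must match the expected string exactly."""
    return answer == expected

def passes_membership(answer: str, expected: str = "Paris") -> bool:
    """Relaxed assertion: the expected token only has to appear in the output."""
    return expected in answer

strict_results = [passes_strict_equality(a) for a in ANSWERS]
membership_results = [passes_membership(a) for a in ANSWERS]
print(strict_results)      # [False, True, False]
print(membership_results)  # [True, True, True]
```

The membership check stays stable across phrasings, at the cost of accepting answers that contain the token in a wrong context.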

Evaluation Highlights

Analyzed 39 open-source agent frameworks and 439 agentic applications, identifying 10 distinct testing patterns.
Resource Artifacts (tools/parsers) consume 29.7% of testing effort in frameworks and 40.1% in applications, dominating the test suites.
The Trigger component (prompts) is critically under-tested, appearing in only ~1% of all test functions, representing a major blind spot for regression testing.

Breakthrough Assessment

8/10

First large-scale empirical baseline for agent testing. It exposes a critical gap in current development practices (the neglect of prompt testing) and provides a necessary taxonomy for future quality assurance research.

⚙️ Technical Details

Problem Definition

Setting: Empirical software engineering study analyzing the abstract syntax trees (ASTs) and testing patterns of open-source software repositories.

Inputs: Source code from 39 agent frameworks and 439 agentic applications hosted on GitHub.

Outputs: Taxonomy of 10 testing patterns and distribution metrics of testing effort across 13 canonical architectural components.

Pipeline Flow

Repository Selection (Filter GitHub for agent frameworks/apps)
Test Extraction (Identify test files and functions using AST parsing)
Pattern Analysis (Classify testing patterns using keywords and manual inspection)
Architecture Mapping (Map SUTs to canonical JaCaMo components)
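The Test Extraction step can be sketched with Python's standard `ast` module. This is a simplified illustration that assumes pytest-style naming (functions prefixed with `test_`); the summary does not state the paper's exact extraction rules, and `extract_test_functions` is a hypothetical helper.

```python
import ast

def extract_test_functions(source: str) -> list[str]:
    """Return the sorted names of functions and methods that look like tests."""
    tree = ast.parse(source)
    names = []
    for node in ast.walk(tree):
        is_function = isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
        if is_function and node.name.startswith("test_"):
            names.append(node.name)
    return sorted(names)

# The sample is only parsed, never executed, so its imports need not resolve.
SAMPLE = '''
import pytest

def helper():
    return 1

def test_parser_handles_empty_input():
    assert helper() == 1

class TestTool:
    def setup_method(self):
        pass

    def test_tool_returns_result(self):
        assert True

async def test_async_agent_step():
    assert True
'''

print(extract_test_functions(SAMPLE))
```

Because the source is parsed and not imported, this approach works on repositories whose dependencies are not installed.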

System Modules

Repository Selection

Identify relevant open-source projects

Architecture Mapper

Classify the Subject Under Test (SUT) into canonical components
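One way to picture the mapper is a keyword lookup from a test name to a canonical component. The sketch below is purely illustrative: the component names come from this summary, but the keyword lists and the `map_to_component` function are assumptions. The summary mentions manual inspection alongside keywords, which a lookup like this cannot replace.

```python
# Hypothetical keyword lists; only the component names come from the summary.
COMPONENT_KEYWORDS = {
    "Resource Artifacts": ("tool", "parser", "api"),
    "Coordination Artifacts": ("workflow", "graph", "router"),
    "Trigger": ("prompt", "template"),
}

def map_to_component(test_name: str) -> str:
    """Return the first component whose keywords appear in the test name."""
    name = test_name.lower()
    for component, keywords in COMPONENT_KEYWORDS.items():
        if any(keyword in name for keyword in keywords):
            return component
    return "Unclassified"  # left for manual inspection

for test_name in ("test_json_parser_handles_empty",
                  "test_workflow_retries",
                  "test_prompt_template_renders",
                  "test_memory_store"):
    print(test_name, "->", map_to_component(test_name))
```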

Novel Architectural Elements

Extension of the JaCaMo multi-agent conceptual architecture to include 'Registry' components, reflecting modern tool discovery standards (like MCP).

Reproducibility

The paper presents an empirical study. The dataset of 39 agent frameworks and 439 agentic applications is described as a contribution (

📊 Experiments & Results

Evaluation Setup

Static analysis and manual qualitative coding of software repositories.

Benchmarks:

Dataset of 39 Agent Frameworks (Source Code Analysis) [New]
Dataset of 439 Agentic Applications (Source Code Analysis) [New]

Metrics:

Percentage of test functions utilizing specific testing patterns
Percentage of test functions targeting specific architectural components
Statistical methodology: Descriptive statistics (frequencies and percentages). No hypothesis testing reported.
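Both metrics are simple shares of labeled test functions. A minimal sketch of the computation follows, using made-up labels that do not reproduce the paper's data.

```python
from collections import Counter

def percentage_by_label(labels: list[str]) -> dict[str, float]:
    """Return each label's share of all test functions, in percent."""
    total = len(labels)
    if total == 0:
        return {}
    counts = Counter(labels)
    return {label: round(100 * n / total, 1) for label, n in counts.items()}

# Toy labels for ten test functions (invented for illustration).
toy_labels = ["Resource Artifacts"] * 4 + ["Other"] * 5 + ["Trigger"]
shares = percentage_by_label(toy_labels)
print(shares)
```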

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Distribution of testing effort across architectural components shows a heavy bias toward deterministic parts.
Agent Frameworks	Tests targeting Resource Artifacts (%)	5.0	29.7	+24.7
Agentic Applications	Tests targeting Resource Artifacts (%)	5.0	40.1	+35.1
Combined Dataset	Tests targeting the Trigger component (%)	99.0	1.0	-98.0
Adoption of testing patterns highlights reliance on traditional methods over AI-specific ones.
Combined Dataset	Adoption Rate (%)	Not reported in the paper	1.0	Not reported in the paper

Main Takeaways

Inversion of Testing Effort: Developers focus on testing the tools (Resource Artifacts) and workflows (Coordination Artifacts) rather than the AI brain itself, likely due to the difficulty of testing non-deterministic outputs.
The 'Trigger' (Prompt) Blind Spot: Despite prompts being critical to agent performance, they are rarely tested (1%), posing significant risks for regression during model updates.
Preference for Traditional Patterns: Instead of adopting AI-specific evaluators (like DeepEval), developers adapt traditional software testing patterns (Mocking, Membership Testing, Negative Testing) to cope with uncertainty.
Need for Standardization: The ecosystem lacks standardized testing methodologies, leading to ad-hoc, fragile test suites.
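Negative Testing, one of the traditional patterns named above, checks that invalid input is rejected instead of checking a specific output. A minimal sketch follows, assuming a hypothetical `parse_tool_call` helper that validates the JSON an FM emits when it requests a tool.

```python
import json

def parse_tool_call(raw: str) -> dict:
    """Parse a tool call emitted by the FM; raise ValueError if it is unusable."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError("tool call is not valid JSON") from exc
    if not isinstance(call, dict) or "name" not in call:
        raise ValueError("tool call must be an object with a 'name' field")
    return call

def rejects(raw: str) -> bool:
    """Return True when the parser refuses the input."""
    try:
        parse_tool_call(raw)
    except ValueError:
        return True
    return False

def test_parser_rejects_malformed_tool_calls():
    # Negative cases: free text, a missing field, and the wrong JSON type.
    for bad_input in ("not json", '{"args": {}}', "[1, 2]"):
        assert rejects(bad_input)

def test_parser_accepts_well_formed_tool_call():
    assert parse_tool_call('{"name": "search"}') == {"name": "search"}

test_parser_rejects_malformed_tool_calls()
test_parser_accepts_well_formed_tool_call()
```

The parser is deterministic, so these tests stay stable no matter what the FM produces, which is consistent with the finding that deterministic components attract most of the testing effort.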

📚 Prerequisite Knowledge

Prerequisites

Software Testing Fundamentals (Unit testing, AAA pattern, Mocks)
Agent Architecture concepts (Tools/Function calling, Planning, Memory)
Foundation Models (LLMs) and their non-deterministic nature

Key Terms

Foundation Model (FM): A large-scale machine learning model (like GPT-4) trained on vast data that serves as the 'brain' for AI agents.

Agentic Application: Software that uses an FM to perceive, reason, and act to achieve goals, often using tools and memory.

SUT (Subject Under Test): The specific part of the software (function, class, module) being verified by a test.

AAA Pattern: Arrange-Act-Assert, a standard structure for unit tests: set up state, execute the function under test, and verify the result.
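A short illustration of the three phases, using a hypothetical bounded `ConversationMemory` class that is not taken from any of the studied projects:

```python
from collections import deque

class ConversationMemory:
    """Hypothetical memory that keeps only the most recent messages."""

    def __init__(self, max_messages: int):
        self._messages = deque(maxlen=max_messages)

    def add(self, message: str) -> None:
        self._messages.append(message)

    def recall(self) -> list[str]:
        return list(self._messages)

def test_memory_keeps_only_recent_messages():
    # Arrange: a memory that holds at most two messages.
    memory = ConversationMemory(max_messages=2)
    # Act: add three messages, so the oldest one should be evicted.
    for message in ("hello", "what is 2 + 2?", "thanks"):
        memory.add(message)
    # Assert: only the two most recent messages remain, in order.
    assert memory.recall() == ["what is 2 + 2?", "thanks"]

test_memory_keeps_only_recent_messages()
```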

JaCaMo: A classic conceptual framework for multi-agent systems describing agents, environments, and organizations; used here as a reference architecture.

Non-determinism: The property of FMs where the same input may produce different outputs, complicating traditional equality-based testing.

Resource Artifacts: Deterministic tools or APIs the agent uses, such as a calculator or database connection.

Trigger: The component responsible for initiating an agent's plan, typically the prompt sent to the FM.

DeepEval: A specialized testing framework designed to evaluate LLM outputs using metrics like hallucination or faithfulness scores.

Mock Assertion: A testing pattern where the test verifies that a dependency was called (e.g., 'tool was invoked') rather than checking the actual output.
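A minimal sketch of a mock assertion with the standard library's `unittest.mock`. The `run_agent_step` function is a hypothetical stand-in for an agent that delegates to a tool.

```python
from unittest.mock import Mock

def run_agent_step(question: str, search_tool) -> str:
    """Hypothetical agent step that always delegates to the search tool."""
    result = search_tool(question)
    return f"Answer based on: {result}"

def test_agent_invokes_search_tool():
    # Replace the real tool with a mock that returns a canned value.
    search_tool = Mock(return_value="stub result")
    run_agent_step("weather in Oslo", search_tool)
    # Verify the interaction, not the content of the answer.
    search_tool.assert_called_once_with("weather in Oslo")

test_agent_invokes_search_tool()
```

The test passes regardless of how the answer is worded, which makes it robust to non-determinism but blind to whether the final output is actually correct.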