
An Empirical Study of Testing Practices in Open Source AI Agent Frameworks and Agentic Applications

M. M. Hasan, Hao Li, Emad Fallahzadeh, Gopi Krishnan Rajbahadur, Bram Adams, Ahmed E. Hassan
Queen's University, Concordia University
arXiv.org (2025)
Agent Benchmark

📝 Paper Summary

Quality Assurance for AI Agents · Software Testing · Agent Framework Analysis
This empirical study reveals that developers of AI agent frameworks prioritize testing deterministic infrastructure like tools over the non-deterministic Foundation Model components, leaving prompts and planning logic largely unverified.
Core Problem
Testing Foundation Model (FM)-based agents is difficult because their outputs are inherently non-deterministic and hard to reproduce, yet little is known about how developers actually verify internal correctness beyond high-level benchmarks.
Why it matters:
  • Agents deployed in real-world scenarios face edge cases, infinite loops, and hallucinations that standard benchmarks (like AgentBench) fail to detect.
  • Model upgrades can silently degrade performance (e.g., prompt drift), and without robust regression testing the degradation goes unnoticed.
  • The rapid evolution of agent components (tools, memory, planning) creates a complex architecture where failure modes are poorly understood.
Concrete Example: A developer builds a storytelling agent. Initially, it works fine. Later, a silent update to the underlying FM alters how it interprets prompts, causing the stories to become incoherent. Because the developer only used high-level benchmarks and no specific unit tests for the prompt (Trigger component), this degradation goes undetected until user complaints arrive.
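The kind of prompt-level regression test missing from this scenario could look roughly like the sketch below. It is an illustration only, not taken from the paper: `build_story_prompt`, `story_checks`, `run_prompt_regression`, and both stub models are hypothetical names, and the two properties checked (sentence count, topic mention) are assumed stand-ins for whatever "coherent story" would mean in a real project. In practice the stub would be replaced by a call to the actual FM client.

```python
from typing import Callable, List


def build_story_prompt(topic: str) -> str:
    # Hypothetical prompt (the "Trigger" under test).
    return f"Write a three-sentence story about {topic}."


def story_checks(story: str, topic: str) -> List[str]:
    """Return the names of the properties the story fails to satisfy."""
    failures = []
    # Treat '.', '!' and '?' as sentence terminators and count non-empty pieces.
    normalized = story.replace("!", ".").replace("?", ".")
    sentences = [s for s in normalized.split(".") if s.strip()]
    if len(sentences) != 3:
        failures.append("sentence_count")
    if topic.lower() not in story.lower():
        failures.append("mentions_topic")
    return failures


def run_prompt_regression(model: Callable[[str], str], topic: str) -> List[str]:
    # Send the prompt to the model and check properties of the output,
    # rather than comparing against one exact expected string.
    return story_checks(model(build_story_prompt(topic)), topic)


def stub_model_ok(prompt: str) -> str:
    # Stand-in for the FM before the silent update (assumed behavior).
    return "A dragon woke up. The dragon flew over the village. Everyone cheered."


def stub_model_drifted(prompt: str) -> str:
    # Stand-in for the FM after the update changed how it reads the prompt.
    return "Sure! Here is some text"


if __name__ == "__main__":
    print(run_prompt_regression(stub_model_ok, "dragon"))
    print(run_prompt_regression(stub_model_drifted, "dragon"))
```

Run on every model upgrade, a check like this would flag the drift before users report it. Property checks are used instead of exact-match comparison because the same prompt can legitimately produce different stories.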
Key Novelty
Canonical Mapping of Agent Testing Practices
  • Maps ad-hoc testing practices in open-source projects to a stable, canonical agent architecture (extended JaCaMo framework) to identify where testing effort is focused.
  • Identifies a 'Testing Inversion': Unlike traditional ML where the model is the focus, agent developers heavily test deterministic tools and parsers while neglecting the core FM-driven planning and prompting logic.
  • Catalogs specific adaptation strategies developers use to handle non-determinism, such as relaxing assertions (Membership Testing) rather than using strict equality.
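The relaxed-assertion strategy in the last bullet can be sketched as follows. This is a minimal illustration under stated assumptions, not code from the studied projects: `fake_sentiment_agent` is a hypothetical stand-in for an FM-backed agent whose wording varies between runs, and `ALLOWED_LABELS`, `normalize`, `check_membership`, and `check_strict` are made-up names.

```python
ALLOWED_LABELS = {"positive", "negative", "neutral"}


def fake_sentiment_agent(text: str, seed: int) -> str:
    """Hypothetical agent: same meaning, but surface form varies with the seed."""
    variants = ["Positive", "positive.", " POSITIVE "]
    return variants[seed % len(variants)]


def normalize(raw: str) -> str:
    # Remove surrounding whitespace and trailing periods, then lowercase.
    return raw.strip().strip(".").lower()


def check_membership(raw: str) -> bool:
    # Relaxed assertion: the normalized answer must be one of the allowed labels.
    return normalize(raw) in ALLOWED_LABELS


def check_strict(raw: str) -> bool:
    # Strict equality: brittle when the output is non-deterministic.
    return raw == "positive"


if __name__ == "__main__":
    for seed in range(3):
        out = fake_sentiment_agent("I loved it", seed)
        print(repr(out), check_membership(out), check_strict(out))
```

The membership check tolerates harmless variation in casing and punctuation while still rejecting answers outside the expected set. The strict check fails on all three variants even though each one carries the right label.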
Evaluation Highlights
  • Analyzed 39 open-source agent frameworks and 439 agentic applications, identifying 10 distinct testing patterns.
  • Resource Artifacts (tools/parsers) consume 29.7% of testing effort in frameworks and 40.1% in applications, dominating the test suites.
  • The Trigger component (prompts) is critically under-tested, appearing in only ~1% of all test functions, representing a major blind spot for regression testing.
Breakthrough Assessment
8/10
First large-scale empirical baseline for agent testing. It exposes a critical gap in current development practices (the neglect of prompt testing) and provides a necessary taxonomy for future quality assurance research.