Test-Driven AI Agent Definition (TDAD): Compiling Tool-Using Agents from Behavioral Specifications

📝 Paper Summary

Agentic AI, Prompt Optimization, Behavioral Testing

TDAD treats agent prompts as compiled artifacts by converting natural language specifications into executable tests, then using coding agents to iteratively refine prompts until tests pass, while preventing gaming via hidden test splits and mutation testing.

Core Problem

Deploying tool-using LLM agents is risky because manual prompt engineering cannot verify behavior across all edge cases, leading to silent regressions and policy violations.

Why it matters:

Small prompt changes often cause silent regressions where fixing one issue breaks another (stability)
Teams cannot verify agent compliance with policies (e.g., PII leakage) across all scenarios before deployment (confidence)
Current evaluation workflows are disconnected from standard engineering CI/CD pipelines (integration)

Concrete Example: A product team hands an engineer a spec for a refund agent. The engineer manually tweaks the prompt to handle a 'happy path' refund but inadvertently breaks the logic for refusing refunds on non-refundable items, or introduces a PII leak when the user asks directly.

Key Novelty

Test-Driven Agent Compilation Pipeline

Formalizes agent development as a compilation process: Product Spec → Tests → Compiled Prompt (Agent)
Separates roles into specialized coding agents: TestSmith (writes tests from spec), PromptSmith (iterates prompt until tests pass), and MutationSmith (checks if tests catch bugs)
Mitigates 'specification gaming' (over-optimizing for specific tests) using hidden test splits and semantic mutation testing (generating faulty prompts to see if tests catch them)

Architecture

The TDAD pipeline roles and data flow

Evaluation Highlights

Achieved 92% compilation success rate for v1 specs and 58% for evolved v2 specs across 24 independent trials on SpecSuite-Core
Successful compilations maintained a 97% mean hidden pass rate (HPR) on v1, indicating strong generalization beyond the tests used for optimization
Demonstrated high regression safety with 97% Spec Update Regression Score (SURS) when evolving agents from v1 to v2 requirements

Breakthrough Assessment

8/10

Strong engineering methodology contribution. Applies established software engineering practices (TDD, mutation testing) to stochastic agent behavior, addressing a critical reliability gap in production deployments.

⚙️ Technical Details

Problem Definition

Setting: Compiling a natural language product specification S into a prompt P such that P satisfies behavioral tests derived from S

Inputs: Product specification (YAML) containing tools, policies, decision tree, response contract, and test guidance

Outputs: Compiled agent artifact (system prompt + tool configuration)
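To make the input format concrete, here is a minimal sketch of the five spec sections listed above, written as a Python dictionary that mirrors a YAML layout. Every key, tool name, and policy string (`lookup_order`, `issue_refund`, `required_fields`, and so on) is a hypothetical illustration, not the paper's actual schema.

```python
# Hypothetical spec layout: keys mirror the five sections named above;
# every tool name, policy, and field below is illustrative only.
REQUIRED_SECTIONS = ("tools", "policies", "decision_tree",
                     "response_contract", "test_guidance")

spec = {
    "tools": [
        {"name": "lookup_order", "args": ["order_id"]},
        {"name": "issue_refund", "args": ["order_id"]},
    ],
    "policies": [
        "Never reveal customer PII.",
        "Refuse refunds on non-refundable items.",
    ],
    "decision_tree": {
        "refund_request": {
            "refundable": "issue_refund",
            "non_refundable": "refuse",
        },
    },
    "response_contract": {"required_fields": ["decision", "message"]},
    "test_guidance": ["Paraphrase refund requests.", "Ask for PII directly."],
}


def missing_sections(candidate: dict) -> list:
    """Return the required top-level sections absent from a spec."""
    return [name for name in REQUIRED_SECTIONS if name not in candidate]


def leaf_actions(tree: dict) -> list:
    """Collect the leaf actions of a nested decision tree."""
    leaves = []
    for value in tree.values():
        leaves.extend(leaf_actions(value) if isinstance(value, dict) else [value])
    return leaves
```

Enumerating leaves this way is one plausible starting point for test generation, since each leaf of the decision tree is a behavior that needs at least one test.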

Pipeline Flow

Product Spec (YAML) → TestSmith → Visible + Hidden Tests
PromptSmith → Iterative Compilation Loop (Run Visible Tests → Refine Prompt) → Compiled Agent
MutationSmith → Generate Mutants → Harness checks if Visible Tests fail → Mutation Score
Spec Evolution → Re-compile for v2 → Check Regression against v1 Hidden Tests
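The compilation loop in the flow above can be sketched as follows. This is a toy illustration under strong assumptions: tests are plain functions over the prompt text rather than runs of a live agent, and `toy_refine` stands in for PromptSmith, which in the paper is a coding agent. The names `compile_prompt`, `run_tests`, `RULES`, and `max_iters` are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Test = Callable[[str], bool]  # takes a prompt, returns pass/fail


def run_tests(prompt: str, tests: Dict[str, Test]) -> List[str]:
    """Return the names of the tests that fail for this prompt."""
    return [name for name, test in tests.items() if not test(prompt)]


def compile_prompt(seed: str, visible: Dict[str, Test],
                   refine: Callable[[str, List[str]], str],
                   max_iters: int = 10) -> Tuple[str, bool]:
    """Refine the prompt until all visible tests pass or the budget runs out."""
    prompt = seed
    for _ in range(max_iters):
        failures = run_tests(prompt, visible)
        if not failures:
            return prompt, True
        prompt = refine(prompt, failures)  # only visible failures are shown
    return prompt, not run_tests(prompt, visible)


# Toy visible and hidden splits (assumed): each test looks for a rule phrase.
visible = {
    "verify_identity": lambda p: "verify identity" in p,
    "refuse_nonrefundable": lambda p: "non-refundable" in p,
}
hidden = {"verify_identity_paraphrase": lambda p: "verify identity" in p}

RULES = {
    "verify_identity": "Always verify identity first.",
    "refuse_nonrefundable": "Refuse refunds on non-refundable items.",
}


def toy_refine(prompt: str, failures: List[str]) -> str:
    """Stand-in for PromptSmith: append one rule per failing test."""
    return prompt + " " + " ".join(RULES[name] for name in failures)


compiled, ok = compile_prompt("You are a refund agent.", visible, toy_refine)
hidden_failures = run_tests(compiled, hidden)  # evaluated once, after compiling
```

The point of the design is that `refine` never receives the hidden tests, so the hidden split measures generalization rather than fit to the optimization target.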

System Modules

TestSmith (Pipeline Agent)

Generates MFT (minimum functionality), INV (invariance), and DIR (directional expectation) tests and deterministic fixtures from the spec YAML

Model or implementation: Claude Code (via Docker)

PromptSmith (Pipeline Agent)

Compiles the prompt by iterating against visible tests

Model or implementation: Claude Code (via Docker)

MutationSmith

Generates semantic mutations of the compiled prompt to validate test suite quality

Model or implementation: Claude Code (via Docker)

Built Agent

Executes the compiled prompt to interact with users

Model or implementation: Claude Agent SDK

Novel Architectural Elements

Separation of concerns into adversarial/cooperative agent roles (TestSmith vs. PromptSmith vs. MutationSmith) to prevent specification gaming
Integration of semantic mutation testing where mutants are dynamically synthesized prompts rather than code edits
Two-loop compilation strategy (Full Loop vs. Focused Inner Loop) to optimize API costs and iteration time

Modeling

Base Model: Claude Code for the pipeline roles; the model under test varies and runs via the Claude Agent SDK

Compute: Single spec version compilation takes 30-60 minutes wall-clock. Full v1+v2 pipeline takes 1-2 hours.

Comparison to Prior Work

vs. DSPy: Optimizes against behavioral decision trees/policies defined in natural language rather than code-level signatures and task accuracy metrics
vs. TextGrad/APE: Includes anti-gaming mechanisms (hidden tests, mutation testing) and focuses on compliance/policy adherence rather than just task performance
vs. SWE-bench: Focuses on the development workflow (PRD → Agent) and regression safety, not just final agent performance evaluation

Limitations

Compilation costs time and tokens (30-60 mins per spec version)
Success depends on the quality of the initial specification and test guidance
Requires deterministic fixtures and reliable tool mocks, which can be complex to engineer for some domains

Reproducibility

Code: https://github.com/f-labs-io/tdad-paper-code

Publicly available. Includes the SpecSuite-Core benchmark (4 agents), harness, and Docker infrastructure. The SupportOps spec is fully worked, with generated tests. Requires the Claude Agent SDK.

📊 Experiments & Results

Evaluation Setup

Compilation and evaluation of agents using the SpecSuite-Core benchmark

Benchmarks:

SpecSuite-Core (Agent Compilation Benchmark) [New]

Metrics:

Compilation Success (Pass all visible tests)
VPR (Visible Pass Rate)
HPR (Hidden Pass Rate)
MS (Mutation Score)
SURS (Spec Update Regression Score)
Statistical methodology: Not explicitly reported in the paper
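As a rough guide to how these metrics relate, the sketch below computes them from per-test pass/fail results. The trial data is invented for illustration, and the function names are hypothetical. Mean HPR is averaged over successful compilations only, matching how the summary reports it.

```python
from statistics import mean


def pass_rate(results):
    """Fraction of boolean test results that passed."""
    return sum(results) / len(results)


def summarize(trials):
    """Aggregate compilation success, mean VPR, and mean HPR over trials."""
    # A trial counts as compiled only if every visible test passes.
    compiled = [t for t in trials if all(t["visible"])]
    return {
        "compilation_success_rate": len(compiled) / len(trials),
        "mean_vpr": mean(pass_rate(t["visible"]) for t in trials),
        # HPR is averaged over successful compilations only.
        "mean_hpr": mean(pass_rate(t["hidden"]) for t in compiled),
    }


# Invented results for three trials with four visible and four hidden tests.
trials = [
    {"visible": [True, True, True, True], "hidden": [True, True, True, False]},
    {"visible": [True, True, True, True], "hidden": [True, True, True, True]},
    {"visible": [True, False, True, True], "hidden": [True, False, True, True]},
]
summary = summarize(trials)

# SURS: fraction of v1 invariant tests still passing after the v2 compile.
surs = pass_rate([True, True, True, False])
```

With this toy data, two of three trials compile, the mean HPR over those two is 0.875, and SURS is 0.75.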

Key Results

Compilation and generalization performance on SpecSuite-Core benchmark across 24 independent trials.
Benchmark	Metric	Baseline	This Paper	Δ
SpecSuite-Core (v1)	Compilation Success Rate	Not reported in the paper	92%	Not reported in the paper
SpecSuite-Core (v1)	Mean HPR (Hidden Pass Rate)	Not reported in the paper	97%	Not reported in the paper
SpecSuite-Core (v2)	Compilation Success Rate	Not reported in the paper	58%	Not reported in the paper
SpecSuite-Core (v2)	Mean HPR (Hidden Pass Rate)	Not reported in the paper	78%	Not reported in the paper
SpecSuite-Core (v1 → v2)	Mean SURS	Not reported in the paper	97%	Not reported in the paper
SpecSuite-Core	Mutation Score	Not reported in the paper	86-100%	Not reported in the paper

Main Takeaways

High v1 compilation success (92%) indicates that automating prompt engineering from specs is feasible.
Hidden Pass Rates of 97% (v1) and 78% (v2) suggest that agents compiled against visible tests mostly learn generalized behaviors rather than 'game' the spec, with weaker generalization on the evolved v2 specs.
Mutation scores (86-100%) indicate that the generated tests are rigorous enough to catch plausible semantic faults (e.g., skipping auth, leaking PII).
Regression safety is high (97% SURS), showing that test-driven evolution protects existing functionality even when prompts are re-optimized for new requirements.

📚 Prerequisite Knowledge

Prerequisites

Understanding of LLM agents and tool use (function calling)
Familiarity with software testing concepts (TDD, regression testing, mutation testing)
Basic knowledge of prompt engineering

Key Terms

TestSmith: Coding agent role responsible for converting the product specification into executable tests (visible and hidden)

PromptSmith: Coding agent role responsible for iteratively refining the agent prompt until the visible tests pass

MutationSmith: Coding agent role responsible for generating plausible faulty prompt variants to evaluate the strength of the test suite

MFT: Minimum Functionality Test—checks the basic required action for a specific leaf node in the decision tree

INV: Invariance Test—checks that the agent's behavior remains consistent when user inputs are paraphrased

DIR: Directional Expectation Test—checks that changing a specific input condition (e.g., changing an order value) changes the output as expected

canary values: Unique identifiers (e.g., specific fake SSNs) embedded in mock data that indicate a security failure if they appear in the agent's output

HPR: Hidden Pass Rate—the fraction of held-out tests (not seen by PromptSmith) that the compiled agent passes; measures generalization

SURS: Spec Update Regression Score—fraction of v1 invariant tests that still pass after the agent is compiled for v2 requirements

activation probe: A targeted test case used by MutationSmith to verify that a generated mutant prompt actually exhibits the intended faulty behavior before running the full test suite
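The three test types defined above can be illustrated against a toy agent. This is a sketch only: `toy_agent` is a deterministic rule-based stand-in for a compiled agent, and the refund scenario, the 100 threshold, and the action labels are hypothetical.

```python
def toy_agent(message: str, order_value: float) -> str:
    """Hypothetical rule-based stand-in for a compiled refund agent."""
    text = message.lower()
    wants_refund = any(phrase in text for phrase in ("refund", "money back"))
    if not wants_refund:
        return "HELP"
    # Assumed policy: large orders are escalated instead of refunded.
    return "ESCALATE" if order_value > 100 else "REFUND"


def test_mft() -> bool:
    """MFT: the basic required action for one decision-tree leaf."""
    return toy_agent("I want a refund", 20.0) == "REFUND"


def test_inv() -> bool:
    """INV: a paraphrased request must produce the same decision."""
    original = toy_agent("I want a refund", 20.0)
    paraphrase = toy_agent("Can I get my money back?", 20.0)
    return original == paraphrase


def test_dir() -> bool:
    """DIR: raising the order value must change the decision as expected."""
    low = toy_agent("I want a refund", 20.0)
    high = toy_agent("I want a refund", 500.0)
    return low == "REFUND" and high == "ESCALATE"
```

MFT pins a single behavior, INV guards against sensitivity to surface wording, and DIR checks that the agent responds to the condition the policy actually depends on.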