TestForge: Feedback-Driven, Agentic Test Suite Generation

📝 Paper Summary

Software Testing, Agentic AI

TestGenAgent iteratively refines zero-shot generated test suites using execution logs and coverage reports to produce high-quality, readable unit tests at low cost.

Core Problem

Existing automated testing tools either produce unreadable tests (search-based) or suffer from hallucinations and missing context (LLM-based), while current agentic approaches are prohibitively expensive.

Why it matters:

Manual test creation is costly and often neglected, leading to buggy software
Search-based tools like EvoSuite generate code that developers find difficult to maintain or debug
Prior agentic methods operate at the method level, costing over $2.00 per file, which prevents scaling to large repositories

Concrete Example: In `pydata/xarray`, GPT-4o generates a test that fails with a timestamp-related `AttributeError`. TestGenAgent reads the error log, fixes the failing test, sees in the coverage report that the `_round` method is uncovered, and adds a targeted test for it.

Key Novelty

Iterative Feedback-Driven Refinement at File Level

Starts with a zero-shot LLM-generated test suite rather than starting from scratch, using it as a high-quality template
Feeds full coverage reports and execution error logs back to the agent, allowing it to plan fixes and target specific uncovered lines
Operates at the file level (testing all methods in a file at once) rather than per-method, significantly reducing token costs

Architecture

Figure: The workflow of TestGenAgent, illustrating the cycle between agent actions and the execution environment.

Evaluation Highlights

Achieves an 84.3% Pass@1 rate on the TestGenEval benchmark, which the summary describes as a new record for automated test generation
Improves mutation score (ability to catch bugs) by 15.4 percentage points over the one-iteration LLM baseline
Reduces cost to $0.63 per file, substantially cheaper than prior agentic methods that cost over $2.00 per file

Breakthrough Assessment

8/10

Significantly improves practical usability of automated testing by combining high coverage with readability and low cost. The shift to file-level processing and iterative refinement addresses key scalability bottlenecks.

⚙️ Technical Details

Problem Definition

Setting: Automated unit test generation for Python files in existing repositories

Inputs: Source code file under test

Outputs: Executable unit test suite (Python file) achieving high coverage and correctness

Pipeline Flow

Initialization: Zero-Shot Generator → Initial Test Suite
Agent Loop: Agent Core → Tool Executor → Environment Feedback → Agent Core (Repeat)

System Modules

Zero-Shot Generator

Generate the initial draft of the test suite to serve as a starting template

Model or implementation: GPT-4o

Agent Core (Agent Loop)

Analyze feedback, reflect on failures/coverage gaps, and plan next actions

Model or implementation: GPT-4o (via LiteLLM)

Tool Executor (Agent Loop)

Execute the specific actions decided by the Agent Core

Model or implementation: Python Tool Functions

Environment Feedback (Agent Loop)

Run the tests and analysis tools to provide dynamic feedback

Model or implementation: Dockerized Execution Environment

Novel Architectural Elements

Integration of full coverage reports (missing lines) directly into the agent's observation space to guide multi-test generation
Strict separation of 'Drafting' (Zero-shot) and 'Refining' (Agentic loop) phases to balance quality and cost
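
One way to picture the first element is a helper that turns raw feedback into the text the agent observes. The sketch below is an assumption about format only: `group_ranges`, `build_observation`, the section labels, and the log truncation limit are hypothetical and not taken from the paper.

```python
from typing import Iterable, List, Tuple


def group_ranges(lines: Iterable[int]) -> List[Tuple[int, int]]:
    """Collapse line numbers into inclusive (start, end) runs."""
    ranges: List[Tuple[int, int]] = []
    for n in sorted(set(lines)):
        if ranges and n == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], n)  # extend the current run
        else:
            ranges.append((n, n))            # start a new run
    return ranges


def build_observation(
    error_log: str,
    missing_lines: Iterable[int],
    max_log_chars: int = 2000,  # assumed cap to keep the prompt small
) -> str:
    """Combine the tail of the error log with a compact list of uncovered lines."""
    spans = ", ".join(
        str(a) if a == b else f"{a}-{b}" for a, b in group_ranges(missing_lines)
    )
    parts = []
    if error_log:
        parts.append("Test errors:\n" + error_log[-max_log_chars:])
    parts.append("Uncovered lines: " + (spans or "none"))
    return "\n\n".join(parts)
```

Listing every missing line at once, rather than one target per prompt, is what would let a single agent turn add several tests together.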

Modeling

Base Model: GPT-4o

Compute: Not reported in the paper (Inference-only approach, no training)

Comparison to Prior Work

vs. Pynguin: Produces human-readable code and fixes compilation errors automatically
vs. CoverUp: Operates at the file level (lowering cost) and refines a zero-shot template rather than generating from scratch per method
vs. HITS: Scales better to long contexts by avoiding heavy slicing overhead and processing entire files at once

Limitations

Relies on the availability and cost of the underlying LLM (GPT-4o)
Success depends on the quality of the initial zero-shot draft
Evaluation is limited to Python projects in the TestGenEval benchmark

Reproducibility

Code: https://anonymous.4open.science/r/OpenHands-7E28/

📊 Experiments & Results

Evaluation Setup

Unit test generation for real-world open source Python repositories

Benchmarks:

TestGenEval (Unit Test Generation)

Metrics:

Pass@1 Rate
Line Coverage
Mutation Score
Cost per file
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
TestGenEval	Mutation Score (%)	18.4	33.8	+15.4
TestGenEval	Pass@1 (%)	25.0	84.3	+59.3
TestGenEval	Cost per file ($)	>2.00	0.63	< -1.37
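
The Δ column can be re-derived from the reported values. The short check below recomputes each difference and, as an added illustration, the relative cost reduction. It treats the baseline cost as exactly $2.00 even though the source only says prior methods cost over $2.00, so the computed reduction is a lower bound.

```python
# (baseline, this paper) pairs copied from the Key Results table.
# The cost baseline is reported as "over $2.00"; 2.00 is used as a lower bound.
results = {
    "Mutation Score": (18.4, 33.8),
    "Pass@1": (25.0, 84.3),
    "Cost per file ($)": (2.00, 0.63),
}


def delta(baseline: float, ours: float) -> float:
    """Absolute difference, rounded to 2 decimals to hide float noise."""
    return round(ours - baseline, 2)


deltas = {name: delta(b, o) for name, (b, o) in results.items()}

cost_baseline, cost_ours = results["Cost per file ($)"]
cost_reduction_pct = round(100 * (1 - cost_ours / cost_baseline), 1)

for name, d in deltas.items():
    print(f"{name}: {d:+}")
print(f"Cost reduction: at least {cost_reduction_pct}%")
```

The first two deltas are percentage points, not relative improvements; the cost figure is the only one expressed here as a relative change.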

Main Takeaways

TestGenAgent significantly outperforms classical search-based techniques (Pynguin) in terms of test readability and correctness.
The agentic feedback loop allows the system to fix bugs in generated tests (increasing Pass@1) and target uncovered code (increasing Coverage/Mutation Score).
Processing at the file level is far more cost-effective ($0.63/file) than method-level agentic approaches (over $2.00/file) while maintaining high quality.

📚 Prerequisite Knowledge

Prerequisites

Software Testing Concepts (Unit tests, Assertions)
Large Language Models (Prompting, Agents)
Dynamic Analysis (Coverage, Mutation testing)

Key Terms

Pass@1: The percentage of generated test suites that execute successfully without errors in a single attempt; for an agentic system, the attempt evaluated is the final suite produced after refinement

Mutation Score: A metric measuring test quality by checking if tests fail when artificial bugs (mutations) are injected into the code; higher is better
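
To make the definition concrete, here is a toy calculation with hand-written mutants of a small function. Everything in it is illustrative: real mutation tools generate mutants automatically, and `original_abs`, the four mutants, and the two suites are hypothetical examples, not material from the paper.

```python
from typing import Callable, List

IntFunc = Callable[[int], int]


def original_abs(x: int) -> int:
    return x if x >= 0 else -x


# Hand-written mutants, each injecting one small bug.
def mutant_flip_compare(x: int) -> int:
    return x if x <= 0 else -x      # comparison flipped


def mutant_drop_negate(x: int) -> int:
    return x if x >= 0 else x       # negation removed


def mutant_off_by_one(x: int) -> int:
    return x if x >= 0 else -x + 1  # wrong result for negatives


def mutant_zero_case(x: int) -> int:
    return abs(x) if x != 0 else 1  # wrong only for x == 0


MUTANTS: List[IntFunc] = [
    mutant_flip_compare, mutant_drop_negate, mutant_off_by_one, mutant_zero_case,
]


def weak_suite(f: IntFunc) -> bool:
    """Returns True when all assertions hold for implementation f."""
    return f(3) == 3


def strong_suite(f: IntFunc) -> bool:
    return f(3) == 3 and f(-4) == 4


def mutation_score(suite: Callable[[IntFunc], bool], mutants: List[IntFunc]) -> float:
    """Percentage of mutants for which the suite fails (i.e. mutants killed)."""
    killed = sum(1 for m in mutants if not suite(m))
    return 100.0 * killed / len(mutants)
```

Both suites pass on `original_abs`, but the weak suite kills only one of the four mutants while the strong suite kills three; the zero-case mutant survives both because neither suite checks `x == 0`, which is the kind of gap a mutation score exposes even when line coverage is complete.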

Line Coverage: The percentage of executable code lines that are executed during the test run

Zero-shot prompting: Asking an LLM to perform a task without providing any specific examples of that task in the prompt

Hallucination: When an LLM generates code or facts that look plausible but are incorrect or reference non-existent variables/methods

Agentic AI: Systems where an LLM acts as a reasoning engine to autonomously plan and execute a sequence of actions (using tools) to achieve a goal

LiteLLM: A library that provides a unified interface to call various LLM APIs (like OpenAI, Anthropic)

OpenHands: An open-source platform for developing and evaluating software engineering agents (formerly OpenDevin)