Automated Unit Test Improvement using Large Language Models at Meta

📝 Paper Summary

LLM-based Software Engineering, Automated Test Generation

TestGen-LLM automatically improves existing human-written unit tests at Meta by generating new test cases that are verified via strict filtration to guarantee compilation, reliability, and increased code coverage.

Core Problem

Automated test generation in large industrial codebases faces challenges regarding trust, hallucination, and regression, often producing flaky or duplicate tests that waste engineering resources.

Why it matters:

LLM hallucinations make generated code unreliable for production without rigorous verification
Industrial scale (millions of lines of code) requires automated solutions that integrate into existing workflows without increasing maintenance burden
Regressions and flaky tests disrupt Continuous Integration (CI) systems, costing significant developer time

Concrete Example: An LLM might generate a test case that calls a non-existent method (hallucination) or that passes on some runs and fails on others (flakiness). Without filtration, such code could break the build or fail intermittently in CI. TestGen-LLM filters these candidates out, keeping only tests that build, pass consistently, and cover previously missed lines (e.g., a specific 'early return' statement).

Key Novelty

Assured Offline LLM-Based Software Engineering (Assured Offline LLMSE)

Treats LLM output not as final code but as candidate suggestions that must pass a rigorous, automated filtration pipeline before being recommended to humans
Guarantees improvement by discarding any test that does not measurably increase coverage over the existing suite
Guarantees non-regression by only adding new test cases to existing classes, never modifying or deleting existing stable tests

Architecture

Top-level architecture of TestGen-LLM as an Assured Offline LLMSE system

Evaluation Highlights

73% of TestGen-LLM's recommendations were accepted by Meta engineers for production deployment during test-a-thons
25% of generated test cases increased coverage, after 75% built correctly and 57% passed reliably, in an evaluation on Instagram's Reels and Stories
Improved 11.5% of all classes to which it was applied during Meta's Instagram and Facebook test-a-thons

Breakthrough Assessment

8/10

High score due to the unprecedented scale of industrial deployment and the high acceptance rate (73%) of LLM-generated code, proving the viability of 'Assured LLMSE' in production environments.

⚙️ Technical Details

Problem Definition

Setting: Extending existing Kotlin unit test classes to increase code coverage of the corresponding class under test

Inputs: Existing human-written Kotlin test class (and optionally the class under test)

Outputs: Extended test class containing additional valid test cases that increase coverage

Pipeline Flow

LLM Generation (Multiple candidates) → Build Filter → Execution Filter → Flakiness Filter → Coverage Filter → Post-Processing
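
The flow above can be read as a chain of predicates in which a candidate is dropped at the first filter it fails. The following is a minimal sketch of that idea, not Meta's implementation: the `Candidate` fields, the filter labels, and the example candidates are all hypothetical, and the real filters invoke a build system, a test runner, and a coverage tool rather than reading precomputed flags.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Candidate:
    """Hypothetical record of one LLM-generated test case and its check results."""
    name: str
    builds: bool
    passes_first_run: bool
    passes_every_rerun: bool
    new_lines_covered: int


# Filters in pipeline order: build -> execution -> flakiness -> coverage.
FILTERS: List[Tuple[str, Callable[[Candidate], bool]]] = [
    ("build", lambda c: c.builds),
    ("execution", lambda c: c.passes_first_run),
    ("flakiness", lambda c: c.passes_every_rerun),
    ("coverage", lambda c: c.new_lines_covered > 0),
]


def run_pipeline(candidates: Iterable[Candidate]) -> Tuple[List[Candidate], Dict[str, str]]:
    """Return the surviving candidates and, for each reject, the first filter it failed."""
    survivors: List[Candidate] = []
    rejected: Dict[str, str] = {}
    for candidate in candidates:
        for label, keep in FILTERS:
            if not keep(candidate):
                rejected[candidate.name] = label
                break
        else:
            # No filter rejected the candidate, so it can be recommended.
            survivors.append(candidate)
    return survivors, rejected


# Illustrative candidates only; none of these come from the paper.
candidates = [
    Candidate("calls_missing_method", False, False, False, 0),
    Candidate("wrong_assertion", True, False, False, 0),
    Candidate("timing_dependent", True, True, False, 3),
    Candidate("duplicate_of_existing", True, True, True, 0),
    Candidate("covers_early_return", True, True, True, 2),
]
survivors, rejected = run_pipeline(candidates)
print([c.name for c in survivors])
print(rejected)
```

Ordering the filters from cheapest to most expensive check is a natural design choice here, since a candidate that does not compile never needs to be executed or measured for coverage.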

System Modules

LLM Generator

Generate candidate test cases based on existing test classes

Model or implementation: Internal Meta LLMs (LLM1 and LLM2)

Build Filter (Filtration)

Discard any candidate code that does not compile

Model or implementation: Standard Kotlin Compiler / Build System

Execution Filter (Filtration)

Discard tests that fail on first execution

Model or implementation: Unit Test Runner

Flakiness Filter (Filtration)

Detect and discard unreliable tests

Model or implementation: Repeated Execution Runner
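
As a rough illustration of repeated execution, the sketch below accepts a test only if it passes on every run. The function name `is_reliable`, the `run_test` callable, and the default of 5 repetitions are assumptions made for this example; this summary does not state how many repetitions the real filter uses.

```python
from typing import Callable, Iterable


def is_reliable(run_test: Callable[[], bool], repetitions: int = 5) -> bool:
    """Return True only if `run_test` passes on every one of `repetitions` runs.

    `repetitions=5` is an assumed default for illustration.
    """
    for _ in range(repetitions):
        if not run_test():
            return False  # A single failure marks the test as unreliable.
    return True


def scripted_test(outcomes: Iterable[bool]) -> Callable[[], bool]:
    """Build a fake test whose successive runs return the given outcomes."""
    remaining = iter(outcomes)
    return lambda: next(remaining)


stable = scripted_test([True, True, True, True, True])
intermittent = scripted_test([True, True, False, True, True])

print(is_reliable(stable))
print(is_reliable(intermittent))
```

Repeated execution can only reduce the chance of admitting a flaky test, not eliminate it: a test that fails once in a thousand runs would usually pass five runs in a row.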

Coverage Filter (Filtration)

Ensure the test adds value

Model or implementation: Coverage Analysis Tool
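
One way to picture the coverage check is as a set difference between the lines covered with and without the candidate test. The sketch below assumes line-level coverage represented as `(file, line)` pairs; the helper names and the `Feed.kt` file are hypothetical, and the actual coverage tool and its data format are not described in this summary.

```python
from typing import Set, Tuple

Line = Tuple[str, int]  # (source file, line number)


def newly_covered_lines(baseline: Set[Line], with_candidate: Set[Line]) -> Set[Line]:
    """Lines covered once the candidate is added that the existing suite missed."""
    return with_candidate - baseline


def passes_coverage_filter(baseline: Set[Line], with_candidate: Set[Line]) -> bool:
    """Keep the candidate only if it covers at least one previously uncovered line."""
    return len(newly_covered_lines(baseline, with_candidate)) > 0


# Hypothetical coverage data for illustration.
existing_suite = {("Feed.kt", 10), ("Feed.kt", 11), ("Feed.kt", 20)}
with_early_return_test = existing_suite | {("Feed.kt", 14)}
with_redundant_test = set(existing_suite)

print(newly_covered_lines(existing_suite, with_early_return_test))
print(passes_coverage_filter(existing_suite, with_redundant_test))
```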

Novel Architectural Elements

Filtration pipeline designed specifically for 'Assured' generation: strict multi-stage verification (Build → Pass → Flakiness → Coverage) ensuring zero regression and guaranteed improvement

Modeling

Base Model: Two internal Meta LLMs (referred to as LLM1 and LLM2)

Compute: Not reported in the paper

📊 Experiments & Results

Evaluation Setup

Deployment on industrial codebases (Instagram and Facebook Android apps) during engineering 'test-a-thons'

Benchmarks:

Instagram Reels and Stories (Unit Test Generation)
Instagram App (General) (Unit Test Generation)
Facebook App (Unit Test Generation)

Metrics:

Build rate (percentage of generated tests that compile)
Pass rate (percentage of tests that pass reliably)
Coverage improvement rate (percentage of tests increasing line coverage)
Acceptance rate (percentage of recommendations accepted by engineers)

Statistical methodology: Not explicitly reported in the paper

Key Results

Evaluation on 86 Kotlin components (Reels and Stories) to determine filtration funnel statistics:

Benchmark	Metric	Baseline	This Paper	Δ
Instagram Reels and Stories	Build Rate (%)	0	75	+75
Instagram Reels and Stories	Reliable Pass Rate (%)	0	57	+57
Instagram Reels and Stories	Coverage Improvement Rate (%)	0	25	+25

Deployment results from Instagram and Facebook test-a-thons showing acceptance and success rates:

Benchmark	Metric	Baseline	This Paper	Δ
Instagram/Facebook Test-a-thons	Acceptance Rate (%)	0	73	+73
Instagram/Facebook Test-a-thons	Improvement Rate (%)	0	11.5	+11.5
First Instagram Test-a-thon	Rank by tests landed	Not reported in the paper	6	Not reported in the paper
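
The three Reels and Stories rates can be read as a funnel. The short calculation below derives how many percentage points are lost at each stage and the pass-through rate of each stage, assuming that all three percentages share the same denominator (all generated test cases) and that the stages are nested, so every test that increases coverage also built and passed reliably.

```python
from typing import List, Tuple

# Percent of all generated tests remaining after each stage (from the table above).
funnel: List[Tuple[str, float]] = [
    ("generated", 100.0),
    ("built", 75.0),
    ("passed reliably", 57.0),
    ("increased coverage", 25.0),
]


def stage_stats(stages: List[Tuple[str, float]]) -> List[Tuple[str, float, float]]:
    """For each stage: (name, points lost at this stage, % of the previous stage retained)."""
    rows = []
    for (_, before), (name, after) in zip(stages, stages[1:]):
        rows.append((name, before - after, round(100.0 * after / before, 1)))
    return rows


for name, lost, retained in stage_stats(funnel):
    print(f"{name}: lost {lost} points, retained {retained}% of the previous stage")
```

Under these assumptions, about three quarters of the tests that build also pass reliably, while fewer than half of the reliably passing tests add new coverage, so the coverage filter removes the largest share of candidates.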

Experiment Figures

Sankey diagram of the filtration process outcomes for the experimental study on Instagram Reels and Stories

Main Takeaways

Providing the 'class under test' in the prompt (RAG) improves results, but the model can still find unique tests using only the existing test class code.
Temperature 0.0 was chosen as the default: its success rate (4% per trial) was competitive with higher temperatures, and its output is more deterministic.
Low success rates per individual trial (4-5%) are acceptable because the process is automated and offline; the aggregate value (10% of classes improved) is significant at scale.
Human engineers preferred 'diff time' recommendations (when they are actively working on code) over 'post-land' recommendations.
Generated tests often hit valid corner cases (nulls, empty lists) that humans missed, though they typically added small increments of coverage (median 2.5 lines).
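
To see why a low per-trial success rate can still add up when trials are cheap and automated, consider a simplified model in which each trial on a class succeeds independently with the same probability. Independence and a constant rate are assumptions made only for this illustration; the paper's trials vary the model and prompt and are unlikely to be independent, so the numbers below are not a reproduction of its results.

```python
def chance_of_improvement(per_trial_success: float, trials: int) -> float:
    """Probability that at least one of `trials` independent attempts succeeds."""
    if not 0.0 <= per_trial_success <= 1.0:
        raise ValueError("per_trial_success must be between 0 and 1")
    if trials < 0:
        raise ValueError("trials must be non-negative")
    return 1.0 - (1.0 - per_trial_success) ** trials


# 4% is the per-trial rate quoted above; the trial counts are arbitrary examples.
for n in (1, 2, 5, 10):
    print(n, round(chance_of_improvement(0.04, n), 4))
```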

📚 Prerequisite Knowledge

Prerequisites

Unit testing frameworks (specifically Kotlin)
Continuous Integration (CI) workflows
Code coverage metrics
Large Language Models (LLMs) for code generation

Key Terms

Assured Offline LLMSE: Assured Offline LLM-Based Software Engineering, an approach that embeds LLMs in a workflow that filters their outputs to provide verifiable guarantees (e.g., code compiles, tests pass) before human review

Flaky tests: Tests that exhibit non-deterministic behavior, passing and failing on the same code without changes

Diff: A set of changes to the codebase submitted for review (short for differential)

Hallucination: When an LLM generates code that looks syntactically correct but references non-existent variables, methods, or logic

Corner cases: Input scenarios that occur outside normal operating parameters (e.g., null values, empty lists)

Test-a-thon: A focused event where engineers dedicate time specifically to writing and improving tests

Regression testing: Re-running tests to ensure that recent code changes have not broken existing functionality

Test oracle: A mechanism for determining whether the output of a program is correct for a given input

Retrieval Augmented Generation: Providing relevant context (like the class under test) in the prompt to help the LLM generate better code