Automated Hypothesis Validation with Agentic Sequential Falsifications

📝 Paper Summary

Agentic data analysis · Automated scientific discovery · Hypothesis testing

POPPER is an agentic framework that validates abstract free-form hypotheses by iteratively designing and executing specific falsification experiments while maintaining strict Type-I error control via e-values.

Core Problem

Validating abstract natural language hypotheses is difficult because they cannot be tested directly, and LLM-generated hypotheses are voluminous and prone to hallucination.

Why it matters:

Directly verifying broad statements like 'Gene X causes Disease Y' is infeasible; they must be translated into measurable implications
Without rigorous statistical control, automated systems risk high Type-I error rates (false discoveries), wasting resources on incorrect theories
Existing LLM agents lack mechanisms to aggregate evidence from multiple tests while preserving statistical validity

Concrete Example: For the hypothesis 'Gene ZAP70 regulates IL-2 production', a standard agent might find a single correlation in a dataset and falsely claim verification. POPPER instead sequentially tests distinct implications (e.g., protein interactions, then tissue expression correlations, then eQTL associations), aggregating evidence to avoid false positives.

Key Novelty

Agentic Sequential Falsification with E-values

Adopts Karl Popper's falsification principle: instead of proving a hypothesis, the system iteratively attempts to refute specific sub-hypotheses (measurable implications) derived from the main claim
Uses a sequential testing framework based on e-values (rather than p-values) to aggregate evidence from dependent, adaptively chosen experiments, allowing optional stopping while controlling Type-I error
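
The error-control argument behind this design can be sketched as follows. This is a standard result for products of e-values, not a transcription of the paper's proof; the symbols E_t, e_i, and F_{i-1} (the information available before test i) are notation introduced here for illustration.

```latex
% Sketch under stated assumptions; notation is illustrative.
% Assume each e_i \ge 0 satisfies \mathbb{E}_{H_0}[e_i \mid \mathcal{F}_{i-1}] \le 1,
% even when test i is chosen adaptively from earlier results.
E_t = \prod_{i=1}^{t} e_i, \qquad E_0 = 1, \qquad
\mathbb{E}_{H_0}\!\left[E_t \mid \mathcal{F}_{t-1}\right] \le E_{t-1}

% E_t is a nonnegative supermartingale, so Ville's inequality gives
\Pr_{H_0}\!\left(\exists\, t \ge 1 : E_t \ge 1/\alpha\right) \le \alpha
```

Because the bound holds over all t simultaneously, stopping as soon as E_t reaches 1/α, or stopping early for any other reason, does not inflate the Type-I error.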

Architecture

The iterative workflow of POPPER for hypothesis validation.

Evaluation Highlights

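Keeps Type-I error at or near the nominal α = 0.1 level on biology and sociology benchmarks (0.082 on TargetVal-IL2, 0.103 on DiscoveryBench), whereas standard LLM agents (CodeGen, ReAct) exceed it (baseline errors of 0.248 on DiscoveryBench and 0.264 on TargetVal-IL2)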
Achieves 63.8% power on DiscoveryBench, outperforming ReAct (38.3%) and CodeGen (37.8%) by substantial margins while maintaining validity
Matches human expert performance in error control and power on biological tasks but completes validation 9.7x faster

Breakthrough Assessment

9/10

Significantly advances automated science by solving the critical problem of statistical rigor in LLM agents. The integration of e-values with agentic reasoning is a methodological leap for reliable discovery.

⚙️ Technical Details

Problem Definition

Setting: Hypothesis validation function f: H → {0,1}, where 0 is unvalidated and 1 is validated (rejecting the null)

Inputs: Natural language hypothesis H and access to datasets D

Outputs: Binary validation decision ŷ with Type-I error control (P(ŷ=1|H0) ≤ α)

Pipeline Flow

Loop until budget or rejection: Experiment Design Agent (proposes test) → Relevance Checker (filters invalid tests) → Experiment Execution Agent (runs test & gets p-value) → E-value Aggregation
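
The control flow of this loop can be sketched in a few lines of Python. This is a minimal illustration, not the POPPER implementation: the `Proposal` container, the function names, the default `kappa = 0.5`, and the choice to let a filtered proposal consume one budget slot are all assumptions, and the three agents are passed in as plain callables.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Proposal:
    """A falsification test proposed by the design agent (hypothetical container)."""
    sub_hypothesis: str
    plan: str


def sequential_falsification(
    hypothesis: str,
    design: Callable[[str, List[float]], Proposal],
    is_relevant: Callable[[str, Proposal], bool],
    execute: Callable[[Proposal], float],
    alpha: float = 0.1,
    max_tests: int = 5,
    kappa: float = 0.5,
) -> int:
    """Return 1 (validated) if accumulated evidence reaches 1/alpha, else 0."""
    evidence = 1.0
    p_values: List[float] = []
    for _ in range(max_tests):
        proposal = design(hypothesis, p_values)
        if not is_relevant(hypothesis, proposal):
            continue  # assumption: a filtered proposal still uses one budget slot
        p = min(max(execute(proposal), 1e-12), 1.0)  # guard against p == 0
        p_values.append(p)
        evidence *= kappa * p ** (kappa - 1.0)  # p-to-e calibration
        if evidence >= 1.0 / alpha:
            return 1  # enough evidence against the null: stop early
    return 0  # budget exhausted without rejection


# Toy stand-ins for the LLM agents, for illustration only.
def toy_design(hypothesis: str, history: List[float]) -> Proposal:
    return Proposal(f"implication {len(history) + 1} of: {hypothesis}", "run a test")


strong = sequential_falsification("X regulates Y", toy_design,
                                  lambda h, p: True, lambda p: 0.001)
weak = sequential_falsification("X regulates Y", toy_design,
                                lambda h, p: True, lambda p: 0.5)
print(strong, weak)
```

With these toy agents, a consistently small p-value triggers rejection on the first test, while p-values of 0.5 shrink the evidence at every step and the budget runs out.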

System Modules

Experiment Design Agent (Design)

Propose a falsifiable sub-hypothesis h_i (null/alternative) and experiment plan based on main hypothesis H and history

Model or implementation: Claude-Sonnet-3.5 (default)

Relevance Checker (Design)

Verify that the proposed null sub-hypothesis h_0 is logically implied by the main null hypothesis H_0

Model or implementation: LLM-as-a-judge

Experiment Execution Agent

Implement the experiment plan by writing and executing Python code to analyze data

Model or implementation: Claude-Sonnet-3.5 with ReAct framework

Sequential Aggregator

Convert p-values to e-values and update cumulative evidence

Model or implementation: Deterministic algorithm
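
A worked example may make the aggregation step concrete. The sketch below assumes the calibrator e = κ · p^(κ-1) listed under Key Terms, with κ = 0.5 chosen purely for illustration; the function names and the example p-values are hypothetical.

```python
from typing import Iterable, List


def p_to_e(p: float, kappa: float = 0.5) -> float:
    """Calibrate a p-value into an e-value via e = kappa * p**(kappa - 1)."""
    if not 0.0 < kappa < 1.0:
        raise ValueError("kappa must lie strictly between 0 and 1")
    if not 0.0 < p <= 1.0:
        raise ValueError("p must lie in (0, 1]")
    return kappa * p ** (kappa - 1.0)


def running_evidence(p_values: Iterable[float], kappa: float = 0.5) -> List[float]:
    """Cumulative product of e-values after each test."""
    total = 1.0
    trajectory = []
    for p in p_values:
        total *= p_to_e(p, kappa)
        trajectory.append(total)
    return trajectory


# Made-up p-values from three successive falsification tests.
trajectory = running_evidence([0.01, 0.04, 0.25])
print([round(e, 3) for e in trajectory])
```

The three e-values here are roughly 5, 2.5, and 1, so the cumulative evidence is roughly 5, 12.5, and 12.5. At α = 0.1 the rejection threshold is 1/α = 10, which is crossed after the second test. Note that p = 0.25 maps to an e-value of 1 and leaves the evidence unchanged, and any larger p-value would reduce it.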

Novel Architectural Elements

Integration of LLM-based experiment design with valid sequential e-value aggregation
Use of a Relevance Checker module specifically to enforce logical implication (Assumption 1) for statistical validity

Modeling

Base Model: Claude-Sonnet-3.5 (primary), compared with GPT-4o, o1, Llama-3.3-70B

Comparison to Prior Work

vs. Fisher's Combined Test: POPPER uses e-values to combine dependent, sequentially chosen tests, whereas Fisher's method assumes independent p-values from tests fixed in advance
vs. ReAct/CodeGen: POPPER explicitly models statistical error control, preventing hallucinated validations common in standard agentic loops
vs. LLM-Likelihood Ratio: POPPER uses rigorous p-to-e calibration rather than relying on LLM to estimate likelihood ratios directly, ensuring validity
vs. ChemCrow [not cited in paper]: ChemCrow uses tools for synthesis but lacks a statistical falsification logic for hypothesis testing
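
For reference, Fisher's combined test from the first comparison above can be sketched with the standard library alone. This is the textbook method, not code from the paper. The statistic -2 Σ ln p_i follows a chi-square distribution with 2k degrees of freedom only when the k p-values are independent, which is the assumption that adaptively chosen tests break. The function name is hypothetical.

```python
import math
from typing import Sequence


def fisher_combined_p(p_values: Sequence[float]) -> float:
    """Combined p-value for k independent p-values (Fisher's method).

    Uses the closed-form chi-square survival function for even degrees
    of freedom: P(X > x) = exp(-x/2) * sum_{j<k} (x/2)**j / j!.
    """
    k = len(p_values)
    if k == 0:
        raise ValueError("need at least one p-value")
    half_stat = -sum(math.log(p) for p in p_values)  # equals X/2
    tail_sum = sum(half_stat ** j / math.factorial(j) for j in range(k))
    return math.exp(-half_stat) * tail_sum


single = fisher_combined_p([0.05])
pair = fisher_combined_p([0.05, 0.05])
print(single, pair)
```

A single p-value is returned unchanged, and two independent p-values of 0.05 combine to a smaller value. If the second test were chosen after seeing the first result, this combined p-value would no longer be guaranteed valid.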

Limitations

Depends on the availability of relevant datasets; cannot validate hypotheses requiring unavailable data
Relevance checker is an LLM approximation; if it fails to detect irrelevant sub-hypotheses, error control may weaken
Requires high-capability models (e.g., Sonnet 3.5, GPT-4o); weaker models (Haiku) fail to control error
Currently instantiated on static databases; real-world wet-lab integration is conceptual

Reproducibility

Code: https://github.com/snap-stanford/POPPER

The paper uses a standard benchmark (DiscoveryBench) and a newly constructed TargetVal benchmark built from public biological data (GTEx, GWAS Catalog, etc.). All LLM prompts and experimental setups are described.

📊 Experiments & Results

Evaluation Setup

Validation of ground-truth hypotheses against large-scale datasets (TargetVal, DiscoveryBench)

Benchmarks:

Target Validation (TargetVal): biological genotype-phenotype validation (IL2, IFNG) [New]
DiscoveryBench: hypothesis testing across 6 domains (Sociology, Biology, Economics, etc.)

Metrics:

Type-I Error Rate (False Positive Rate)
Power (True Positive Rate)
Statistical methodology: Bootstrap/permutation to generate null distributions for Type-I error estimation. Standard deviation reported over 5 runs.
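
The two metrics above reduce to simple proportions once each hypothesis has a known ground-truth label. The sketch below assumes a hypothetical record format of (null_is_true, decision) pairs and uses made-up toy data, not results from the paper.

```python
from typing import Sequence, Tuple


def error_and_power(records: Sequence[Tuple[bool, int]]) -> Tuple[float, float]:
    """Estimate (Type-I error rate, power) from labeled decisions.

    Each record is (null_is_true, decision), where decision 1 means
    the hypothesis was declared validated (the null was rejected).
    """
    null_decisions = [d for null_true, d in records if null_true]
    alt_decisions = [d for null_true, d in records if not null_true]
    if not null_decisions or not alt_decisions:
        raise ValueError("need at least one null and one non-null hypothesis")
    type1 = sum(null_decisions) / len(null_decisions)  # false positive rate
    power = sum(alt_decisions) / len(alt_decisions)    # true positive rate
    return type1, power


# Toy data: 10 null hypotheses (1 false rejection), 5 true hypotheses (3 detected).
toy = [(True, 1)] + [(True, 0)] * 9 + [(False, 1)] * 3 + [(False, 0)] * 2
type1, power = error_and_power(toy)
print(type1, power)
```

On this toy data the estimated Type-I error is 0.1 and the power is 0.6.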

Key Results

Type-I error control: POPPER stays at or near the nominal α = 0.1 level while baselines exceed it.
Benchmark	Metric	Baseline	This Paper	Δ
DiscoveryBench	Type-I Error (α=0.1)	0.248	0.103	-0.145
TargetVal-IL2	Type-I Error (α=0.1)	0.264	0.082	-0.182

Power analysis: POPPER achieves high discovery rates among valid methods.
Benchmark	Metric	Baseline	This Paper	Δ
DiscoveryBench	Power	0.383	0.638	+0.255
TargetVal-IL2	Power	0.183	0.580	+0.397

Experiment Figures

Trajectory of POPPER's reasoning and evidence accumulation.

Comparison between POPPER and human experts on TargetVal-IL2.

Main Takeaways

Standard LLM agents (CodeGen, ReAct) fail to balance Type-I error and power; they are either too liberal (hallucinating discoveries) or too conservative.
Sequential e-value aggregation allows POPPER to accumulate diverse evidence (expression, interaction, genetics) to boost power without inflating error.
The Relevance Checker is critical: removing it (POPPER-NoReleCheck) increased Type-I error on TargetVal-IL2 from 0.082 to 0.340.
Human expert study confirms POPPER matches PhD-level performance in accuracy but operates ~10x faster.
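
The claim that sequential e-value aggregation does not inflate error can be checked on synthetic data. The simulation below is an illustrative sketch, not an experiment from the paper. It assumes the null is true (so every p-value is uniform), applies the calibrator e = κ · p^(κ-1) with κ = 0.5, and stops at the first crossing of 1/α. All parameter values and the function name are hypothetical.

```python
import random


def null_rejection_rate(alpha: float = 0.1, max_tests: int = 5, kappa: float = 0.5,
                        trials: int = 20000, seed: int = 0) -> float:
    """Monte Carlo estimate of the false rejection rate with optional stopping."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        evidence = 1.0
        for _ in range(max_tests):
            p = 1.0 - rng.random()  # uniform on (0, 1], avoids p == 0
            evidence *= kappa * p ** (kappa - 1.0)
            if evidence >= 1.0 / alpha:
                rejections += 1  # stop at the first crossing of 1/alpha
                break
    return rejections / trials


rate = null_rejection_rate()
print(rate)
```

The estimated rate should stay below α even though the procedure checks the evidence after every test. Checking a combined p-value after every test without correction would not offer the same guarantee.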

📚 Prerequisite Knowledge

Prerequisites

Hypothesis testing (Null vs. Alternative)
Type-I error and Power
Martingales and sequential testing

Key Terms

falsification: The principle that scientific hypotheses cannot be proven true, only rejected (falsified) by contrary evidence

e-value: A non-negative random variable with expectation ≤ 1 under the null hypothesis, used for accumulating evidence in sequential testing

Type-I error: The probability of incorrectly rejecting a true null hypothesis (false positive)

p-to-e calibrator: A function that converts a standard p-value into an e-value (e.g., e = κ * p^(κ-1) for a fixed κ in (0, 1)) to allow evidence aggregation
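
A short calculation shows why this calibrator yields a valid e-value. It is a standard derivation, assuming the p-value is uniform on (0, 1) under the null and that κ lies strictly between 0 and 1.

```latex
% Expectation of the calibrated value under the null, p ~ Uniform(0,1):
\mathbb{E}_{H_0}\!\left[\kappa\, p^{\kappa - 1}\right]
  = \int_0^1 \kappa\, p^{\kappa - 1}\, dp
  = \left[\, p^{\kappa} \,\right]_0^1
  = 1
```

Since κ · p^(κ-1) is decreasing in p, a conservative (super-uniform) p-value gives an expectation of at most 1, so the requirement in the e-value definition still holds.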

optional stopping: The ability to stop gathering evidence at any time based on the data observed so far without invalidating statistical guarantees

sub-hypothesis: A specific, measurable implication derived from a broad, abstract main hypothesis (e.g., the main hypothesis 'X regulates Y' implies the testable sub-hypothesis 'Expression of X correlates with Y')

ReAct: Reasoning + Acting, a paradigm in which LLMs interleave reasoning traces with task-specific actions (such as code execution)