
HLER: Human-in-the-Loop Economic Research via Multi-Agent Pipelines for Empirical Discovery

Chen Zhu, Xiaolu Wang
China Agricultural University
arXiv (2026)
Agent Reasoning Benchmark

📝 Paper Summary

Multi-agent systems for scientific discovery
Automated empirical research
HLER is a multi-agent pipeline that automates empirical economics research by constraining hypothesis generation to actual dataset properties and integrating human oversight for question selection and final approval.
Core Problem
Existing AI research agents often hallucinate infeasible hypotheses and struggle with the specific procedural rigor of empirical economics, such as identification strategies and data constraints.
Why it matters:
  • Unconstrained LLMs frequently propose research questions that require variables not present in the dataset, leading to wasted compute and dead ends.
  • Credible social science requires careful identification strategies and human judgment on economic significance, which fully autonomous 'AI Scientists' often lack.
  • Reproducibility in economics is fragile; automating the workflow with transparent code generation can help standardize evidence generation.
Concrete Example: An unconstrained LLM might propose studying the 'impact of remote work on rural wages' using a dataset that contains no remote work variable. HLER avoids this by first auditing the dataset schema and only generating questions compatible with available variables.
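The feasibility check in this example can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the `Hypothesis` structure, the function names, and the column names are all assumptions made here, and the summary does not say how HLER represents hypotheses or schemas internally.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    """Hypothetical structure: a research question plus the variables it needs."""
    question: str
    required_variables: frozenset


def audit_schema(columns):
    """Normalize raw column names into the set of variables available for analysis."""
    return frozenset(c.strip().lower() for c in columns)


def split_by_feasibility(hypotheses, available):
    """Split hypotheses by whether every required variable exists in the dataset.

    Each returned entry is a (hypothesis, missing_variables) pair; the missing
    list is empty for feasible hypotheses.
    """
    feasible, infeasible = [], []
    for h in hypotheses:
        missing = sorted(h.required_variables - available)
        (infeasible if missing else feasible).append((h, missing))
    return feasible, infeasible


# Illustrative dataset with no remote-work variable (column names are made up).
columns = ["household_id", "wage", "education_years", "rural"]
available = audit_schema(columns)

candidates = [
    Hypothesis("Impact of remote work on rural wages",
               frozenset({"remote_work", "wage", "rural"})),
    Hypothesis("Returns to education in rural households",
               frozenset({"education_years", "wage", "rural"})),
]

feasible, infeasible = split_by_feasibility(candidates, available)
for h, missing in infeasible:
    print(f"Rejected: {h.question} (missing: {', '.join(missing)})")
```

A set-difference check like this only catches missing variables. The audit described in the summary also covers statistical distributions, which this sketch does not model.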
Key Novelty
Dataset-Aware Human-in-the-Loop Economic Research
  • Implements a 'dataset-aware' hypothesis generation mechanism that conditions LLM brainstorming on a structured audit of the dataset's variables and statistical distributions.
  • Uses a dual-loop architecture: a 'Question Quality Loop' for human selection of hypotheses, and a 'Research Revision Loop' where an automated reviewer agent iteratively critiques and requests re-analysis.
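The control flow of the two loops might look roughly like the sketch below. It is an assumption-laden outline: the function `run_pipeline`, its callable parameters, the round limits, and the stub agents are hypothetical names introduced here, and the final human approval gate mentioned in the summary is left out for brevity.

```python
from typing import Callable, Optional


def run_pipeline(
    propose: Callable[[], list],
    human_select: Callable[[list], Optional[str]],
    analyze: Callable[[str, list], str],
    review: Callable[[str], list],
    max_question_rounds: int = 3,
    max_revisions: int = 3,
) -> Optional[str]:
    """Run a question-selection loop, then a review-and-revise loop."""
    # Question Quality Loop: regenerate candidates until a human picks one.
    question = None
    for _ in range(max_question_rounds):
        question = human_select(propose())
        if question is not None:
            break
    if question is None:
        return None  # No question was accepted; stop before any analysis.

    # Research Revision Loop: the reviewer critiques, the analysis is redone.
    draft = analyze(question, [])
    for _ in range(max_revisions):
        critiques = review(draft)
        if not critiques:
            break  # Reviewer has no further requests.
        draft = analyze(question, critiques)
    return draft


# Stub agents standing in for LLM calls and the human gate (illustrative only).
def demo_analyze(question, critiques):
    addressed = ", ".join(critiques) or "nothing yet"
    return f"{question} | addressed: {addressed}"


def demo_review(draft):
    return [] if "robustness" in draft else ["add robustness check"]


result = run_pipeline(
    propose=lambda: ["Does education raise wages?"],
    human_select=lambda questions: questions[0],
    analyze=demo_analyze,
    review=demo_review,
)
print(result)
```

Capping both loops with explicit round limits is a design choice of this sketch; the summary does not state how HLER decides when to stop revising.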
Evaluation Highlights
  • Dataset-aware generation produced feasible research questions in 87% of cases, compared to only 41% for unconstrained LLM ideation.
  • Reduced the rate of infeasible/hallucinated hypotheses from 59% (unconstrained) to 13% (dataset-aware).
  • Produced complete end-to-end empirical manuscripts across 14 runs, at an average API cost of $0.80-$1.50 per paper.
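The feasibility and infeasibility figures above are complements of each other, and the implied size of the improvement can be checked with a short calculation. The variable names are illustrative; only the 87% and 41% figures come from the summary, and the derived reductions are arithmetic on those two numbers rather than results reported by the paper.

```python
# Reported feasibility rates, in percent.
feasible_aware = 87
feasible_unconstrained = 41

# Infeasible rates are the complements of the feasible rates.
infeasible_aware = 100 - feasible_aware
infeasible_unconstrained = 100 - feasible_unconstrained

# Drop in the infeasible rate, in percentage points and relative terms.
absolute_reduction = infeasible_unconstrained - infeasible_aware
relative_reduction = absolute_reduction / infeasible_unconstrained

print(f"Infeasible: {infeasible_unconstrained}% -> {infeasible_aware}%")
print(f"Reduction: {absolute_reduction} points ({relative_reduction:.0%} relative)")
```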
Breakthrough Assessment
7/10
Significant practical step in constraining AI scientific discovery to reality. While not a new fundamental algorithm, the architectural integration of data auditing and human gates solves a major pain point in applied AI research.