Evaluation Setup
End-to-end generation of research papers using 3 empirical datasets (China Health and Nutrition Survey, CMGPD-Liaoning, UK Biobank).
Benchmarks:
- Dataset-Aware vs. Unconstrained Generation (Hypothesis Feasibility Check) [New]
Metrics:
- Feasibility rate of generated questions (%)
- Infeasible/Hallucinated hypothesis rate (%)
- API Cost per run ($)
- Reviewer scores (1-10)
Statistical methodology: Descriptive statistics comparing rates across pipeline runs.
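The feasibility and infeasible-rate metrics above can be illustrated with a minimal sketch. It assumes a hypothesis counts as feasible when every variable it needs appears in the dataset profile; the paper's actual check may differ. The `Hypothesis` class, the function names, and the column names are hypothetical and chosen only for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    """A generated research question plus the variables it needs (hypothetical structure)."""
    question: str
    required_variables: frozenset[str]


def is_feasible(hypothesis: Hypothesis, dataset_columns: set[str]) -> bool:
    # Assumption: feasible means every required variable exists in the dataset profile.
    return hypothesis.required_variables <= dataset_columns


def feasibility_rate(hypotheses: list[Hypothesis], dataset_columns: set[str]) -> float:
    # Percentage of hypotheses that pass the check; 0.0 for an empty list.
    if not hypotheses:
        return 0.0
    feasible = sum(is_feasible(h, dataset_columns) for h in hypotheses)
    return 100.0 * feasible / len(hypotheses)


# Illustrative column names and questions, not taken from the paper's datasets.
columns = {"age", "income", "bmi", "region"}
generated = [
    Hypothesis("Does income predict BMI?", frozenset({"income", "bmi"})),
    Hypothesis("Does sleep duration predict BMI?", frozenset({"sleep_hours", "bmi"})),
    Hypothesis("Does BMI vary by region?", frozenset({"bmi", "region"})),
    Hypothesis("Does age predict income?", frozenset({"age", "income"})),
]

rate = feasibility_rate(generated, columns)
infeasible_rate = 100.0 - rate
print(f"feasible: {rate:.1f}%, infeasible: {infeasible_rate:.1f}%")
# prints "feasible: 75.0%, infeasible: 25.0%"
```

Under this definition the two rates always sum to 100%, which matches the complementary figures reported in the results table.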
Key Results
Comparison of hypothesis generation methods shows that constraining the LLM with dataset profiles drastically improves feasibility.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Hypothesis Generation | Feasible Question Rate (%) | 41 | 87 | +46 |
| Hypothesis Generation | Infeasible/Hallucinated Rate (%) | 59 | 13 | -46 |
| End-to-End Execution | Cost per Run ($) | Not reported in the paper | 1.50 | Not reported in the paper |
Main Takeaways
- Dataset-aware conditioning is critical for empirical research agents; without it, 59% of generated hypotheses were infeasible or hallucinated, compared with 13% when the LLM was constrained by dataset profiles.
- The iterative 'Research Revision Loop' functions effectively: reviewer agents can trigger re-analysis by the econometrics agent to improve paper quality.
- The system is cost-effective ($0.80 to $1.50 per paper) compared to human research assistance, though human oversight remains essential for semantic value.
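The 'Research Revision Loop' described in the takeaways can be sketched as a simple control loop, assuming a reviewer returns a score on the 1-10 scale and a low score triggers re-analysis. The acceptance threshold, the round limit, and the stub agents below are hypothetical; the paper's agents and stopping rule are not specified here.

```python
from typing import Callable

Draft = dict  # hypothetical stand-in for a paper draft


def revision_loop(
    draft: Draft,
    review: Callable[[Draft], int],
    reanalyze: Callable[[Draft], Draft],
    threshold: int = 7,   # assumed acceptance score on the 1-10 scale
    max_rounds: int = 3,  # assumed cap so the loop always terminates
) -> tuple[Draft, list[int]]:
    """Re-run the analysis until the reviewer score reaches the threshold."""
    score = review(draft)
    history = [score]
    rounds = 0
    while score < threshold and rounds < max_rounds:
        draft = reanalyze(draft)   # the econometrics agent revises the analysis
        score = review(draft)      # the reviewer agent scores the new draft
        history.append(score)
        rounds += 1
    return draft, history


# Stub agents for illustration only: each re-analysis adds one robustness check,
# and the stub reviewer rewards each check with one point, capped at 10.
def stub_review(draft: Draft) -> int:
    return min(10, 5 + draft["robustness_checks"])


def stub_reanalyze(draft: Draft) -> Draft:
    return {**draft, "robustness_checks": draft["robustness_checks"] + 1}


final_draft, scores = revision_loop({"robustness_checks": 0}, stub_review, stub_reanalyze)
print(scores)  # prints [5, 6, 7]
```

Capping the number of rounds keeps API cost per run bounded, at the price of occasionally accepting a draft that is still below the threshold.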