Curie: Toward Rigorous and Automated Scientific Experimentation with AI Agents

📝 Paper Summary

Multi-agent, Self-evolving, Agentic reasoning

Curie is an agentic framework that automates scientific experimentation by enforcing rigor through modular validation (Intra-ARM) and structured coordination (Inter-ARM) between architect and technician agents.

Core Problem

Existing AI agents for science rely on ad-hoc prompts and lack the rigor required for reliable experimentation, leading to hallucinations, unverified procedures, and non-reproducible results.

Why it matters:

Reckless or unverified AI experimentation generates untrustworthy findings, potentially polluting scientific literature with hallucinated results
Current LLM-based research assistants excel at literature review but fail at methodical execution (e.g., setting up controlled variables, handling dependencies)
Without structured oversight, errors in early experimental stages (like environment setup) propagate, wasting resources and compromising final conclusions

Concrete Example: When asked to reproduce a specific distributed systems experiment, a standard coding agent might generate a script that hardcodes variables or skips dependency checks. Curie's validator would catch the hardcoded values, force the agent to parameterize them, and verify the setup runs in a clean environment before allowing full execution.

Key Novelty

Experimental Rigor Engine (Intra-ARM & Inter-ARM)

Injects a 'supervisor' loop between planning and execution: an Intra-Agent Rigor Module (Intra-ARM) intercepts agent actions to validate them against specific policies (e.g., reproducibility checks) before proceeding
Uses an Inter-Agent Rigor Module (Inter-ARM) to break large experimental plans into independent partitions and schedule them, preventing the chaotic execution typical of naive multi-agent conversations
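
The interception step can be pictured as a gate that runs every proposed script through a list of policies and blocks it when any policy objects. The sketch below is a minimal illustration under stated assumptions, not Curie's implementation: the names `Verdict`, `intra_arm_gate`, `parses_cleanly`, and `no_hardcoded_paths` are hypothetical, and the two policies are simple stand-ins for the reproducibility checks described above.

```python
import ast
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Verdict:
    """Outcome of a validation pass (hypothetical type, not from the paper)."""
    approved: bool
    violations: List[str]


# A policy takes source code and returns a list of human-readable violations.
Policy = Callable[[str], List[str]]


def parses_cleanly(source: str) -> List[str]:
    """Cheap runnability proxy: the script must at least be valid Python."""
    try:
        ast.parse(source)
        return []
    except SyntaxError as exc:
        return [f"syntax error: {exc.msg}"]


def no_hardcoded_paths(source: str) -> List[str]:
    """Flag string literals that look like absolute paths."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return []  # already reported by parses_cleanly
    return [
        f"hardcoded absolute path {node.value!r} on line {node.lineno}"
        for node in ast.walk(tree)
        if isinstance(node, ast.Constant)
        and isinstance(node.value, str)
        and node.value.startswith("/")
    ]


def intra_arm_gate(source: str, policies: List[Policy]) -> Verdict:
    """Run every policy; approve only if none reports a violation."""
    violations = [v for policy in policies for v in policy(source)]
    return Verdict(approved=not violations, violations=violations)


POLICIES = [parses_cleanly, no_hardcoded_paths]
rejected = intra_arm_gate('data = open("/tmp/results.csv").read()\n', POLICIES)
accepted = intra_arm_gate("import sys\ndata = open(sys.argv[1]).read()\n", POLICIES)
```

In this sketch `rejected.violations` carries the feedback an agent would need in order to parameterize the path and resubmit. A real validator would presumably combine such rule-based checks with LLM-based ones.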

Architecture

The complete Curie architecture, showing the interaction between the Architect, Technicians, Inter-ARM, Intra-ARM, and the Experiment Knowledge Module.

Evaluation Highlights

3.4× improvement in correctly answering experimental questions compared to OpenHands (a state-of-the-art coding agent) on the new Experimentation Benchmark
Significantly outperforms Microsoft Magentic (generalist multi-agent system) on complex tasks involving reproduction and extension of research papers
Introduces a new benchmark of 46 rigorous tasks derived from real-world CS research papers and open-source projects

Breakthrough Assessment

8/10

Strong conceptual contribution in formalizing 'rigor' for agents via explicit validation modules. The 3.4× gain is impressive, though the benchmark is relatively small (46 tasks) and domain-specific (CS research).

⚙️ Technical Details

Problem Definition

Setting: Automated execution of empirical CS research tasks, requiring hypothesis formulation, experimental design, code execution, and result analysis

Inputs: Experimental question and relevant context (e.g., domain knowledge, starter code)

Outputs: Executed experiment artifacts (code, logs), data analysis, and a final answer to the research question

Pipeline Flow

Architect Agent (Plan Design)
Inter-ARM (Partitioning & Scheduling)
Intra-ARM (Plan Validation)
Technician Agent (Setup & Execution)
Intra-ARM (Execution Validation)
Experiment Knowledge Module (Write Results)
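
The six stages above can be sketched as a single control loop. This is an illustrative sketch only: `run_experiment` and its callable parameters are hypothetical names, the stages are passed in as plain functions, and the retry-on-rejection behavior (`max_attempts`) is an assumption rather than something this summary specifies.

```python
from typing import Any, Callable, Dict, List


def run_experiment(
    question: str,
    architect: Callable[[str], List[str]],
    partition: Callable[[List[str]], List[str]],
    validate_plan: Callable[[str], bool],
    technician: Callable[[str, int], Dict[str, Any]],
    validate_run: Callable[[Dict[str, Any]], bool],
    max_attempts: int = 3,  # assumed retry budget, not from the paper
) -> List[Dict[str, Any]]:
    """Drive one experiment through the stages and return the knowledge log."""
    knowledge: List[Dict[str, Any]] = []
    plan = architect(question)                    # Architect: plan design
    for part in partition(plan):                  # Inter-ARM: partitioning and scheduling
        if not validate_plan(part):               # Intra-ARM: plan validation
            knowledge.append({"partition": part, "status": "plan_rejected"})
            continue
        for attempt in range(1, max_attempts + 1):
            result = technician(part, attempt)    # Technician: setup and execution
            if validate_run(result):              # Intra-ARM: execution validation
                knowledge.append(                 # Knowledge Module: write results
                    {"partition": part, "status": "ok",
                     "attempts": attempt, "result": result}
                )
                break
        else:
            # Every attempt was rejected by the execution validator.
            knowledge.append({"partition": part, "status": "execution_rejected"})
    return knowledge


# Stub stages for demonstration: the empty partition fails plan validation,
# and the stub technician only succeeds from its second attempt onward.
log = run_experiment(
    "Does batch size affect throughput?",
    architect=lambda q: ["batch_size=32", "batch_size=64", ""],
    partition=lambda plan: list(plan),
    validate_plan=lambda part: bool(part),
    technician=lambda part, attempt: {"exit_code": 0 if attempt >= 2 else 1},
    validate_run=lambda result: result["exit_code"] == 0,
)
```

The point of the sketch is the ordering: nothing reaches the knowledge log as a successful result without passing both validation gates.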

System Modules

Architect Agent

Designs high-level experimental plans (hypotheses, variables) and analyzes results

Model or implementation: LLM-based (Specific model not fixed, framework allows swapping)

Technician Agent

Implements experimental setup and executes trials

Model or implementation: LLM-based (Specific model not fixed)

Intra-ARM

Validates outputs from agents using specific policies (e.g., check for hardcoded values, verify runnability)

Model or implementation: Rule-based + LLM-based verifiers

Inter-ARM

Manages control flow, breaks plans into partitions, and schedules tasks based on agent availability

Model or implementation: Algorithmic logic
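
A rough picture of partitioning and availability-based scheduling, under assumptions: here a partition is taken to be one combination of independent-variable values, and scheduling proceeds in waves with at most one partition per technician per wave. The function names `partition_plan` and `schedule_waves` are hypothetical and the real Inter-ARM logic may differ.

```python
from collections import deque
from itertools import product
from typing import Any, Dict, List, Sequence, Tuple


def partition_plan(independent_vars: Dict[str, List[Any]]) -> List[Dict[str, Any]]:
    """Split a plan into one partition per combination of variable values."""
    names = sorted(independent_vars)  # fixed order keeps the output deterministic
    value_lists = [independent_vars[name] for name in names]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]


def schedule_waves(
    partitions: Sequence[Dict[str, Any]],
    technicians: Sequence[str],
) -> List[List[Tuple[str, Dict[str, Any]]]]:
    """Assign partitions to technicians, one partition per technician per wave."""
    if not technicians:
        raise ValueError("at least one technician is required")
    pending = deque(partitions)
    waves = []
    while pending:
        wave = []
        for tech in technicians:
            if not pending:
                break
            wave.append((tech, pending.popleft()))
        waves.append(wave)
    return waves


parts = partition_plan({"lr": [0.1, 0.01], "batch": [32, 64]})
waves = schedule_waves(parts, ["tech-1", "tech-2", "tech-3"])
```

With four partitions and three technicians, the sketch produces two waves: three partitions run first and the remaining one runs once a technician is free again.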

Novel Architectural Elements

Dual-layer Rigor Engine (Intra-ARM and Inter-ARM) that explicitly decouples validation and coordination from the agents themselves
Partition-based execution model where experimental plans are broken into independent variable subsets for parallel/modular execution
Tiered write access policy in the Knowledge Module preventing agents from corrupting parts of the experiment they don't own
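
The tiered write-access idea can be sketched as a store that checks the writer's role against the fields it owns. The class name `ExperimentKnowledge`, the role names, and the field-to-role mapping below are all assumptions chosen for illustration; this summary does not describe the actual tiers.

```python
from typing import Any, Dict, List, Set, Tuple


class ExperimentKnowledge:
    """Toy knowledge store with per-role write permissions (hypothetical design)."""

    # Assumed ownership: the architect owns design and analysis fields,
    # the technician owns setup and raw results.
    WRITE_POLICY: Dict[str, Set[str]] = {
        "architect": {"hypothesis", "variables", "analysis"},
        "technician": {"setup", "results"},
    }

    def __init__(self) -> None:
        self._records: Dict[str, Dict[str, Any]] = {}
        self._audit: List[Tuple[str, str, str]] = []

    def write(self, role: str, partition_id: str, field: str, value: Any) -> None:
        """Store a value only if the role owns the field; otherwise refuse."""
        if field not in self.WRITE_POLICY.get(role, set()):
            raise PermissionError(f"{role!r} may not write field {field!r}")
        self._records.setdefault(partition_id, {})[field] = value
        self._audit.append((role, partition_id, field))

    def read(self, partition_id: str) -> Dict[str, Any]:
        """Return a copy so that readers cannot mutate stored state."""
        return dict(self._records.get(partition_id, {}))


store = ExperimentKnowledge()
store.write("architect", "p1", "hypothesis", "larger batches raise throughput")
store.write("technician", "p1", "results", {"throughput": [101.2, 99.8]})
```

In this sketch a technician that tries to overwrite `hypothesis` gets a `PermissionError`, which is the kind of corruption the policy is meant to prevent.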

Modeling

Base Model: Claude 3.5 Sonnet (used for experiments in paper)

Compute: Not reported in the paper

Comparison to Prior Work

vs. OpenHands: OpenHands focuses on solving coding issues (SWE-Bench) via a single loop; Curie adds explicit rigor modules (Intra/Inter-ARM) specifically for experimental control and analysis
vs. Magentic: Magentic uses general coordination; Curie enforces scientific methodology (hypothesis -> design -> execution -> analysis) via structured state transitions
vs. ResearchAgent: ResearchAgent focuses on ideation/writing; Curie focuses on the actual execution of experiments (code, data collection)
vs. AI Scientist [not cited in paper]: AI Scientist proposes an end-to-end automated researcher; Curie focuses specifically on the rigorous experimentation phase with explicit validation gates rather than just open-ended discovery

Limitations

Benchmark is limited to Computer Science domain tasks (46 questions), reducing generalizability to wet-lab sciences
The LLM-based validators in Intra-ARM may themselves be subject to subtle hallucinations or oversights
High cost and latency due to extensive validation loops and multi-agent coordination steps compared to single-shot agents

Reproducibility

Code: https://github.com/Just-Curieous/Curie

The framework code and the benchmark dataset (46 tasks) are open-sourced in the repository above. The paper uses the Claude 3.5 Sonnet API for evaluation.

📊 Experiments & Results

Evaluation Setup

End-to-end execution of experimental tasks derived from CS papers/projects

Benchmarks:

Experimentation Benchmark [New]: scientific experimentation tasks (Reproduction, Extension, Challenge)

Metrics:

Success Rate (correctly answering the experimental question)
Cost (USD)
Execution Time
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Experimentation Benchmark	Success Rate	OpenHands: aggregate value not reported in text	Aggregate value not reported in text	3.4× improvement

Experiment Figures

A taxonomy of the 46 tasks in the Experimentation Benchmark, categorized by domain (Networking, ML, Systems, DB) and task type (Reproduce, Extend, Challenge).

Main Takeaways

Curie achieves a 3.4× improvement over OpenHands, a state-of-the-art coding agent, in experimental task success.
The benchmark's 46 tasks across 4 domains give a granular view of agent capabilities in real-world CS research contexts.
The validation approach (Intra-ARM and Inter-ARM) is reported to reduce 'reckless' errors in which agents hallucinate successful execution.

📚 Prerequisite Knowledge

Prerequisites

Familiarity with LLM-based agents (planning, tool use)
Basic understanding of scientific method (variables, controls, hypotheses)
Software engineering concepts (reproducibility, environment setup)

Key Terms

Intra-ARM: Intra-Agent Rigor Module—a component that validates individual agent actions (e.g., checking if code compiles) before they are finalized

Inter-ARM: Inter-Agent Rigor Module—a component that coordinates workflow between agents, managing task partitioning and state transitions

Architect Agent: High-level planner agent responsible for designing the experiment, defining variables, and analyzing final results

Technician Agent: Low-level executor agent responsible for writing code, setting up environments, and running trials

Experiment Knowledge Module: A structured database (DAG-like history) that tracks the state of the experiment, so agents do not depend on LLM context memory; this guards against forgotten state and hallucinated results

SWE-Bench: Software Engineering Benchmark—a standard dataset for evaluating LLMs on real-world coding issues

process supervision: A technique where feedback is provided at intermediate steps of reasoning/execution rather than just on the final output

DAG: Directed Acyclic Graph—a data structure used here to track the history of experimental changes without loops
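
To make the DAG term concrete, the sketch below records experimental changes as nodes whose parents must already exist, which rules out cycles by construction. The names `Change`, `History`, `record`, and `lineage` are hypothetical, and the change identifiers are invented for illustration; this summary does not describe how Curie stores its history.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Change:
    """One recorded step in the experiment history (hypothetical type)."""
    change_id: str
    parents: Tuple[str, ...]
    note: str


class History:
    """Append-only DAG: a change may only point at already-recorded changes."""

    def __init__(self) -> None:
        self._nodes: Dict[str, Change] = {}

    def record(self, change_id: str, parents: Tuple[str, ...], note: str) -> None:
        if change_id in self._nodes:
            raise ValueError(f"duplicate change id: {change_id}")
        missing = [p for p in parents if p not in self._nodes]
        if missing:
            raise KeyError(f"unknown parent(s): {missing}")
        self._nodes[change_id] = Change(change_id, tuple(parents), note)

    def lineage(self, change_id: str) -> List[str]:
        """Return the change and all its ancestors, oldest first."""
        seen = set()
        stack = [change_id]
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            stack.extend(self._nodes[current].parents)
        # Insertion order is a valid topological order because parents
        # are always recorded before their children.
        return [cid for cid in self._nodes if cid in seen]


history = History()
history.record("plan", (), "initial experimental plan")
history.record("setup", ("plan",), "environment and scripts")
history.record("run_a", ("setup",), "partition A executed")
history.record("run_b", ("setup",), "partition B executed")
history.record("analysis", ("run_a", "run_b"), "results compared")
```

Here `history.lineage("analysis")` traces the final analysis back through both runs to the original plan, which is the kind of provenance query a history like this is meant to support.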