CORE-Bench: Fostering the Credibility of Published Research Through a Computational Reproducibility Agent Benchmark

📝 Paper Summary

Benchmark datasets | Multi-call tool use with flexible plan

CORE-Bench evaluates AI agents on their ability to computationally reproduce results from published scientific papers by executing code and retrieving specific outputs from Dockerized environments.

Core Problem

Verifying computational reproducibility is critical for science but labor-intensive, and current AI agents lack benchmarks measuring their ability to perform this real-world task.

Why it matters:

Computational reproducibility is fundamental to science, yet studies across fields (psychology, medicine, CS) find that many papers cannot be reproduced even when their code is available
Current coding benchmarks (e.g., HumanEval) focus on toy problems and fail to capture the complexity of real-world research tasks such as library installation, debugging, and figure interpretation
Before agents can automate novel research, they must prove they can reproduce existing work, a necessary step often assumed but not tested

Concrete Example: A researcher needs to verify a paper's claim. They attempt to run the provided code but fail because the software libraries are version-incompatible or the instructions assume a specific operating system. An agent on CORE-Bench faces this exact scenario: it must install dependencies, debug execution errors, and extract a specific numerical result from a generated PDF or plot.

Key Novelty

Realistic Reproducibility Benchmark based on Containerized Environments

Builds tasks from CodeOcean capsules (verified reproducible compute environments) rather than synthetic coding problems, ensuring construct validity
Evaluates agents across three difficulty levels: traversing a finished environment, executing provided Docker instructions, and building the environment from a Readme alone
Includes both text-based tasks (extracting numbers from logs) and vision-based tasks (interpreting generated plots/figures)

Architecture

The evaluation harness architecture showing the Manager-Worker separation

Evaluation Highlights

Task-specific CORE-Agent with GPT-4o achieves 60.00% accuracy on the easiest level but drops to 21.48% on the hardest level
GPT-4o agents consistently outperform GPT-4o-mini agents (e.g., 21.48% vs 16.30% on Hard tasks)
Generalist AutoGPT agents perform poorly without task-specific prompting, scoring only 6.7% on Hard tasks compared to CORE-Agent's 21.48%

Breakthrough Assessment

8/10

A significant contribution to agent evaluation that moves beyond toy coding problems to real-world scientific workflows. The gap between accuracy on Easy (60%) and Hard (21%) tasks marks a clear frontier for future agent development.

⚙️ Technical Details

Problem Definition

Setting: Agentic interaction with a Linux shell to reproduce scientific results

Inputs: Task prompt (question about a paper's result) and a file system (containing code/data)

Outputs: A JSON file (report.json) containing the answer to the specific reproducibility question
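As a minimal sketch of what producing this output might look like, the snippet below writes a report.json that maps each task question to its answer. The question strings and values are hypothetical placeholders, and the helper name `write_report` is an assumption; the benchmark itself defines the actual keys for each task.

```python
import json
from pathlib import Path


def write_report(answers: dict, path: str = "report.json") -> Path:
    """Serialize question -> answer pairs to a JSON report file."""
    report_path = Path(path)
    report_path.write_text(json.dumps(answers, indent=2))
    return report_path


# Hypothetical questions and values, for illustration only.
example_answers = {
    "Report the test accuracy of the final model.": 0.91,
    "Report the x-axis label of the generated figure.": "Epoch",
}

if __name__ == "__main__":
    print(write_report(example_answers))
```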

Pipeline Flow

Manager initializes Worker VM
Agent receives task instructions and environment access
Agent interacts with shell/filesystem (installing, running, debugging)
Agent submits report.json
Manager verifies answer against ground truth
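The flow above can be sketched as a single function, under loud assumptions: the agent is modeled as a plain callable that returns its report as a dictionary, and numeric answers are compared with a relative tolerance. This summary does not state the actual comparison rule, so `answers_match`, `run_task`, and the 1% tolerance are hypothetical choices for illustration.

```python
import math
from typing import Callable, Dict, Union

Answer = Union[float, int, str]


def answers_match(submitted: Answer, expected: Answer, rel_tol: float = 0.01) -> bool:
    """Compare one answer. The relative tolerance for numbers is an assumption."""
    if isinstance(submitted, (int, float)) and isinstance(expected, (int, float)):
        return math.isclose(submitted, expected, rel_tol=rel_tol)
    # Non-numeric answers: case-insensitive string comparison (also an assumption).
    return str(submitted).strip().lower() == str(expected).strip().lower()


def run_task(
    agent: Callable[[str], Dict[str, Answer]],
    prompt: str,
    ground_truth: Dict[str, Answer],
) -> bool:
    """Give the prompt to the agent, then verify its report against ground truth."""
    report = agent(prompt)  # stands in for the agent's shell interaction and submission
    return all(
        question in report and answers_match(report[question], expected)
        for question, expected in ground_truth.items()
    )


def stub_agent(prompt: str) -> Dict[str, Answer]:
    """Hypothetical agent that always reports the same value."""
    return {"accuracy": 0.912}
```

In this sketch a missing question counts as a failure, so an agent cannot pass a task by answering only part of it.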

System Modules

Evaluation Harness

Orchestrates isolated virtual machines for secure and reproducible agent testing

Model or implementation: Custom Python-based harness

CORE-Agent

Task-specific variant of AutoGPT, modified to check output formats and to use task-specific prompts

Model or implementation: GPT-4o / GPT-4o-mini
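One way an output format check could work is to compare the keys of a draft report against the questions the task asks, and to surface any mismatch before submission. This summary does not describe how CORE-Agent implements its checks, so the function below and its names are purely illustrative.

```python
from typing import Dict, List


def find_format_problems(report: Dict[str, object], expected_keys: List[str]) -> List[str]:
    """Return human-readable problems with a draft report (empty list means OK)."""
    problems = []
    for key in expected_keys:
        if key not in report:
            problems.append(f"missing key: {key}")
    for key in report:
        if key not in expected_keys:
            problems.append(f"unexpected key: {key}")
    return problems


# Hypothetical draft with one misspelled key.
draft = {"test accuracy": 0.91, "x-axis lable": "Epoch"}
print(find_format_problems(draft, ["test accuracy", "x-axis label"]))
```

Feeding the returned messages back into the agent's prompt would give it a chance to repair the report rather than fail on a formatting slip.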

Novel Architectural Elements

Verification based on CodeOcean capsules: leveraging pre-verified reproducible environments to create ground truth, rather than manual verification of arbitrary papers
Three-tiered difficulty stratification based on starting state: Easy (completed run), Medium (Dockerfile provided), Hard (Readme only)
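The three tiers can be modeled as a small enumeration keyed by starting state. This is only an illustrative data structure; the class and helper names are assumptions and are not taken from the benchmark's code.

```python
from enum import Enum


class Difficulty(Enum):
    """Difficulty tiers, labeled by what the agent starts with."""
    EASY = "completed run"
    MEDIUM = "Dockerfile provided"
    HARD = "Readme only"


def must_execute_code(level: Difficulty) -> bool:
    """On Easy the outputs already exist; otherwise the agent must run the code."""
    return level is not Difficulty.EASY


def must_build_environment(level: Difficulty) -> bool:
    """Only on Hard does the agent set up the environment from the Readme alone."""
    return level is Difficulty.HARD


for level in Difficulty:
    print(level.name, must_execute_code(level), must_build_environment(level))
```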

Modeling

Base Model: GPT-4o-2024-05-13 and GPT-4o-mini-2024-07-18

Reproducibility

Code: https://github.com/siegelz/core-bench

📊 Experiments & Results

Evaluation Setup

Agents run in isolated VMs with a 2-hour time limit and $4 cost limit per task
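A per-task budget like this can be expressed as a simple guard that a harness checks between agent steps. The defaults below come from the limits stated above; the class itself, its name, and the choice to treat reaching a limit as exceeding it are assumptions, since how the harness enforces the limits is not described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskBudget:
    """Per-task limits: 2 hours of wall-clock time and $4 of API spend."""
    max_seconds: float = 2 * 60 * 60
    max_cost_usd: float = 4.0

    def exceeded(self, elapsed_seconds: float, spent_usd: float) -> bool:
        """True once either limit has been reached."""
        return elapsed_seconds >= self.max_seconds or spent_usd >= self.max_cost_usd


budget = TaskBudget()
print(budget.exceeded(elapsed_seconds=1800, spent_usd=4.25))
```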

Benchmarks:

CORE-Bench-Easy: Navigation and Retrieval (environment pre-run) [New]
CORE-Bench-Medium: Docker Execution (Dockerfile provided) [New]
CORE-Bench-Hard: Environment Setup (Readme only) [New]

Metrics:

Task Accuracy (all questions for a task must be correct)
Average Cost ($)
Statistical methodology: Not explicitly reported in the paper
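Task Accuracy is all-or-nothing per task, which a short sketch makes concrete: a task counts as solved only if every one of its questions is answered correctly. The input layout (task id mapped to a list of per-question booleans) and the function name are assumptions made for illustration.

```python
from typing import Dict, List


def task_accuracy(results: Dict[str, List[bool]]) -> float:
    """Fraction of tasks whose questions were ALL answered correctly."""
    if not results:
        return 0.0
    solved = sum(1 for checks in results.values() if checks and all(checks))
    return solved / len(results)


# Hypothetical outcomes for three tasks.
example = {
    "task-a": [True, True],
    "task-b": [True, False],  # one wrong answer fails the whole task
    "task-c": [True],
}
print(task_accuracy(example))
```

Here four of five questions are correct, yet only two of three tasks count as solved, so the metric is stricter than per-question accuracy.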

Key Results

Accuracy is nearly flat from Easy to Medium (60.00% to 57.78%) but falls sharply on Hard (21.48%), showing that environment setup is a major bottleneck.

Benchmark	Metric	Baseline	This Paper	Δ
CORE-Bench-Easy	Accuracy (%)	35.6	60.00	+24.40
CORE-Bench-Medium	Accuracy (%)	20.7	57.78	+37.08
CORE-Bench-Hard	Accuracy (%)	6.7	21.48	+14.78

GPT-4o consistently outperforms GPT-4o-mini, though mini is significantly cheaper. In the row below, the Baseline is the GPT-4o-mini agent.

Benchmark	Metric	Baseline	This Paper	Δ
CORE-Bench-Hard	Accuracy (%)	16.30	21.48	+5.18
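The Δ column is the absolute difference, in percentage points, between the "This Paper" and "Baseline" accuracies. The short check below recomputes it from the reported values; the variable and function names are illustrative.

```python
# (benchmark, baseline accuracy %, this-paper accuracy %), copied from the rows above
ROWS = [
    ("CORE-Bench-Easy", 35.6, 60.00),
    ("CORE-Bench-Medium", 20.7, 57.78),
    ("CORE-Bench-Hard", 6.7, 21.48),
    ("CORE-Bench-Hard", 16.30, 21.48),
]


def delta_points(baseline: float, this_paper: float) -> float:
    """Absolute difference in percentage points, rounded to two decimals."""
    return round(this_paper - baseline, 2)


for name, baseline, this_paper in ROWS:
    print(f"{name}: {delta_points(baseline, this_paper):+.2f}")
# prints +24.40, +37.08, +14.78, +5.18 for the four rows
```

Because Δ is in percentage points, the +14.78 on Hard corresponds to more than tripling the baseline's 6.7%.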

Experiment Figures

Cost vs Accuracy trade-offs for GPT-4o and GPT-4o-mini agents across difficulty levels

Accuracy breakdown by discipline (CS, Medicine, Social Science) and Language (Python, R)

Main Takeaways

Task-specific modifications (CORE-Agent) massively improve performance over generic agents (AutoGPT), specifically via output format checks and prompting hints
Vision-based tasks are much harder than text-based tasks (59.26% vs 87.88% accuracy on Easy level), indicating agents struggle to interpret scientific figures
Agents were more accurate on Computer Science papers than on Medicine or Social Science papers, partly because CS papers primarily use Python rather than R
Increasing cost limits beyond $4 did not significantly improve accuracy on Hard tasks; agents tend to get stuck in loops rather than needing more time to succeed

📚 Prerequisite Knowledge

Prerequisites

Understanding of computational reproducibility (running code to get matching results)
Familiarity with Docker and containerization
Basic knowledge of AI agents and tool use (shell, file editing)

Key Terms

computational reproducibility: The ability to reproduce the results of a scientific study using the data and code provided by its authors

CodeOcean: A cloud-based computational reproducibility platform that provides verified, containerized environments (capsules) for scientific code

Dockerfile: A text document that contains all the commands a user could call on the command line to assemble an image

AutoGPT: An open-source experimental application that attempts to make GPT-4 autonomous by chaining thoughts and actions

capsule: A self-contained computing environment (on CodeOcean) including code, data, and environment specifications

vision-language tasks: Tasks requiring the agent to interpret visual outputs like plots or figures to answer a query