Evaluation Setup
Agents run in isolated VMs with a 2-hour time limit and a $4 cost limit per task.
Benchmarks:
- CORE-Bench-Easy: Navigation and Retrieval (environment pre-run) [New]
- CORE-Bench-Medium: Docker Execution (Dockerfile provided) [New]
- CORE-Bench-Hard: Environment Setup (README only) [New]
Metrics:
- Task Accuracy (all questions for a task must be correct)
- Average Cost ($)
- Statistical methodology: Not explicitly reported in the paper
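The all-or-nothing Task Accuracy metric and the Average Cost metric listed above can be sketched as follows. This is a minimal illustration, not the paper's evaluation harness: the `TaskResult` record, its field names, and the treatment of a task with no recorded answers as failed are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TaskResult:
    """Hypothetical record of one agent run on one task (not from the paper)."""
    task_id: str
    question_correct: List[bool]  # one flag per question in the task
    cost_usd: float               # API spend for this run, in dollars


def task_is_correct(result: TaskResult) -> bool:
    # All-or-nothing: a single wrong answer fails the whole task.
    # A task with no recorded answers is treated as failed (an assumption).
    return bool(result.question_correct) and all(result.question_correct)


def task_accuracy(results: List[TaskResult]) -> float:
    """Percentage of tasks whose questions were all answered correctly."""
    if not results:
        raise ValueError("need at least one task result")
    solved = sum(task_is_correct(r) for r in results)
    return 100.0 * solved / len(results)


def average_cost(results: List[TaskResult]) -> float:
    """Mean cost per task in dollars."""
    if not results:
        raise ValueError("need at least one task result")
    return sum(r.cost_usd for r in results) / len(results)


runs = [
    TaskResult("task-a", [True, True, True], 1.0),
    TaskResult("task-b", [True, False, True], 3.0),  # one miss fails the task
]
print(task_accuracy(runs))  # 50.0
print(average_cost(runs))   # 2.0
```

Note that `task-b` answers two of three questions correctly but contributes nothing to the score, which is why this metric is stricter than per-question accuracy.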
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Performance drops significantly as task difficulty increases (Easy -> Medium -> Hard), showing that environment setup is a major bottleneck. |
| CORE-Bench-Easy | Accuracy (%) | 35.6 | 60.00 | +24.40 |
| CORE-Bench-Medium | Accuracy (%) | 20.7 | 57.78 | +37.08 |
| CORE-Bench-Hard | Accuracy (%) | 6.7 | 21.48 | +14.78 |
| GPT-4o consistently outperforms GPT-4o-mini, though mini is significantly cheaper. |
| CORE-Bench-Hard | Accuracy (%) | 16.30 | 21.48 | +5.18 |
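The Δ column reads as an absolute difference in percentage points (This Paper minus Baseline), not a relative improvement. The short check below recomputes it from the table values. The row labels and the `delta_points` helper are illustrative names, and `Decimal` is used only to avoid floating-point noise in the comparison.

```python
from decimal import Decimal

# (label, baseline, this paper, reported delta), copied from the table above.
rows = [
    ("CORE-Bench-Easy", "35.6", "60.00", "24.40"),
    ("CORE-Bench-Medium", "20.7", "57.78", "37.08"),
    ("CORE-Bench-Hard", "6.7", "21.48", "14.78"),
    ("CORE-Bench-Hard (model comparison)", "16.30", "21.48", "5.18"),
]


def delta_points(baseline: str, ours: str) -> Decimal:
    """Absolute accuracy gain in percentage points."""
    return Decimal(ours) - Decimal(baseline)


# Collect any row whose recomputed delta disagrees with the reported one.
mismatches = [
    label for label, base, ours, reported in rows
    if delta_points(base, ours) != Decimal(reported)
]
print(mismatches)  # []
```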
Main Takeaways
- Task-specific modifications (CORE-Agent) substantially improve performance over a generic agent (AutoGPT), specifically through output format checks and prompting hints
- Vision-based tasks are much harder than text-based tasks (59.26% vs 87.88% accuracy on the Easy level), indicating that agents struggle to interpret scientific figures
- Computer Science papers were more reproducible than Medicine or Social Science papers, partly because they primarily use Python rather than R
- Raising the cost limit beyond $4 did not significantly improve accuracy on Hard tasks; agents tend to get stuck in loops rather than fail for lack of budget