
RE-Bench: Evaluating frontier AI R&D capabilities of language model agents against human experts

Hjalmar Wijk, T. Lin, Joel Becker, Sami Jawhar, Neev Parikh, Thomas Broadley, Lawrence Chan, Michael Chen, Joshua Clymer, Jai Dhyani, Elena Ericheva, Katharyn Garcia, Brian Goodrich, Nikola Jurkovic, Megan Kinniment, Aron Lajko, Seraphina Nix, L. Sato, William Saunders, M. Taran, Ben West, Elizabeth Barnes
arXiv.org (2024)
Agent, Benchmark, Reasoning

📝 Paper Summary

Agentic R&D capabilities, AI safety evaluations, Autonomous research agents
This paper presents 7 continuous-metric environments designed to evaluate whether AI agents can match human experts in solving complex, open-ended machine learning R&D problems over an 8-hour timeframe.
Core Problem
Current benchmarks test narrow coding or QA skills, failing to capture the long-horizon experimentation and debugging capabilities needed for real-world AI research and development.
Why it matters:
  • Automating AI R&D could create a runaway feedback loop of accelerating capabilities, potentially outpacing safety and oversight mechanisms
  • Without realistic evaluations, developers may miss the threshold where AI systems become capable of independent, transformative research or sabotage
  • Comparing agents against human experts on meaningful tasks provides a concrete 'early warning' signal for dangerous capability thresholds
Concrete Example: In the 'Optimize a kernel' environment, a human expert might spend hours understanding Triton documentation to write a highly efficient custom GPU kernel. In contrast, current agents might quickly try naive PyTorch implementations or simple tricks, failing to achieve the deep optimization required for a high score.
Key Novelty
Continuous-Metric R&D Evaluation Suite
  • Introduces 7 distinct environments (e.g., optimizing kernels, finetuning models) with continuous scoring metrics, allowing measurement of partial progress rather than binary pass/fail
  • Provides extensive human baselines (n=44) to establish what expert performance looks like over time, enabling direct comparison of agent progress curves against humans
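To illustrate why a continuous metric captures partial progress where a pass/fail check does not, here is a minimal sketch. It assumes a linear normalization in which the provided starting solution maps to 0 and a reference solution maps to 1; that choice, along with the function names `normalized_score` and `best_so_far` and the sample numbers, is hypothetical and not taken from this summary.

```python
from typing import List, Tuple


def normalized_score(raw: float, start: float, reference: float) -> float:
    """Map a raw metric linearly so that start -> 0.0 and reference -> 1.0.

    Assumption: a linear scale. It works whether lower or higher raw
    values are better, because the sign of (reference - start) flips.
    """
    if reference == start:
        raise ValueError("reference and starting scores must differ")
    return (raw - start) / (reference - start)


def best_so_far(checkpoints: List[Tuple[float, float]]) -> List[Tuple[float, float]]:
    """Turn (hours_elapsed, score) checkpoints into a running-maximum curve."""
    curve = []
    best = float("-inf")
    for hours, score in sorted(checkpoints):
        best = max(best, score)
        curve.append((hours, best))
    return curve


# Hypothetical kernel runtimes in ms (lower is better):
# the starting solution takes 10 ms, the reference solution takes 2 ms.
halfway = normalized_score(6.0, start=10.0, reference=2.0)

# Hypothetical checkpoints from one run, given out of order on purpose.
curve = best_so_far([(2.0, 0.1), (1.0, 0.3), (4.0, 0.6)])
```

A binary threshold would report both a 9 ms and a 3 ms kernel as failures against a 2 ms target, while the normalized score separates them. The running-maximum curve is one simple way to compare agent and human progress over the same time budget.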
Evaluation Highlights
  • Claude-3.5-Sonnet agents made meaningful progress in 3 out of 7 environments ('Optimize a kernel', 'Finetune GPT-2', 'Rust scaffolding'), occasionally beating weaker human baselines
  • Agents consistently outpace humans in the first hour due to rapid coding speed but plateau quickly, while humans continue to improve over the full 8-hour session
  • In 4 out of 7 environments, agents failed to improve upon the provided starting solution at all, struggling with debugging and resource management
Breakthrough Assessment
7/10
A significant step forward in realistic AI capability evaluation, moving beyond toy problems to actual R&D tasks. The detailed human baselines are highly valuable, though the number of environments (7) is small.