
Multi-Agent Risks from Advanced AI

Lewis Hammond, Alan Chan, Jesse Clifton, Jason Hoelscher-Obermaier, Akbir Khan, Euan McLean, Chandler Smith, Wolfram Barfuss, Jakob Foerster, Tomáš Gavenčiak, The Anh Han, Edward Hughes, Vojtěch Kovařík, Jan Kulveit, Joel Z. Leibo, Caspar Oesterheld, Christian Schroeder de Witt, Nisarg Shah, Michael Wellman, Paolo Bova, Theodor Cimpeanu, Carson Ezell, Quentin Feuillade-Montixi, Matija Franklin, Esben Kran, Igor Krawczuk, Max Lamparth, Niklas Lauffer, Alexander Meinke, Sumeet Motwani, et al.
Cooperative AI Foundation, University of Oxford, Google DeepMind, Carnegie Mellon University, Harvard University
arXiv (2025)
Agent RL Benchmark

📝 Paper Summary

Multi-Agent Systems AI Safety
This report proposes a taxonomy of risks specific to advanced multi-agent AI systems, grouping failures into miscoordination, conflict, and collusion, and identifying seven structural risk factors that drive them.
Core Problem
Current AI safety research focuses predominantly on single-agent alignment and therefore does not address critical risks that emerge only when multiple advanced agents interact, such as conflicting conventions or resource depletion.
Why it matters:
  • Aligning individual agents is insufficient to prevent conflict if actors have diverging interests
  • Errors acceptable in isolated models can compound catastrophically in dynamic multi-agent networks
  • Groups of agents can collude to develop dangerous capabilities or goals not ascribable to any single individual
Concrete Example: In a driving simulation (Case Study 1), two AI agents trained on different conventions (US vs. Indian traffic laws) crash 77.5% of the time because they cannot zero-shot coordinate on yielding rules, whereas unspecialized base models fail only 5% of the time.
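The mechanism behind this example can be illustrated with a toy model. The sketch below is not the paper's simulation: the two conventions (`priority_to_right`, `priority_to_left`), the single-intersection setup, and all function names are hypothetical, chosen only to show how two agents that each follow a self-consistent rule can still fail when paired.

```python
from itertools import product

# Hypothetical conventions for illustration; not taken from the paper.
CONVENTIONS = ("priority_to_right", "priority_to_left")


def goes(convention: str, other_is_on: str) -> bool:
    """Return True if a driver following `convention` proceeds.

    A driver yields only when the other car is on the side
    that its convention gives priority to.
    """
    yields_to = "right" if convention == "priority_to_right" else "left"
    return other_is_on != yields_to


def outcome(conv_a: str, conv_b: str) -> str:
    """Outcome when car A and car B meet at an intersection.

    Assumed geometry: A sees B on its left, B sees A on its right.
    """
    a_goes = goes(conv_a, other_is_on="left")
    b_goes = goes(conv_b, other_is_on="right")
    if a_goes and b_goes:
        return "crash"      # both proceed
    if not a_goes and not b_goes:
        return "deadlock"   # both yield
    return "ok"             # exactly one proceeds


results = {(a, b): outcome(a, b) for a, b in product(CONVENTIONS, repeat=2)}
for pair, result in results.items():
    print(pair, "->", result)
```

In this toy setup, matched conventions always resolve the encounter, while mismatched conventions produce either a crash or a deadlock. Neither agent is individually "wrong"; the failure exists only at the level of the pair.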
Key Novelty
Multi-Agent Risk Taxonomy
  • Classifies failure modes based on agent incentives: Miscoordination (same goal, failed action), Conflict (opposing goals), and Collusion (cooperation undesirable to outsiders).
  • Identifies seven distinct risk factors driving these failures, including Information Asymmetries, Network Effects, and Destabilising Dynamics.
Evaluation Highlights
  • Specialized driving agents trained on conflicting conventions failed to coordinate 77.5% of the time, compared to a 5.0% failure rate for unspecialized base models.
  • In the GovSim benchmark, advanced LLMs depleted shared resources in 46% of cases (54% survival rate), replicating the tragedy of the commons.
  • Demonstrates that convention-following cannot be assumed in zero-shot interactions between heterogeneous agents.
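The resource-depletion result can be made concrete with a minimal common-pool sketch. This is an illustrative model, not the GovSim implementation: the regrowth rule (remaining stock doubles, capped at capacity), the collapse threshold, the number of agents, and every parameter name are assumptions made for this example.

```python
from typing import Optional


def simulate(
    harvest_per_agent: float,
    n_agents: int = 5,          # assumed group size
    capacity: float = 100.0,    # assumed maximum stock
    rounds: int = 12,           # assumed horizon
    collapse_below: float = 5.0,  # assumed collapse threshold
) -> Optional[int]:
    """Return the round in which the resource collapses, or None if it survives."""
    stock = capacity
    for current_round in range(1, rounds + 1):
        # All agents harvest; they cannot take more than what is left.
        stock -= min(stock, harvest_per_agent * n_agents)
        if stock < collapse_below:
            return current_round
        # Assumed regrowth rule: remaining stock doubles, capped at capacity.
        stock = min(capacity, 2 * stock)
    return None


sustainable = simulate(harvest_per_agent=10)  # group takes 50 per round
greedy = simulate(harvest_per_agent=12)       # group takes 60 per round
print("sustainable run collapsed in round:", sustainable)
print("greedy run collapsed in round:", greedy)
```

Under these assumptions, a group harvest of 50 per round is exactly replenished, whereas each agent taking only slightly more drives the stock to zero within a few rounds. This is the tragedy-of-the-commons dynamic the benchmark result refers to: small individual over-extraction compounds into collective collapse.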
Breakthrough Assessment
9/10
A foundational, comprehensive report that defines the scope of multi-agent AI safety and provides a much-needed taxonomy, with concrete examples, for a critical and under-studied area.