CREW-WILDFIRE: Benchmarking Agentic Multi-Agent Collaborations at Scale

📝 Paper Summary

Multi-agent, Benchmark datasets, Agentic AI

CREW-Wildfire is a scalable, procedurally generated benchmark for evaluating LLM-based multi-agent systems on complex, physically grounded wildfire response tasks requiring coordination under uncertainty.

Core Problem

Existing multi-agent benchmarks are too small in scale, are symbolic or turn-based (like Hanabi), or lack the architectural support for large-scale LLM-based agent coordination in embodied, dynamic environments.

Why it matters:

Real-world tasks like disaster response require coordinating hundreds of heterogeneous agents (drones, bulldozers, firefighters) under partial observability
Current MARL benchmarks rely on rigid communication or centralized training that doesn't scale to flexible, language-based Agentic AI
It is unclear if current LLM agents can handle the dual challenge of strategic long-horizon planning and precise low-level execution

Concrete Example: In a wildfire scenario, a drone might spot fire spreading where ground crews cannot see it. In current systems, the drone often fails to communicate this spatial information well enough to guide a bulldozer to cut a firebreak in time, and the mission fails for lack of coordination.

Key Novelty

Procedurally Generated Multi-Agent Wildfire Benchmark (CREW-Wildfire)

Simulates realistic wildfire dynamics (wind, slope, moisture) with heterogeneous agents (Firefighters, Bulldozers, Drones, Helicopters) that have distinct, complementary capabilities
Integrates Perception and Execution modules to bridge high-level LLM reasoning with low-level simulation control, allowing agents to 'see' via text summaries and 'act' via code
Supports massive scale (2000+ agents, 1M+ grid cells) to test scalability limits of agentic frameworks

Architecture

The CREW-Wildfire system architecture bridging the Unity simulation with Agentic LLMs.

Evaluation Highlights

Benchmarked multiple state-of-the-art LLM frameworks (e.g., hierarchical, consensus-based), revealing significant failures in spatial reasoning and real-time coordination
Demonstrated scalability up to 2000 agents and 1 million map cells on a single consumer desktop (16GB GPU/RAM)
Established that while emergent collaboration appears in simple tasks, current agents struggle with objective prioritization and plan adaptation under uncertainty

Breakthrough Assessment

8/10

Fills a critical gap for physically grounded, scalable multi-agent LLM benchmarks. The integration of low-level simulation with high-level agentic interfaces is robust and timely.

⚙️ Technical Details

Problem Definition

Setting: Partially observable stochastic game (POSG) modeling wildfire response

Inputs: Observations O (local mini-map, agent status, messages from peers) via text or tensor

Outputs: Actions A (movement, tool use, communication) via code or text commands
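
For reference, a POSG is conventionally written as the tuple below. This is the textbook formulation, not notation confirmed from the paper: the symbols T, Z, R_i, and \pi_i are assumptions, while the observation and action sets correspond to the O and A listed above.

```latex
\[
\mathcal{G} = \big\langle \mathcal{I},\ \mathcal{S},\ \{\mathcal{A}_i\}_{i \in \mathcal{I}},\ \{\mathcal{O}_i\}_{i \in \mathcal{I}},\ T,\ Z,\ \{R_i\}_{i \in \mathcal{I}} \big\rangle
\]
\[
T : \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{S}), \qquad
Z : \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{O}), \qquad
R_i : \mathcal{S} \times \mathcal{A} \to \mathbb{R},
\qquad
\mathcal{A} = \prod_{i \in \mathcal{I}} \mathcal{A}_i,\quad
\mathcal{O} = \prod_{i \in \mathcal{I}} \mathcal{O}_i
\]
\[
a_i^{t} \sim \pi_i\big(\,\cdot \mid o_i^{1}, \dots, o_i^{t}\big)
\]
```

Here \mathcal{I} is the set of agents, \mathcal{S} the hidden world state (terrain, fire, agent positions), and each agent i acts only on its own observation history, which is what makes the setting partially observable.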

Pipeline Flow

Environment Generation (Procedural map & fire init)
Perception Module: Input Processing (Raw Tensor/Image → ASCII/Text)
Agent Reasoning (LLM-based planning & communication)
Execution Module: Action Translation (Text Command → Action Code)
Simulation Step (Unity/CREW backend updates state)
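
The five stages above can be read as a single control loop. The following is a minimal sketch under stated assumptions: `ToyEnv`, `perceive`, `reason`, `execute`, and `run` are hypothetical names, the one-dimensional environment stands in for the Unity/CREW backend, and a fixed rule stands in for the LLM call. It is not the benchmark's API.

```python
from dataclasses import dataclass, field

@dataclass
class ToyEnv:
    """Hypothetical stand-in for the simulation backend: one agent on a 1-D strip."""
    size: int = 5
    agent: int = 0
    fire: int = 4          # index of the burning cell, or None once extinguished
    log: list = field(default_factory=list)

    def observe(self) -> list:
        # Raw observation: 0 = empty, 1 = agent, 2 = fire
        cells = [0] * self.size
        if self.fire is not None:
            cells[self.fire] = 2
        cells[self.agent] = 1
        return cells

    def step(self, action: tuple) -> None:
        name, arg = action
        if name == "move":
            self.agent = max(0, min(self.size - 1, self.agent + arg))
        elif name == "extinguish" and self.fire is not None \
                and abs(self.fire - self.agent) <= 1:
            self.fire = None
        self.log.append(action)

def perceive(raw: list) -> str:
    """Perception stage: raw cells -> ASCII text."""
    return "".join(".AF"[c] for c in raw)

def reason(text: str) -> str:
    """Placeholder for the LLM: a fixed rule, not a model call."""
    if "F" not in text:
        return "wait"
    gap = text.index("F") - text.index("A")
    if abs(gap) <= 1:
        return "extinguish"
    return "move right" if gap > 0 else "move left"

def execute(command: str) -> tuple:
    """Execution stage: text command -> primitive action (unknown text -> wait)."""
    table = {"move right": ("move", 1), "move left": ("move", -1),
             "extinguish": ("extinguish", 0), "wait": ("wait", 0)}
    return table.get(command, ("wait", 0))

def run(env: ToyEnv, max_steps: int = 10) -> int:
    """Loop perceive -> reason -> execute -> step; return steps until the fire is out."""
    for t in range(max_steps):
        if env.fire is None:
            return t
        env.step(execute(reason(perceive(env.observe()))))
    return max_steps
```

Keeping the three stages as separate functions mirrors the modular design described here: the `reason` placeholder could be swapped for any LLM client without touching perception or execution.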

System Modules

Perception Module

Translate raw environment data into LLM-readable text

Model or implementation: Rule-based translator / VLM (optional)

Agent Brain

Decision making, planning, and communication

Model or implementation: Various LLMs (benchmarked with GPT-4, Llama-3, etc.)

Execution Module

Convert natural language commands into executable primitive actions

Model or implementation: LLM-based translator (e.g., GPT-4 or smaller model)
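
To make the translation step concrete, here is a small rule-based stand-in for the LLM-based translator. It is a sketch only: the command grammar, the `CAPABILITIES` table, and the `translate` function are hypothetical, and the real per-role action sets are defined by the benchmark itself.

```python
import re

# Illustrative capability table (assumption, not the benchmark's action sets).
CAPABILITIES = {
    "drone": {"move", "scan"},
    "bulldozer": {"move", "cut_firebreak"},
    "firefighter": {"move", "spray"},
}
DIRECTIONS = {"north": (-1, 0), "south": (1, 0), "east": (0, 1), "west": (0, -1)}
PATTERN = re.compile(r"\s*(\w+)\s+(north|south|east|west)(?:\s+(\d+))?\s*")

def translate(role: str, command: str) -> list:
    """Map '<verb> <direction> [count]' to primitive (verb, dr, dc) actions.

    Returns an empty list when the text does not parse or the role lacks the verb.
    """
    m = PATTERN.fullmatch(command.lower())
    if not m:
        return []
    verb, direction = m.group(1), m.group(2)
    count = int(m.group(3) or 1)
    if verb not in CAPABILITIES.get(role, set()):
        return []
    dr, dc = DIRECTIONS[direction]
    return [(verb, dr, dc)] * count
```

Rejecting commands that a role cannot perform at this layer keeps invalid actions out of the simulator, which matters when heterogeneous agents share one language interface.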

Novel Architectural Elements

Hybrid architecture combining a high-fidelity Unity physics engine with modular Perception/Execution wrappers specifically designed for LLM agents
Scalable cellular automata fire model optimized for real-time interaction with thousands of agents
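
As an illustration of the cellular automata idea, the sketch below implements a generic probabilistic fire-spread rule with a wind bias. It is not the paper's model: the state encoding, the parameters `base_p` and `wind_bonus`, and the function `step_fire` are assumptions, and slope and moisture are omitted for brevity.

```python
import random

EMPTY, TREE, FIRE, BURNT = 0, 1, 2, 3

def step_fire(grid, wind=(0, 0), base_p=0.3, wind_bonus=0.4, rng=random):
    """One synchronous update of the grid.

    Every burning cell burns out and may ignite each 4-neighbour tree with
    probability base_p, plus wind_bonus when the neighbour lies downwind.
    `wind` is a (dr, dc) unit step such as (0, 1) for wind blowing east.
    """
    rows, cols = len(grid), len(grid[0])
    new = [row[:] for row in grid]          # write to a copy so updates are synchronous
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != FIRE:
                continue
            new[r][c] = BURNT
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == TREE:
                    p = base_p + (wind_bonus if (dr, dc) == wind else 0.0)
                    if rng.random() < min(p, 1.0):
                        new[nr][nc] = FIRE
    return new
```

Because each cell's update depends only on its immediate neighbours, the cost per step is linear in the number of cells, which is one reason this family of models suits large maps.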

Modeling

Base Model: Varies (benchmark supports plug-and-play LLMs like GPT-4, Llama-3)

Training Method: In-context learning / Prompting (evaluation-only benchmark)

Adaptation: None (Prompt engineering only)

Trainable Parameters: None (Inference only)

Compute: Supports 2000+ agents on Desktop (16GB GPU + 16GB RAM)

Comparison to Prior Work

vs. FireCommander: CREW-Wildfire supports LLM-based agents with natural language interfaces, whereas FireCommander focuses on probabilistic/MARL approaches
vs. StarCraft II: Focused on open-ended disaster response with heterogeneous cooperative roles rather than zero-sum competitive play
vs. Overcooked: Scales to thousands of agents and large maps, unlike small grid-world constraints of Overcooked
vs. Hivex: Designed specifically for Agentic AI (LLMs) with text-based perception/execution wrappers, rather than just numerical control

Limitations

Simulation fidelity vs. reality gap: cellular automata fire model is an approximation
Perception module reliance: performance heavily depends on how well the text summary captures the grid state
LLM cost/latency: running thousands of LLM agents is computationally expensive and slow compared to MARL policies
Limited fine-motor control: abstraction layer prevents continuous low-level physics manipulation

Reproducibility

Code: https://github.com/general-robotics-lab/CREW-Wildfire

The code is publicly available and includes environments, Perception/Execution modules, and baseline implementations. Specific prompt templates are in the paper's Appendix.

📊 Experiments & Results

Evaluation Setup

Wildfire response scenarios with varying map sizes, fire intensities, and team compositions.

Benchmarks:

CREW-Wildfire Main Tasks (Civilian Rescue, Fire Containment, Fire Extinguishment) [New]

Metrics:

Success Rate (Task Completion)
Area Burnt (%)
Civilians Rescued
Survival Rate of Agents
Statistical methodology: Not explicitly reported in the paper

Key Results

Computational scalability tests show the environment handles large-scale simulations efficiently.

Benchmark	Metric	Baseline	This Paper	Δ
CREW-Wildfire Engine	Max Agents Supported	100	2,000	+1,900
CREW-Wildfire Engine	Max Map Size (Cells)	Not reported	1,000,000	Not reported

Experiment Figures

Visualizations of the fire propagation model based on wind, slope, and vegetation.

Procedural generation examples showing diverse terrain maps.

Main Takeaways

Current LLM-based agents struggle significantly with spatial reasoning when coordinates are provided purely as text/ASCII
While agents can form high-level plans (e.g., 'save the civilian'), they fail at precise real-time execution and coordination required to encircle a spreading fire
Heterogeneity is underutilized; agents often fail to leverage the complementary strengths of drones (scouting) and bulldozers (clearing) effectively without explicit prompting
The benchmark successfully exposes the 'gap' between chatting about a plan and executing it in a dynamic, stochastic environment

📚 Prerequisite Knowledge

Prerequisites

Multi-Agent Reinforcement Learning (MARL) concepts
Large Language Models (LLMs) for planning
Cellular Automata (for fire simulation)

Key Terms

Agentic AI: AI systems that can autonomously plan, reason, and take actions to accomplish tasks, often using LLMs as a brain

MARL: Multi-Agent Reinforcement Learning—learning policies for multiple agents interacting in a shared environment

Cellular Automata: A discrete model studied in computability theory, consisting of a grid of cells where each cell's state changes based on its neighbors (used here for fire spread)

Perlin noise: A type of gradient noise used to generate natural-looking textures and terrain procedurally

Partial observability: A setting where agents can only perceive a limited part of the environment, not the full state

VLMs: Vision-Language Models—AI models capable of processing both image and text inputs