Evaluation Setup
Wildfire response scenarios with varying map sizes, fire intensities, and team compositions.
Benchmarks:
- CREW-Wildfire Main Tasks (Civilian Rescue, Fire Containment, Fire Extinguishment) [New]
Metrics:
- Success Rate (Task Completion)
- Area Burnt (%)
- Civilians Rescued
- Survival Rate of Agents
- Statistical methodology: Not explicitly reported in the paper
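As a rough illustration of how the four metrics above could be aggregated across evaluation episodes, here is a minimal Python sketch. The `EpisodeResult` record, its field names, and the choice to average per-episode values are all assumptions made for this example. The paper's actual log schema and aggregation procedure are not described in this summary.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class EpisodeResult:
    # Hypothetical per-episode record; field names are illustrative only.
    task_completed: bool
    burnt_cells: int
    total_cells: int
    civilians_rescued: int
    agents_alive: int
    agents_total: int


def summarize(episodes: List[EpisodeResult]) -> Dict[str, float]:
    """Average each metric over episodes (assumed aggregation, not the paper's)."""
    n = len(episodes)
    if n == 0:
        raise ValueError("need at least one episode")
    return {
        # Fraction of episodes in which the task was completed.
        "success_rate": sum(e.task_completed for e in episodes) / n,
        # Mean share of the map that burnt, as a percentage.
        "area_burnt_pct": 100.0 * sum(e.burnt_cells / e.total_cells for e in episodes) / n,
        # Mean number of civilians rescued per episode.
        "civilians_rescued": sum(e.civilians_rescued for e in episodes) / n,
        # Mean fraction of agents still alive at the end of an episode.
        "agent_survival_rate": sum(e.agents_alive / e.agents_total for e in episodes) / n,
    }


episodes = [
    EpisodeResult(True, 25, 100, 3, 4, 4),
    EpisodeResult(False, 75, 100, 1, 2, 4),
]
summary = summarize(episodes)
```

Because the statistical methodology is not reported, a real analysis would also need to choose how to express uncertainty (for example, confidence intervals over seeds); this sketch deliberately leaves that out.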
Key Results
Computational scalability tests show the environment handles large-scale simulations efficiently.
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| CREW-Wildfire Engine | Max Agents Supported | 100 | 2000 | +1900 |
| CREW-Wildfire Engine | Max Map Size (Cells) | Not reported in the paper | 1000000 | Not reported in the paper |
Main Takeaways
- Current LLM-based agents struggle significantly with spatial reasoning when coordinates are provided purely as text/ASCII
- While agents can form high-level plans (e.g., 'save the civilian'), they fail at the precise real-time execution and coordination required to encircle a spreading fire
- Heterogeneity is underutilized; agents often fail to leverage the complementary strengths of drones (scouting) and bulldozers (clearing) effectively without explicit prompting
- The benchmark successfully exposes the 'gap' between chatting about a plan and executing it in a dynamic, stochastic environment
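
To make the first takeaway concrete, the sketch below shows one way a grid state might be serialized into an ASCII observation of the kind an LLM agent would have to reason over. The function name `render_ascii`, the symbol set (`A`, `C`, `F`, `.`), and the `(x, y)` coordinate convention are assumptions for illustration, not the benchmark's actual observation format.

```python
from typing import Set, Tuple

Cell = Tuple[int, int]  # (x, y), with y increasing downward (assumed convention)


def render_ascii(width: int, height: int,
                 fire: Set[Cell], agents: Set[Cell], civilians: Set[Cell]) -> str:
    """Render a grid as text rows; the symbols are illustrative assumptions."""
    rows = []
    for y in range(height):
        row = []
        for x in range(width):
            if (x, y) in agents:
                row.append("A")   # agent
            elif (x, y) in civilians:
                row.append("C")   # civilian
            elif (x, y) in fire:
                row.append("F")   # burning cell
            else:
                row.append(".")   # empty or unburnt cell
        rows.append("".join(row))
    return "\n".join(rows)


observation = render_ascii(
    4, 3,
    fire={(1, 1), (2, 1)},
    agents={(0, 0)},
    civilians={(3, 2)},
)
print(observation)
```

Even on this tiny map, recovering distances or planning an encirclement from the text requires the model to reconstruct the 2D layout from character positions, which is the kind of spatial reasoning the takeaway says current agents struggle with.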