โ† Back to Paper List

Human vs. Machine: Behavioral Differences Between Expert Humans and Language Models in Wargame Simulations

Max Lamparth, Anthony Corso, Jacob Ganz, Oriana Skylar Mastro, Jacquelyn Schneider, Harold Trinkunas
arXiv (2024)
Agent Benchmark Reasoning

๐Ÿ“ Paper Summary

Agentic AI Multi-agent simulation AI Safety / Alignment
Large language models instructed to simulate expert human teams in military wargames exhibit superficial agreement on actions but display dangerous escalation tendencies and cannot accurately model distinct personality traits.
Core Problem
It is unknown whether AI agents, increasingly considered for military decision-making, actually align with expert human behavior in high-stakes crisis scenarios or if they introduce new risks.
Why it matters:
  • Governments are investing in AI for military command and control, but potential misalignment could lead to unintentional escalation or nuclear use
  • Prior research relied on small sample sizes or lacked direct comparison to large groups of national security experts
  • Current models may harbor intrinsic biases toward violence or 'farcical harmony' that do not reflect realistic human strategic friction
Concrete Example: In a simulated US-China crisis, GPT-3.5 agents instructed to act as US decision-makers frequently chose to 'Fire at Chinese Vessels' and 'Activate Civilian Draft'โ€”actions significantly more aggressive than those chosen by actual human experts.
Key Novelty
Large-Scale Expert Behavioral Benchmarking
  • compares LLM-simulated agents against a dataset of 214 real-world national security experts (48 teams) in a complex, multi-stage wargame
  • analyzes the impact of 'simulated dialog' between agents versus direct action selection, revealing that simulating conversation paradoxically increases aggressiveness
  • probes agent sensitivity to personality prompts (e.g., 'pacifist' vs 'aggressive sociopath'), finding LLMs fail to alter behavior based on these traits
Architecture
Architecture Figure Figure 1
Experimental setup showing the flow from Player/LLM to Team Dialog to Action Selection within the Wargame context
Evaluation Highlights
  • GPT-3.5 statistically matches human action frequency on 16 of 21 possible game actions, significantly higher than GPT-4 (10 matches) or GPT-4o (9 matches)
  • Despite frequency matching, LLMs show qualitative failure: GPT-3.5 tends towards extreme escalation (firing on vessels), while GPT-4/4o prefer passive escalation (cyber/intel ops)
  • Simulated agents fail to adopt extreme personality traits; no statistically significant difference in behavior was found between agents prompted as 'pacifists' vs. 'aggressive sociopaths'
Breakthrough Assessment
7/10
Strong empirical paper providing a rare dataset of expert human behavior to benchmark agents. Findings on 'farcical harmony' and the failure of personality prompting are significant safety warnings.
×