Human vs. Machine: Behavioral Differences Between Expert Humans and Language Models in Wargame Simulations

📝 Paper Summary

Agentic AI · Multi-agent simulation · AI Safety / Alignment

Large language models instructed to simulate expert human teams in military wargames exhibit superficial agreement on actions but display dangerous escalation tendencies and cannot accurately model distinct personality traits.

Core Problem

It is unknown whether AI agents, increasingly considered for military decision-making, actually align with expert human behavior in high-stakes crisis scenarios or if they introduce new risks.

Why it matters:

Governments are investing in AI for military command and control, but potential misalignment could lead to unintentional escalation or nuclear use
Prior research relied on small sample sizes or lacked direct comparison to large groups of national security experts
Current models may harbor intrinsic biases toward violence or 'farcical harmony' that do not reflect realistic human strategic friction

Concrete Example: In a simulated US-China crisis, GPT-3.5 agents instructed to act as US decision-makers frequently chose to 'Fire at Chinese Vessels' and 'Activate Civilian Draft', actions significantly more aggressive than those chosen by the human experts.

Key Novelty

Large-Scale Expert Behavioral Benchmarking

Compares LLM-simulated agents against a dataset of 214 real-world national security experts (48 teams) in a complex, multi-stage wargame
Analyzes the impact of 'simulated dialog' between agents versus direct action selection, revealing that simulating conversation paradoxically increases aggressiveness
Probes agent sensitivity to personality prompts (e.g., 'pacifist' vs. 'aggressive sociopath'), finding that LLMs fail to alter behavior based on these traits

Architecture

Experimental setup showing the flow from Player/LLM to Team Dialog to Action Selection within the Wargame context

Evaluation Highlights

GPT-3.5 statistically matches human action frequency on 16 of 21 possible game actions, significantly higher than GPT-4 (10 matches) or GPT-4o (9 matches)
Despite frequency matching, LLMs show qualitative failure: GPT-3.5 tends towards extreme escalation (firing on vessels), while GPT-4/4o prefer passive escalation (cyber/intel ops)
Simulated agents fail to adopt extreme personality traits; no statistically significant difference in behavior was found between agents prompted as 'pacifists' vs. 'aggressive sociopaths'

Breakthrough Assessment

7/10

Strong empirical paper providing a rare dataset of expert human behavior to benchmark agents. Findings on 'farcical harmony' and the failure of personality prompting are significant safety warnings.

⚙️ Technical Details

Problem Definition

Setting: Multi-stage wargame simulation (US vs. China crisis) with imperfect information and open-ended action spaces

Inputs: Scenario briefing, player background attributes (experience, age), strategic priorities, and game state updates

Outputs: Simulated dialog between team members and a final 'Response Vector' (binary selection over 21 possible actions)
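As an illustration of the 'Response Vector' output, the sketch below encodes a set of chosen actions as a 21-element binary list. It is a minimal sketch, not the paper's implementation: the action ordering is an assumption, only the four action names quoted in this summary are real, and the remaining 17 entries are placeholders.

```python
from typing import Iterable, List

# Action names quoted in this summary; their ordering here is an assumption.
NAMED_ACTIONS = [
    "Fire at Chinese Vessels",
    "Activate Civilian Draft",
    "Domestic Intelligence",
    "Cyber Operations",
]
# Placeholders (hypothetical) pad the list to the game's 21 possible actions.
ACTIONS = NAMED_ACTIONS + [
    f"placeholder_action_{i}" for i in range(len(NAMED_ACTIONS), 21)
]


def to_response_vector(chosen: Iterable[str]) -> List[int]:
    """Return a binary vector with 1 at the position of each chosen action."""
    chosen_set = set(chosen)
    unknown = chosen_set - set(ACTIONS)
    if unknown:
        raise ValueError(f"Unknown actions: {sorted(unknown)}")
    return [1 if action in chosen_set else 0 for action in ACTIONS]


# Example: a team that selects two of the less kinetic options.
vector = to_response_vector(["Cyber Operations", "Domestic Intelligence"])
```

A fixed action order keeps vectors from different teams and models directly comparable, position by position.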

Pipeline Flow

Input Processing (Scenario + Personas)
Dialog Simulation (Optional/Variable)
Action Selection

System Modules

Persona Generator

Injects player attributes (age, background) and game briefing into the context

Model or implementation: Prompt-based

Team Simulator

Simulates conversation between team members to reach a decision

Model or implementation: gpt-3.5-turbo-16k / gpt-4-1106-preview / gpt-4o

Decision Engine

Selects final actions based on the discussion or direct prompt

Model or implementation: gpt-3.5-turbo-16k / gpt-4-1106-preview / gpt-4o

Novel Architectural Elements

Comparative pipeline that toggles 'simulated dialog' to measure its causal effect on decision aggressiveness
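The dialog toggle described above can be sketched as follows. This is a hedged illustration rather than the repository's code: `ModelFn`, `build_context`, `run_move`, and the prompt wording are all hypothetical, and the model is replaced by a stub so the example runs without any API access.

```python
from typing import Callable, List

# Hypothetical interface: any function that maps a prompt to a completion.
ModelFn = Callable[[str], str]


def build_context(briefing: str, personas: List[str]) -> str:
    """Input processing: combine the scenario briefing with player attributes."""
    roster = "\n".join(f"- {p}" for p in personas)
    return f"{briefing}\n\nTeam members:\n{roster}"


def run_move(model: ModelFn, briefing: str, personas: List[str],
             use_dialog: bool) -> str:
    """Run one move, optionally simulating a team dialog before action selection."""
    context = build_context(briefing, personas)
    if use_dialog:
        # Dialog simulation: the generated discussion is appended to the context.
        dialog = model(context + "\n\nSimulate the team's discussion.")
        context = context + "\n\nDiscussion:\n" + dialog
    # Action selection: same final prompt in both conditions.
    return model(context + "\n\nSelect the team's actions.")


# Stub model that records prompts instead of calling a real LLM.
calls: List[str] = []


def stub_model(prompt: str) -> str:
    calls.append(prompt)
    return f"reply {len(calls)}"


direct = run_move(stub_model, "Briefing", ["Analyst, age 40"], use_dialog=False)
with_dialog = run_move(stub_model, "Briefing", ["Analyst, age 40"], use_dialog=True)
```

Holding the final action-selection prompt constant across both conditions is what lets any difference in aggressiveness be attributed to the simulated dialog.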

Modeling

Base Model: gpt-3.5-turbo-16k, gpt-4-1106-preview, and gpt-4o

Compute: Not reported in the paper (Inference-only study)

Comparison to Prior Work

vs. Emery (2021): Uses generative LLMs simulating teams rather than traditional algorithmic game theory models
vs. Rivera et al. (2024): Compares directly to a large (N=214) expert human baseline rather than just inter-model comparison

Limitations

LLMs failed to model requested character traits (pacifist/sociopath), suggesting prompt insensitivity for high-level behavior
Study restricted to a specific US-China scenario; generalization to other conflicts unknown
Simulated dialogs lacked realistic human friction ('farcical harmony'), limiting the fidelity of the simulation
Human sample biased towards US national security perspective

Reproducibility

Code: https://github.com/ancorso/LLMWargaming

Code, materials, and prompt templates are available at github.com/ancorso/LLMWargaming. The human player data (214 experts) is partially available in anonymized form; privacy-violating information is excluded.

📊 Experiments & Results

Evaluation Setup

Quasi-experimental wargame (US vs. China, 2026) with two moves. Move 1: crisis response and rules of engagement (ROEs). Move 2: response to accidental escalation.

Benchmarks:

US-China Wargame (Custom) (Strategic Decision Making) [New]

Metrics:

Action Frequency Match (number of actions for which the LLM's selection frequency statistically matches the human frequency)
Response Vector Aggressiveness
Conditional Probability of Escalation (Consistency)
Statistical methodology: Bootstrap resampling at the 95% confidence level; Linear Discriminant Analysis (LDA) for visualization
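One way an action-frequency match could be tested with bootstrap resampling is sketched below. The exact procedure used in the paper is not given in this summary, so the criterion here (a 95% percentile interval on the difference in selection rates that contains zero) is an assumption, as are the function names and the illustrative data.

```python
import random
from typing import List, Tuple


def bootstrap_diff_ci(human: List[int], llm: List[int], n_boot: int = 2000,
                      alpha: float = 0.05, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap interval for (LLM rate - human rate) on one action."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        # Resample each group with replacement, keeping group sizes fixed.
        h = rng.choices(human, k=len(human))
        m = rng.choices(llm, k=len(llm))
        diffs.append(sum(m) / len(m) - sum(h) / len(h))
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_boot)]
    upper = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


def frequencies_match(human: List[int], llm: List[int]) -> bool:
    """Assumed criterion: a match if the 95% interval contains zero."""
    lower, upper = bootstrap_diff_ci(human, llm)
    return lower <= 0.0 <= upper


# Illustrative data only (not from the paper): 1 = the team chose the action.
human_choices = [1] * 24 + [0] * 24   # 48 teams, half chose the action
similar_llm = [1] * 24 + [0] * 24     # same selection rate
divergent_llm = [1] * 48              # always chooses the action

similar_result = frequencies_match(human_choices, similar_llm)
divergent_result = frequencies_match(human_choices, divergent_llm)
```

Counting the actions for which `frequencies_match` returns `True` would then give a matched-actions count out of 21 under this assumed criterion.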

Key Results

Comparison of how often LLM action choices statistically matched the frequency of human expert choices across the 21 possible actions in the game.
Benchmark	Model	Metric	Baseline	This Paper	Δ
US-China Wargame	GPT-3.5	Matched Actions Count (Max 21)	21	16	-5
US-China Wargame	GPT-4	Matched Actions Count (Max 21)	21	10	-11
US-China Wargame	GPT-4o	Matched Actions Count (Max 21)	21	9	-12

Experiment Figures

Linear Discriminant Analysis (LDA) projection of response vectors for humans vs. LLMs

Aggressiveness score vs. Length of Simulated Dialog

Main Takeaways

Simulating dialog between agents increases the aggressiveness of the final decision compared to asking for a direct decision, with dialogs exhibiting 'farcical harmony' rather than realistic debate
GPT-3.5 is the most aggressive model, frequently choosing 'Fire at Chinese Vessels' and 'Activate Draft', whereas GPT-4/4o prefer 'Domestic Intelligence' and 'Cyber Operations'
LLMs are insensitive to extreme personality prompting; agents prompted as 'pacifists' or 'aggressive sociopaths' showed no statistically significant difference in action selection
While GPT-3.5 matches the raw frequency of individual human actions best (16/21), GPT-4 better captures the *conditional probability* (consistency) of human escalation behavior (e.g., probability of being aggressive in Move 2 given aggression in Move 1)
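The conditional-probability consistency measure mentioned above can be illustrated with a short sketch. How a move is classified as aggressive is not specified in this summary, so the boolean flags, the function name, and the sample data below are assumptions for illustration only.

```python
from typing import List, Optional, Tuple


def conditional_escalation(games: List[Tuple[bool, bool]]) -> Optional[float]:
    """Estimate P(aggressive in Move 2 | aggressive in Move 1).

    Each game is a pair (aggressive_in_move_1, aggressive_in_move_2).
    Returns None when no game was aggressive in Move 1.
    """
    move2_given_move1 = [m2 for m1, m2 in games if m1]
    if not move2_given_move1:
        return None
    return sum(move2_given_move1) / len(move2_given_move1)


# Illustrative data only (not from the paper).
games = [(True, True), (True, False), (False, True), (True, True), (False, False)]
p = conditional_escalation(games)  # 2 of the 3 Move-1-aggressive games escalate
```

Comparing this estimate between human teams and each model captures whether escalation patterns persist across moves, which raw per-action frequencies cannot show.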

📚 Prerequisite Knowledge

Prerequisites

Familiarity with Large Language Models (LLMs) and prompting strategies
Basic understanding of game theory or simulation environments
Knowledge of statistical significance testing (bootstrapping)

Key Terms

Wargame: A strategy game that simulates military conflict scenarios to test decisions and outcomes

Response Vector: A binary vector representing the set of actions chosen by a player or agent team in a single game move

Linear Discriminant Analysis (LDA): A statistical method used here to project high-dimensional action vectors into 2D space to visualize behavioral clustering

Farcical Harmony: The tendency of LLM agents simulating a team to agree with each other artificially quickly, lacking the realistic debate or friction of human teams

Rules of engagement (ROEs): Directives that define the circumstances under which military forces may initiate combat

Quasi-experimental design: An empirical study design used to estimate the causal impact of an intervention on a target population without random assignment

GPT-4o: An omni-model version of GPT-4 by OpenAI, optimized for speed and multimodal capabilities