Evaluation Setup
A quasi-experimental wargame (US vs. China, set in 2026) with two moves. Move 1: crisis response and rules of engagement (ROEs). Move 2: response to an accidental escalation.
Benchmarks:
- US-China Wargame (Custom) (Strategic Decision Making) [New]
Metrics:
- Action Frequency Match (number of actions, out of 21, for which the LLM's selection frequency statistically matches the human frequency)
- Response Vector Aggressiveness
- Conditional Probability of Escalation (Consistency)
- Statistical methodology: Bootstrap resampling at the 95% confidence level; Linear Discriminant Analysis for visualization
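The Action Frequency Match metric combined with bootstrap resampling can be sketched as follows. This is a minimal illustration, not the paper's procedure: it assumes each action is recorded as a 0/1 list (selected or not) per participant or run, and it treats two frequencies as "matched" when their 95% percentile bootstrap intervals overlap. The function names `bootstrap_ci` and `count_matched_actions` and the overlap criterion are assumptions made for this example.

```python
import random

def bootstrap_ci(selections, rng, n_boot=2000, level=0.95):
    """Percentile bootstrap interval for the share of 1s in a 0/1 list."""
    n = len(selections)
    # Resample with replacement and record the selection frequency each time.
    freqs = sorted(sum(rng.choices(selections, k=n)) / n for _ in range(n_boot))
    lo_idx = int((1 - level) / 2 * n_boot)
    hi_idx = int((1 + level) / 2 * n_boot) - 1
    return freqs[lo_idx], freqs[hi_idx]

def count_matched_actions(human, llm, seed=0, **kwargs):
    """Count actions whose human and LLM bootstrap intervals overlap.

    The overlap rule is an assumed matching criterion for illustration.
    """
    rng = random.Random(seed)
    matched = 0
    for action in human:
        h_lo, h_hi = bootstrap_ci(human[action], rng, **kwargs)
        l_lo, l_hi = bootstrap_ci(llm[action], rng, **kwargs)
        if h_lo <= l_hi and l_lo <= h_hi:
            matched += 1
    return matched

# Hypothetical data: 100 observations per group, two of the 21 actions.
human = {"Cyber Operations": [1] * 30 + [0] * 70,
         "Activate Draft": [1] * 10 + [0] * 90}
llm = {"Cyber Operations": [1] * 32 + [0] * 68,
       "Activate Draft": [1] * 90 + [0] * 10}
matched = count_matched_actions(human, llm)
```

With this made-up data, the frequencies for the first action are close and the second differ sharply, so only one of the two actions would count as matched.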
Key Results
Comparison of how often LLM action choices statistically matched the frequency of human expert choices across the 21 possible actions in the game.
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| US-China Wargame | Matched Actions Count (Max 21) | 21 | 16 | -5 |
| US-China Wargame | Matched Actions Count (Max 21) | 21 | 10 | -11 |
| US-China Wargame | Matched Actions Count (Max 21) | 21 | 9 | -12 |
Main Takeaways
- Simulating dialog between agents increases the aggressiveness of the final decision compared to asking for a direct decision, with dialogs exhibiting 'farcical harmony' rather than realistic debate
- GPT-3.5 is the most aggressive model, frequently choosing 'Fire at Chinese Vessels' and 'Activate Draft', whereas GPT-4/4o prefer 'Domestic Intelligence' and 'Cyber Operations'
- LLMs are insensitive to extreme personality prompting; agents prompted as 'pacifists' or 'aggressive sociopaths' showed no statistically significant difference in action selection
- While GPT-3.5 matches the raw frequency of individual human actions best (16/21), GPT-4 better captures the *conditional probability* (consistency) of human escalation behavior (e.g., probability of being aggressive in Move 2 given aggression in Move 1)
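The conditional probability of escalation mentioned in the last takeaway can be estimated from paired move outcomes. The sketch below assumes each game has already been reduced to a pair of booleans (aggressive in Move 1, aggressive in Move 2); how the paper labels a response as aggressive is not specified here, and the function name is hypothetical.

```python
def conditional_escalation_probability(games):
    """Estimate P(aggressive in Move 2 | aggressive in Move 1).

    `games` is a list of (move1_aggressive, move2_aggressive) boolean pairs.
    Returns None when no game was aggressive in Move 1.
    """
    # Keep only the Move 2 outcomes of games that were aggressive in Move 1.
    move2_outcomes = [m2 for m1, m2 in games if m1]
    if not move2_outcomes:
        return None
    return sum(move2_outcomes) / len(move2_outcomes)

# Hypothetical outcomes for four games.
games = [(True, True), (True, False), (False, True), (True, True)]
p_escalate = conditional_escalation_probability(games)
```

Computing this value separately for human players and for each model allows a comparison of consistency across moves, independent of how often any single action is chosen.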