The Influence of Human-inspired Agentic Sophistication in LLM-driven Strategic Reasoners

📝 Paper Summary

Agentic AI, Multi-agent simulation, Game Theory

The paper investigates whether increasing the architectural complexity of LLM-based agents—through decoupled reasoning and role-playing profiles—improves their ability to replicate human strategic behavior in guessing games.

Core Problem

LLMs are increasingly treated as agents, but it is unclear if their strategic reasoning aligns with human behavior (bounded rationality) or if adding agentic sophistication merely makes them theoretical optimizers.

Why it matters:

Agent-based models in human-centered domains require validation that agents behave reliably and understandably, not just optimally
Current game-theoretic benchmarks often lack standardized frameworks for hosting heterogeneous agent architectures (LLMs vs. traditional models)
The 'black-box' nature of LLMs creates reproducibility and explainability gaps in social simulations

Concrete Example: In a 2-player guessing game (p=2/3), the theoretically optimal move is 0. However, humans rarely play 0 immediately. A standard game-theoretic model (EWA) might learn to play near 0 (mean 11.19), failing to simulate the actual human mean (29.05) and thus failing as a descriptive model of human behavior.
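
The claim that 0 is the theoretically optimal move can be checked by brute force. The sketch below assumes a win/tie/loss payoff of 1, 1/2 and 0 (the summary does not state the payoff scheme) and uses hypothetical function names. It verifies that guessing 0 is never worse than any other guess and is strictly better against at least one opponent guess.

```python
from fractions import Fraction

P = Fraction(2, 3)  # multiplier p from the example above


def payoff(a: int, b: int) -> Fraction:
    """Payoff to the player guessing `a` against `b` (assumed: win=1, tie=1/2, loss=0)."""
    target = P * Fraction(a + b, 2)  # p times the average of the two guesses
    da, db = abs(a - target), abs(b - target)
    if da < db:
        return Fraction(1)
    if da == db:
        return Fraction(1, 2)
    return Fraction(0)


def zero_is_weakly_dominant(low: int = 0, high: int = 100) -> bool:
    """Check that 0 is never worse than any guess and sometimes strictly better."""
    guesses = range(low, high + 1)
    never_worse = all(
        payoff(0, b) >= payoff(a, b) for a in guesses for b in guesses
    )
    sometimes_better = all(
        any(payoff(0, b) > payoff(a, b) for b in guesses)
        for a in guesses
        if a != 0
    )
    return never_worse and sometimes_better
```

With two players and p=2/3 the target is (a+b)/3, so the lower of two distinct guesses is always closer to it. That is why the exhaustive check succeeds even though human players rarely guess 0.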

Key Novelty

Human-inspired Agentic Sophistication Framework

Decomposes agent design into 'Simple' (one-shot) vs 'Reasoner' (decoupled belief formation and decision) architectures to test if explicit reasoning steps improve human alignment
Integrates psychological 'Models of Appropriateness' (MoA) into prompts, forcing agents to ask 'What kind of person am I?' and 'What kind of situation is this?' before acting
Uses a centralized 'Umpire' framework to standardize interactions between LLM agents and traditional game-theoretic models (EWA) in guessing games

Evaluation Highlights

EWA (traditional model) diverges significantly from human behavior with a Wasserstein distance of 22.34, playing far more aggressively (mean 11.19) than humans (mean 29.05)
Human experts show significantly higher skewness (1.50) in their guess distribution compared to students (0.55), establishing a distinct behavioral target for agents
Qualitative finding: The relationship between agentic design complexity (adding profiles/reasoning steps) and human-likeness is non-linear, suggesting simple architectural augmentation has limits

Breakthrough Assessment

6/10

Solid methodology for evaluating agentic reasoning against human baselines using game theory. The framework is rigorous, though the specific quantitative results for LLM agents (beyond the EWA baseline) are cut off in the provided text.

⚙️ Technical Details

Problem Definition

Setting: Two-player perspective-based Guessing Games (p-Beauty Contests) where p=2/3 and the range is integers [0, 100]

Inputs: Natural language game description x, Agent Context c (profile), Instruction model m

Outputs: Integer guess a in [0, 100]

Pipeline Flow

Umpire (initializes game and agents)
Agent Interpretation Function (processes game description)
Reasoning/Decision Module (generates guess)

System Modules

Umpire

Manages gameplay, pairs agents, and translates natural language descriptions into formal representations for non-LLM agents

Model or implementation: Code-based controller
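
A minimal sketch of what such a code-based controller could look like, assuming a hypothetical `GameSpec` structure for the formal representation and a hypothetical `Agent` wrapper. None of these names or the description wording come from the paper. LLM agents receive the natural-language text, while formal agents (such as EWA) receive the structured spec.

```python
import random
from dataclasses import dataclass
from fractions import Fraction
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class GameSpec:
    """Formal representation handed to non-LLM agents (hypothetical structure)."""
    p: Fraction = Fraction(2, 3)
    low: int = 0
    high: int = 100

    def describe(self) -> str:
        """Natural-language rendering for LLM agents (illustrative wording)."""
        return (
            f"Two players each pick an integer from {self.low} to {self.high}. "
            f"The guess closest to {self.p} of the average guess wins."
        )


@dataclass
class Agent:
    name: str
    act: Callable[[object], int]  # receives a str (LLM agent) or a GameSpec
    reads_text: bool


class Umpire:
    def __init__(self, spec: GameSpec, seed: int = 0) -> None:
        self.spec = spec
        self.rng = random.Random(seed)

    def pair(self, agents: List[Agent]) -> List[Tuple[Agent, Agent]]:
        """Shuffle the agents and pair them off; an odd agent out sits idle."""
        pool = list(agents)
        self.rng.shuffle(pool)
        return [(pool[i], pool[i + 1]) for i in range(0, len(pool) - 1, 2)]

    def ask(self, agent: Agent) -> int:
        """Give the agent the view it understands and clamp its guess to the range."""
        view = self.spec.describe() if agent.reads_text else self.spec
        guess = int(agent.act(view))
        return max(self.spec.low, min(self.spec.high, guess))

    def play(self, a: Agent, b: Agent) -> Tuple[int, int, str]:
        """Collect both guesses and report the winner's name, or 'tie'."""
        ga, gb = self.ask(a), self.ask(b)
        target = self.spec.p * Fraction(ga + gb, 2)
        da, db = abs(ga - target), abs(gb - target)
        winner = "tie" if da == db else (a.name if da < db else b.name)
        return ga, gb, winner
```

Keeping the game rules in one place means heterogeneous agents can be compared on identical terms, which is the standardization role the summary attributes to the Umpire.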

Simple Agent (S) (Agent Inference)

Directly maps game description and context to a numerical guess in one step

Model or implementation: Claude 3.5 Haiku or Claude 3.7 Sonnet

Reasoner Agent (R) (Agent Inference)

Decouples reasoning (forming belief about opponent) from decision (selecting best response)

Model or implementation: Claude 3.5 Haiku or Claude 3.7 Sonnet
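
The difference between the two agent types can be sketched as one model call versus two. The code below is an illustration only. `LLM` is a hypothetical stand-in for any chat-model call (prompt in, reply text out), and the prompt wording is invented here because the paper's templates are not provided.

```python
import re
from typing import Callable

LLM = Callable[[str], str]  # hypothetical: prompt text -> reply text


def parse_guess(reply: str, low: int = 0, high: int = 100) -> int:
    """Take the first integer in the reply and clamp it to [low, high]."""
    match = re.search(r"-?\d+", reply)
    if match is None:
        raise ValueError(f"no integer found in reply: {reply!r}")
    return max(low, min(high, int(match.group())))


def simple_agent(llm: LLM, game_text: str, context: str = "") -> int:
    """Simple (S): map description and context to a guess in one call."""
    prompt = f"{context}\n{game_text}\nReply with a single integer guess."
    return parse_guess(llm(prompt))


def reasoner_agent(llm: LLM, game_text: str, context: str = "") -> int:
    """Reasoner (R): first form a belief about the opponent, then decide."""
    belief = llm(
        f"{context}\n{game_text}\n"
        "What do you expect your opponent to guess, and why?"
    )
    decision = llm(
        f"{context}\n{game_text}\n"
        f"Your belief about the opponent: {belief}\n"
        "Given this belief, reply with a single integer guess."
    )
    return parse_guess(decision)
```

Any callable can be passed as `llm`, so the two pipelines can be exercised with a stub before connecting a real model client.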

Novel Architectural Elements

Integration of Model of Appropriateness (MoA) into the prompt structure, explicitly asking identity/situation questions
Formal distinction between 'Simple' (direct mapping) and 'Reasoner' (belief-decision split) interpretations within a unified OODA-based framework
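
As an illustration of how MoA questions might be placed in a prompt, the sketch below assembles a prompt from an optional profile, the game description, and the identity and situation questions quoted earlier. The third question, the layout, and the function name are assumptions; the paper's actual templates are not available.

```python
from typing import Optional

MOA_QUESTIONS = (
    "What kind of person am I?",
    "What kind of situation is this?",
    # Assumed third question covering the 'appropriate rules' part of MoA.
    "What does a person like me do in a situation like this?",
)


def build_prompt(
    game_text: str, profile: Optional[str] = None, use_moa: bool = True
) -> str:
    """Assemble a prompt; the layout here is illustrative, not the paper's."""
    parts = []
    if profile:
        parts.append(f"Profile: {profile}")
    parts.append(f"Game: {game_text}")
    if use_moa:
        parts.append("Before answering, consider:")
        parts.extend(f"{i}. {q}" for i, q in enumerate(MOA_QUESTIONS, start=1))
    parts.append("Then reply with one integer from 0 to 100.")
    return "\n".join(parts)
```

Toggling `use_moa` and `profile` independently mirrors the idea of varying instructions and contexts as separate experimental factors.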

Modeling

Base Model: Claude 3.5 Haiku and Claude 3.7 Sonnet

Comparison to Prior Work

vs. EWA: EWA uses formal mathematical updates and converges to high rationality; the proposed LLM agents use natural-language reasoning and aim to approximate bounded human rationality
vs. Standard LLM Prompting: Incorporates 'Model of Appropriateness' (MoA) psychological questions into the instruction set

Limitations

Analysis is limited to two-player one-shot guessing games, reducing complexity compared to n-player iterative games
EWA baseline cannot process natural language, requiring a separate translation step by the Umpire
Relationship between design complexity and human-likeness is non-linear, suggesting diminishing returns or complexity penalties

Reproducibility

No code repository provided in the text. EWA parameters are explicitly listed (lambda=2.39, tau=1.5, kappa=0, N(0)=1). Human dataset source cited ([28]). Prompts described conceptually (Contexts: No profile, Simple, Bio; Instructions: m0, m1) but full templates not provided.
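
For readers who want to see where parameters such as lambda, kappa and N(0) enter, here is a sketch of one attraction update and the logit choice rule in the commonly cited Camerer-Ho formulation of EWA. It is an assumption that the paper uses exactly this variant. The decay `phi` and foregone-payoff weight `delta` are not listed in the summary, so any values passed for them are placeholders.

```python
import math
from typing import List, Tuple


def ewa_update(
    attractions: List[float],
    n_prev: float,
    chosen: int,
    payoffs: List[float],
    phi: float,
    delta: float,
    kappa: float,
) -> Tuple[List[float], float]:
    """One EWA step (textbook form; not confirmed as the paper's variant).

    payoffs[j] is what strategy j would have earned against the opponent's
    actual play; unchosen strategies are weighted by delta.
    """
    n_new = phi * (1 - kappa) * n_prev + 1  # experience weight N(t)
    updated = []
    for j, (attraction, pay) in enumerate(zip(attractions, payoffs)):
        weight = 1.0 if j == chosen else delta
        updated.append((phi * n_prev * attraction + weight * pay) / n_new)
    return updated, n_new


def logit_choice(attractions: List[float], lam: float) -> List[float]:
    """Softmax over attractions with sensitivity lam (lambda)."""
    top = max(attractions)  # subtract the max for numerical stability
    exps = [math.exp(lam * (a - top)) for a in attractions]
    total = sum(exps)
    return [e / total for e in exps]
```

A call such as `ewa_update([0.0, 0.0], n_prev=1.0, chosen=0, payoffs=[1.0, 1.0], phi=1.0, delta=0.5, kappa=0.0)` followed by `logit_choice(..., lam=2.39)` uses the listed N(0), kappa and lambda together with placeholder `phi` and `delta`.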

📊 Experiments & Results

Evaluation Setup

Two-player guessing game (p=2/3) simulation

Benchmarks:

Human Guessing Game Dataset (Strategic Reasoning)

Metrics:

Mean guess value
Wasserstein distance (between agent and human distributions)
Frequency of zero guesses (dominant strategy)
Statistical methodology: Levene’s test for variance homogeneity; Independent samples t-test or Welch’s t-test for means; Mann–Whitney U for skewed data.
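
The distribution-level metrics above can be computed without third-party libraries when guesses are integers in [0, 100]. The sketch below is one possible implementation with hypothetical function names. It uses the fact that the 1-D Wasserstein-1 distance equals the area between the two empirical CDFs, and it computes population (biased) skewness, which may differ from the estimator used in the paper.

```python
from collections import Counter
from statistics import fmean
from typing import Sequence


def wasserstein_1d(
    xs: Sequence[int], ys: Sequence[int], low: int = 0, high: int = 100
) -> float:
    """Wasserstein-1 distance between two samples of integers in [low, high]."""
    cx, cy = Counter(xs), Counter(ys)
    fx = fy = 0.0  # running empirical CDFs
    dist = 0.0
    for k in range(low, high):  # both CDFs equal 1 at `high`, so stop before it
        fx += cx[k] / len(xs)
        fy += cy[k] / len(ys)
        dist += abs(fx - fy)  # CDFs are constant on [k, k+1)
    return dist


def skewness(xs: Sequence[float]) -> float:
    """Population skewness: third central moment over variance^(3/2)."""
    mu = fmean(xs)
    m2 = fmean([(x - mu) ** 2 for x in xs])
    m3 = fmean([(x - mu) ** 3 for x in xs])
    return m3 / m2 ** 1.5


def zero_rate(xs: Sequence[int]) -> float:
    """Share of guesses equal to 0 (the dominant strategy)."""
    return sum(1 for x in xs if x == 0) / len(xs)
```

The distance is expressed in the same units as the guesses, so a value such as 22.34 can be read as an average displacement of about 22 points on the 0 to 100 scale.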

Key Results

Comparison	Metric	Reference	Compared	Δ
EWA vs. humans	Mean guess	29.05 (humans)	11.19 (EWA)	-17.86
EWA vs. humans	Wasserstein distance	0.00 (human reference)	22.34 (EWA)	+22.34
Experts vs. students	Skewness	0.55 (students)	1.50 (experts)	+0.95

Experiment Figures

Comparison of guess distributions between the EWA agent and Human participants

Main Takeaways

Traditional game-theoretic models (EWA) are 'super-human' in rationality (mean 11.19 vs 29.05) but fail to capture the bounded rationality of human players, evidenced by a large Wasserstein distance.
Human experts are distinct from students, showing significantly lower means and higher skewness (1.50 vs 0.55), providing two distinct targets for agent calibration.
The paper posits that adding agentic complexity (reasoning steps, profiles) does not linearly improve the approximation of human behavior; the effect depends heavily on the capabilities of the underlying LLM.

📚 Prerequisite Knowledge

Prerequisites

Game Theory (Nash Equilibrium, dominated strategies)
k-level reasoning theory
Basic LLM prompting strategies (Chain-of-Thought, Persona prompting)

Key Terms

Guessing Game: A game where players each guess a number from 0 to 100, and the winner is the player whose guess is closest to p times the average guess (here p=2/3)

EWA: Experience-Weighted Attraction, a game-theoretic learning model that updates strategy weights based on both realized payoffs and 'foregone' payoffs (what unchosen strategies would have earned)

k-level reasoning: A cognitive hierarchy theory where level-0 plays randomly, level-1 plays best response to level-0, and level-k plays best response to level-(k-1)

MoA: Model of Appropriateness—a decision-making framework where agents assess situation identity, personal identity, and appropriate rules before acting

OODA loop: Observe-Orient-Decide-Act—a military strategy concept adapted here for agent control flow

Wasserstein distance: A distance metric between probability distributions, measuring the 'work' needed to transform one distribution into the other

Poisson distribution: A probability distribution used here to model the initial attraction weights for the EWA agent
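
The k-level reasoning entry above can be made concrete with the textbook many-player approximation: level-0 guesses uniformly at random (mean 50), and each higher level guesses p times the guess of the level below it. This is a simplification and an assumption here. In the two-player game studied in the paper any lower guess beats a higher one, so the paper's agents need not follow this ladder. A Poisson weight over levels is included only to show the standard form of the distribution; the function names are hypothetical.

```python
import math


def level_k_guess(k: int, p: float = 2 / 3, level0_mean: float = 50.0) -> float:
    """Guess of a level-k player who plays p times the level-(k-1) guess."""
    return level0_mean * p ** k


def poisson_pmf(k: int, tau: float) -> float:
    """Poisson probability of level k with mean tau."""
    return math.exp(-tau) * tau ** k / math.factorial(k)


if __name__ == "__main__":
    for k in range(5):
        print(k, round(level_k_guess(k), 2), round(poisson_pmf(k, 1.5), 3))
```

Successive levels shrink geometrically toward 0, which is one way to see how bounded depth of reasoning keeps typical human guesses well above the equilibrium.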