Chaotic Dynamics in Multi-LLM Deliberation

📝 Paper Summary

Multi-agent Collective AI systems

Multi-LLM committees exhibit chaotic instability: nominally identical inputs produce diverging outcomes. The instability is driven by role differentiation and model heterogeneity, and it can be reduced by Chair ablation or by shortening argument memory.

Core Problem

Multi-LLM committees used for governance are often assumed to be deterministic at temperature T=0, but they exhibit structural instability where nominally identical runs diverge into different decisions.

Why it matters:

Reproducibility is a governance property; if identical runs yield different policies, institutions face critical uncertainty.
Current evaluations rely on one-shot metrics that miss trajectory sensitivity, leading to a false sense of security.
Unpredictability limits controllability and explainability in high-stakes collective AI decision-making.

Concrete Example: In the HL-01 benchmark scenario, five agents debating health policy diverge into different collective mean preferences across runs even at T=0, with divergence growing exponentially over deliberation rounds.

Key Novelty

Stability Auditing for Multi-LLM Committees

Models committee deliberation as a random dynamical system to quantify instability using an empirical Lyapunov exponent derived from trajectory divergence.
Identifies two distinct, non-additive routes to chaos: institutional role differentiation (e.g., assigning a Chair) and compositional heterogeneity (mixing model families).
Demonstrates that instability is not just thermal noise (persists at T=0) but is structurally induced by protocol design, specifically memory depth and synthesis roles.
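
One plausible way to turn trajectory divergence into an empirical Lyapunov exponent is to fit a line to the log of the mean distance between replicate runs, round by round. The sketch below illustrates that general idea only. The function name `empirical_lyapunov`, the array layout, the Euclidean distance, and the least-squares fit are all assumptions, and the paper's actual estimator may differ.

```python
import itertools
import numpy as np

def empirical_lyapunov(runs, eps=1e-12):
    """Slope of log mean pairwise distance versus round index.

    runs: array of shape (n_runs, n_rounds, 3); each row is a committee-mean
    preference state (p_A, p_B, p_C). Hypothetical layout, not from the paper.
    """
    runs = np.asarray(runs, dtype=float)
    n_runs, n_rounds, _ = runs.shape
    if n_runs < 2:
        raise ValueError("need at least two replicate runs")
    pairs = itertools.combinations(range(n_runs), 2)
    # Mean Euclidean distance between every pair of replicate runs, per round.
    dists = np.mean(
        [np.linalg.norm(runs[i] - runs[j], axis=1) for i, j in pairs], axis=0
    )
    rounds = np.arange(n_rounds)
    # Least-squares line through the log-distances; its slope is the estimate.
    slope, _intercept = np.polyfit(rounds, np.log(dists + eps), 1)
    return float(slope)
```

A positive slope means the replicate runs separate roughly exponentially over rounds; a slope near zero means they stay about as far apart as they started.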

Evaluation Highlights

Heterogeneous committees (mixed models) without roles show markedly higher divergence (Lyapunov exponent = 0.0947) than the homogeneous no-role baseline (0.0221).
Adding role mandates to homogeneous committees increases instability (Lyapunov exponent increases from 0.0221 to 0.0541).
Reducing argument memory depth from k=15 to k=3 consistently lowers divergence across four tested scenarios.

Breakthrough Assessment

8/10

Strong empirical characterization of a critical but overlooked problem (instability at T=0) in multi-agent systems. Provides actionable design principles (role ablation, memory reduction) for governance.

⚙️ Technical Details

Problem Definition

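Setting: Five-agent committee deliberation over T rounds, modeled as a random dynamical system (here T counts deliberation rounds and is distinct from the sampling temperature in "T=0")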

Inputs: Scenario prompt, sliding transcript window (size k), committee state table

Outputs: Committee mean trajectory of preference states on a simplex

Pipeline Flow

Input: Scenario Prompt + History Window
Agent Deliberation (Argument + Preference State Generation)
Parsing (Extract p_A, p_B, p_C)
Aggregation (Committee Mean Calculation)
Update History Window
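
The steps above can be sketched as a simple loop. This is a minimal illustration, not the paper's code: the `Agent` callable, the `deliberate` function, and the choice that all agents in a round see the same window snapshot are assumptions. Parsing is folded into the agent stub, the committee state table is omitted, and the window is assumed to count individual arguments.

```python
from collections import deque
from typing import Callable, Deque, List, Tuple

Preference = Tuple[float, ...]  # (p_A, p_B, p_C)
# Hypothetical agent interface: (scenario, visible history) -> (argument, preference).
Agent = Callable[[str, List[str]], Tuple[str, Preference]]

def deliberate(scenario: str, agents: List[Agent], rounds: int, k: int) -> List[Preference]:
    """Run the deliberation loop and return the committee-mean trajectory."""
    window: Deque[str] = deque(maxlen=k)  # sliding transcript window of size k
    trajectory: List[Preference] = []
    for _ in range(rounds):
        history = list(window)  # assumption: same snapshot for every agent in a round
        outputs = [agent(scenario, history) for agent in agents]
        prefs = [pref for _, pref in outputs]
        # Aggregation: component-wise mean of the agents' preference states.
        mean = tuple(sum(component) / len(prefs) for component in zip(*prefs))
        trajectory.append(mean)
        # Update: new arguments push out the oldest ones once the window is full.
        window.extend(argument for argument, _ in outputs)
    return trajectory
```

With real LLM-backed agents, small differences in any one argument re-enter the window and are seen by every agent in later rounds, which is the feedback path through which run-to-run differences can grow.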

System Modules

Committee Agent

Generate argument and structured preference state based on role and history

Model or implementation: Varied (GPT-4.1-mini, Claude Sonnet 4.6, Gemini 2.5 Flash, Grok-3-mini)

Deterministic Parser

Extract preference vectors from agent outputs

Model or implementation: Rule-based
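
A rule-based parser of this kind might look like the sketch below. The output format it expects (`p_A=0.50, p_B=0.30, p_C=0.20`), the function name `parse_preferences`, and the tolerance are hypothetical; the summary does not specify how agents format their preference state.

```python
import re
from typing import Optional, Tuple

# Hypothetical format: "p_A=0.50, p_B=0.30, p_C=0.20" somewhere in the agent output.
_PATTERN = re.compile(r"p_([ABC])\s*[=:]\s*([0-9]*\.?[0-9]+)")

def parse_preferences(text: str, tol: float = 1e-2) -> Optional[Tuple[float, float, float]]:
    """Return (p_A, p_B, p_C) renormalized onto the simplex, or None if malformed."""
    found = {label: float(value) for label, value in _PATTERN.findall(text)}
    if set(found) != {"A", "B", "C"}:
        return None  # at least one component is missing
    total = sum(found.values())
    if abs(total - 1.0) > tol:
        return None  # too far from the simplex to trust
    # Renormalize so that rounding in the agent output cannot leave the simplex.
    return (found["A"] / total, found["B"] / total, found["C"] / total)
```

Because the parser is deterministic, any run-to-run divergence observed downstream originates in the agent outputs rather than in the extraction step.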

Novel Architectural Elements

Application of empirical Lyapunov exponents to quantify inter-run sensitivity in multi-agent LLM systems.
Factorial design crossing role structure (Roles/NoRoles) with model composition (Uniform/Mixed) to isolate instability sources.

Modeling

Base Model: GPT-4.1-mini (Uniform condition) / Ensemble of GPT-4.1, Claude Sonnet 4.6, Gemini 2.5 Flash, Grok-3-mini (Mixed condition)

Compute: Not reported in the paper

Comparison to Prior Work

vs. Single-shot eval: Measures dynamic stability over time (20 rounds) rather than static accuracy.
vs. Standard Multi-Agent: Explicitly quantifies 'chaos' (exponential divergence) rather than just final vote outcomes.
vs. Agent-Safety (Shapira et al.): Uses a narrow dynamical definition of chaos (a positive empirical Lyapunov exponent) rather than a broad descriptive one.

Limitations

Analysis is limited to a specific windowed-summary deliberation protocol and 12 policy scenarios.
Does not yet link instability to external task quality (accuracy or decision harm).
Chair ablation results are scenario-contingent (significant in 2 of 5 tested scenarios).
Results focus on T=0; the Supplementary Information (SI) shows that instability persists at higher temperatures, but the main claims are specific to the nominally deterministic regime.

Reproducibility

Code S1 (Experiment and analysis scripts) and Data S1 (JSONL run artifacts) are mentioned in Supplementary Materials. Specific hyperparameters (T=0, k=15 default) are provided.

📊 Experiments & Results

Evaluation Setup

12 policy scenarios across immigration, health, income, climate, speech, and AI governance.

Benchmarks:

HL-01 (Health Policy) (Multi-round committee deliberation) [New]
IM-01 (Immigration) (Multi-round committee deliberation) [New]
CL-01 (Climate) (Multi-round committee deliberation) [New]
SP-03 (Speech) (Multi-round committee deliberation) [New]
AI-01 (AI Governance) (Multi-round committee deliberation) [New]

Metrics:

Empirical Lyapunov exponent (lambda_hat)
Statistical methodology: Bootstrap confidence intervals computed by replicate resampling.
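
Replicate resampling can be sketched as a percentile bootstrap over whole runs. The code below is an illustration under assumptions: the function name `bootstrap_ci`, the percentile method, and the defaults (`n_boot`, `alpha`, `seed`) are not taken from the paper.

```python
import numpy as np

def bootstrap_ci(runs, estimator, n_boot=1000, alpha=0.05, seed=0):
    """Percentile CI for estimator(runs), resampling whole runs with replacement.

    runs: array whose first axis indexes replicate runs (hypothetical layout).
    estimator: callable mapping a resampled array to a scalar statistic.
    """
    runs = np.asarray(runs, dtype=float)
    rng = np.random.default_rng(seed)
    n = runs.shape[0]
    stats = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # n replicate indices, with replacement
        stats[b] = estimator(runs[idx])
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)
```

Resampling entire runs, rather than individual rounds, keeps the within-run temporal dependence intact. If the estimator is built from pairwise distances between runs, duplicate draws produce zero-distance pairs, so such an estimator should skip or de-duplicate them.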

Key Results

Results demonstrating the impact of committee composition (Roles vs. NoRoles and Homogeneous vs. Mixed) on stability in the HL-01 benchmark.
Benchmark	Metric	Baseline	This Paper	Δ
HL-01	Lyapunov Exponent (lambda_hat)	0.0221	0.0541	+0.0320
HL-01	Lyapunov Exponent (lambda_hat)	0.0221	0.0947	+0.0726
HL-01	Lyapunov Exponent (lambda_hat)	0.0947	0.0519	-0.0428
Average across 4 scenarios	Lyapunov Exponent (lambda_hat)	See Notes	See Notes	Negative

Main Takeaways

Instability is design-induced via two routes: institutional differentiation (Roles) and compositional heterogeneity (Mixed Models).
These routes interact non-additively; adding roles to a mixed committee actually reduced instability compared to the mixed, no-role condition.
The Chair role is a dominant amplifier of instability; ablating the Chair role yielded the largest reduction in divergence for the HL-01 scenario.
Reducing the memory window (the argument memory depth k, e.g., from k=15 to k=3) functions as an effective intervention to attenuate divergence.
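
The non-additivity claim can be made concrete with a small 2x2 contrast on the HL-01 exponents reported in this summary. This is an illustrative calculation, not an analysis from the paper. It assumes that the value 0.0519 belongs to the Mixed + Roles cell, which the summary implies but does not label explicitly; the function name `role_effects` is hypothetical.

```python
# HL-01 exponents as reported in this summary.
# Assumption: 0.0519 is the Mixed + Roles cell; the summary does not label it.
LAMBDA_HAT = {
    ("Uniform", "NoRoles"): 0.0221,
    ("Uniform", "Roles"): 0.0541,
    ("Mixed", "NoRoles"): 0.0947,
    ("Mixed", "Roles"): 0.0519,
}

def role_effects(cells):
    """Effect of adding roles within each composition, plus the additive prediction."""
    base = cells[("Uniform", "NoRoles")]
    roles_uniform = cells[("Uniform", "Roles")] - base
    roles_mixed = cells[("Mixed", "Roles")] - cells[("Mixed", "NoRoles")]
    mixing = cells[("Mixed", "NoRoles")] - base
    # If the two routes simply added up, Mixed + Roles would equal this value.
    additive_prediction = base + roles_uniform + mixing
    # Interaction contrast: how much the role effect changes with composition.
    interaction = roles_mixed - roles_uniform
    return roles_uniform, roles_mixed, additive_prediction, interaction

if __name__ == "__main__":
    print(role_effects(LAMBDA_HAT))
```

Under that assumption, adding roles raises the exponent in the Uniform committee but lowers it in the Mixed committee, so the additive prediction for Mixed + Roles lands well above the reported value.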

📚 Prerequisite Knowledge

Prerequisites

Dynamical systems theory (Lyapunov exponents)
Large Language Models (sampling temperature, context windows)
Multi-agent systems (roles, voting)

Key Terms

Lyapunov exponent: A measure of chaos that quantifies the rate of separation of infinitesimally close trajectories; positive values indicate exponential divergence.

Simplex: A geometric space representing probability distributions where components sum to 1 (used here for preference states like p_A + p_B + p_C = 1).

T=0: Temperature zero; a sampling setting for LLMs that selects the most likely token deterministically, usually expected to produce identical outputs for identical inputs.

Ablation: Systematically removing a component (like a specific agent role) to measure its contribution to the overall system behavior.

Homogeneous committee: A committee where all agents use the same underlying LLM provider and model.

Heterogeneous committee: A committee composed of different LLM models/families (e.g., mixing GPT, Claude, Gemini).
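
For reference, the Lyapunov exponent listed among the key terms has the following standard textbook form for the largest exponent, where \delta(t) is the separation between two trajectories at time t and \delta_0 is their initial separation. This is the general definition and not necessarily the estimator used in the paper; reading t as the deliberation round index is an assumption.

```latex
\lambda \;=\; \lim_{t \to \infty} \, \lim_{\lVert \delta_0 \rVert \to 0} \,
\frac{1}{t} \ln \frac{\lVert \delta(t) \rVert}{\lVert \delta_0 \rVert},
\qquad\text{so that}\qquad
\lVert \delta(t) \rVert \;\approx\; \lVert \delta_0 \rVert \, e^{\lambda t}.
```

Taking logarithms gives \ln \lVert \delta(t) \rVert \approx \ln \lVert \delta_0 \rVert + \lambda t, which is why an empirical estimate can be read off as the slope of log-divergence against the round index over a finite number of rounds.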