Improving Factuality and Reasoning in Language Models through Multiagent Debate

📝 Paper Summary

Multi-agent collaboration · Reasoning enhancement · Hallucination mitigation

Multiple instances of a language model improve reasoning and factuality by proposing independent answers and iteratively critiquing each other to reach a consensus, mimicking a 'society of minds'.

Core Problem

Large language models often hallucinate facts and make invalid reasoning jumps because they generate text in a single pass without a mechanism to cross-examine or verify their own outputs against alternative viewpoints.

Why it matters:

Current models confidently state incorrect facts (hallucinations) or commit logical errors in complex chains of thought, limiting their reliability in high-stakes domains
Single-agent prompting techniques like Chain-of-Thought or Self-Reflection rely on a single model's internal consistency, which can still be flawed or get stuck in local optima

Concrete Example: When solving an arithmetic problem like '10+20*23+3-11*18', individual agents might output 269 or 369. Through debate, Agent 2 critiques Agent 1's calculation, they check intermediate steps, and finally converge on the correct answer (275), which neither initially proposed.
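The correct value in the example follows from standard operator precedence (multiplication before addition and subtraction). A minimal check, with variable names chosen here purely for illustration:

```python
# Evaluate 10+20*23+3-11*18 step by step.
# Multiplications are resolved first, then the additions and subtractions left to right.
first_product = 20 * 23    # 460
second_product = 11 * 18   # 198
result = 10 + first_product + 3 - second_product

print(result)  # 275
```

Neither 269 nor 369 matches this value, which is why the agents' initial answers both count as errors.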

Key Novelty

Multi-agent Debate / Society of Minds

Instantiate multiple copies of an LLM as separate agents that first generate independent answers to a query
Execute iterative rounds where each agent reads the responses of all other agents from the previous round and updates its own answer based on their critiques and insights
Mimics a human group discussion where diverse initial views are debated until a consensus (quorum) is reached, correcting individual errors

Evaluation Highlights

+14.8 percentage-point accuracy gain on Arithmetic tasks using debate compared to single-agent generation (67.0% -> 81.8%)
+8.0 percentage-point accuracy gain on Grade School Math (GSM8K) over the single-agent baseline (77.0% -> 85.0%)
+31.5 improvement in Chess Move pawn score advantage compared to single-agent (91.4 -> 122.9), significantly outperforming standard reflection methods

Breakthrough Assessment

8/10

Simple yet highly effective method that requires no training, works with black-box models, and demonstrates that social dynamics (debate) can correct intrinsic model errors better than self-reflection alone.

⚙️ Technical Details

Problem Definition

Setting: Zero-shot generation of answers to reasoning and factual queries using black-box access to LLMs

Inputs: Natural language query q (math problem, chess state, or biography request)

Outputs: Final consensus answer a_final derived after T rounds of debate among N agents

Pipeline Flow

Initialization: N agent instances generate independent answers to Query
Debate Round 1: Agents receive concatenated answers from all other agents
Update: Agents critique peer answers and generate revised responses
Repeat: Debate continues for T rounds until consensus or limit reached
Final Output: The converged answer provided by the agents
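The pipeline above can be sketched as a short orchestration loop. This is a minimal sketch, not the authors' implementation: the `Agent` signature, `run_debate`, and `make_stub_agent` are hypothetical names, and the stub agent uses a toy majority rule in place of a real LLM call so the example runs offline.

```python
from collections import Counter
from typing import Callable, List

# Hypothetical agent signature: (query, peer_answers) -> answer.
Agent = Callable[[str, List[str]], str]

def run_debate(query: str, agents: List[Agent], max_rounds: int) -> List[str]:
    """Return each agent's answer after at most max_rounds debate rounds."""
    # Initialization: every agent answers independently, with no peer context.
    answers = [agent(query, []) for agent in agents]
    for _ in range(max_rounds):
        if len(set(answers)) == 1:  # all agents agree, stop early
            break
        # Each agent receives the previous-round answers of all *other* agents.
        answers = [
            agent(query, answers[:i] + answers[i + 1:])
            for i, agent in enumerate(agents)
        ]
    return answers

def make_stub_agent(initial: str) -> Agent:
    """Toy stand-in for an LLM: adopts the most common answer it has seen."""
    state = {"answer": initial}
    def agent(query: str, peers: List[str]) -> str:
        if peers:
            votes = Counter(peers + [state["answer"]])
            state["answer"] = votes.most_common(1)[0][0]
        return state["answer"]
    return agent

agents = [make_stub_agent(a) for a in ["269", "275", "275"]]
final_answers = run_debate("10+20*23+3-11*18", agents, max_rounds=2)
print(final_answers)
```

In a real setup, the body of `agent` would call the model with a prompt that embeds the peer answers. Unlike this stub, a real model can produce an answer that no agent proposed initially.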

System Modules

Agent Instances

Generate initial solutions and critique peer solutions to refine own answer

Model or implementation: ChatGPT (gpt-3.5-turbo-0301) or Bard

Debate Orchestrator

Manage rounds, concatenate peer responses, and prompt agents for updates

Model or implementation: Deterministic script

Novel Architectural Elements

Iterative multi-round context injection where peer outputs become inputs for the next generation step
Application of 'consensus prompts' that explicitly instruct models to treat other instances' outputs as advice to be critiqued
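To illustrate the context-injection step, the sketch below builds a follow-up prompt from peer answers. The function name `build_debate_prompt` and the prompt wording are assumptions made for this example; the paper's actual prompts are listed in its Appendix and are not reproduced here.

```python
from typing import List

def build_debate_prompt(query: str, peer_answers: List[str]) -> str:
    """Concatenate peer answers into a follow-up prompt (wording is illustrative)."""
    peer_block = "\n\n".join(
        f"Agent {i + 1} response: {answer}"
        for i, answer in enumerate(peer_answers)
    )
    return (
        f"Question: {query}\n\n"
        f"Responses from other agents:\n{peer_block}\n\n"
        "Treat these responses as advice, not ground truth. "
        "Check their reasoning, then give your updated answer."
    )

prompt = build_debate_prompt("What is 10+20*23+3-11*18?", ["269", "369"])
print(prompt)
```

Framing peer output as advice to be checked, rather than as fact, is the part of the prompt that controls how readily an agent changes its answer.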

Modeling

Base Model: ChatGPT (gpt-3.5-turbo-0301) for main experiments; Google Bard used in a comparative ablation

Training Method: Inference-time prompting only (no weight updates)

Adaptation: None (Prompting only)

Trainable Parameters: 0 (Black-box access)

Compute: Inference cost scales linearly with N (agents) * T (rounds). Main experiments use 3 agents and 2 rounds.
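A rough cost estimate for the main configuration can be sketched as follows. Two assumptions are made here that the summary does not state explicitly: T counts every generation pass including the initial one, and the first pass carries no peer context. The function names are hypothetical.

```python
def count_generations(num_agents: int, num_rounds: int) -> int:
    """Total model calls, assuming every agent generates once per round."""
    return num_agents * num_rounds

def count_peer_answers_read(num_agents: int, num_rounds: int) -> int:
    """Peer answers placed into prompts, assuming round 1 has no peer context."""
    return num_agents * (num_agents - 1) * (num_rounds - 1)

calls = count_generations(3, 2)
peer_reads = count_peer_answers_read(3, 2)
print(calls, peer_reads)  # 6 6
```

Under these assumptions the number of calls grows with N * T, while the amount of peer text each prompt must hold grows with N - 1, which is where context-length pressure comes from.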

Comparison to Prior Work

vs. Self-Consistency: Debate allows agents to *change* their minds based on peer reasoning, rather than just voting on fixed initial outputs
vs. Self-Reflection: Debate provides external stimuli (peer answers) which breaks the model out of its own incorrect internal state/hallucination
vs. MAD (Multi-Agent Debate) [concurrent work, not cited]: Similar concept, but this paper specifically analyzes the trade-off between debate length and prompt 'stubbornness'
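The contrast with Self-Consistency can be made concrete with a minimal majority-vote sketch. The helper `majority_vote` and the sample list are illustrative assumptions; the sampled values reuse the wrong answers from the arithmetic example earlier in this summary.

```python
from collections import Counter
from typing import List

def majority_vote(answers: List[str]) -> str:
    """Self-consistency style aggregation: return the most frequent sampled answer."""
    return Counter(answers).most_common(1)[0][0]

# Fixed, independently sampled answers; none of them is the correct value 275.
samples = ["269", "369", "269"]
voted = majority_vote(samples)
print(voted)  # 269
```

Voting can only select an answer that was already sampled, so it returns 269 here. Debate lets agents revise their answers between rounds, which is how a value that no agent initially proposed can still emerge.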

Limitations

Computational cost increases linearly with the number of agents and debate rounds
Context length limits can be reached quickly if peer responses are long (mitigated by summarization)
Consensus does not guarantee correctness; models can converge on a shared incorrect hallucination if the bias is strong
Evaluation relies on a specific model version (gpt-3.5-turbo-0301), and hosted model behavior changes over time

Reproducibility

Code: https://composable-models.github.io/llm_debate/

The code is publicly available at https://composable-models.github.io/llm_debate/. Prompts for all tasks (Arithmetic, GSM8K, Chess, Biographies) are listed in the Appendix. Evaluation relied on gpt-3.5-turbo-0301, which is a black-box API.

📊 Experiments & Results

Evaluation Setup

Zero-shot prompting on reasoning and factuality benchmarks

Benchmarks:

Arithmetic: mathematical reasoning (6-term expressions)
GSM8K: grade school math word problems
Chess Move Prediction: strategic reasoning (predict next move from PGN)
Biographies: factual generation (computer scientists) [New]
MMLU: general knowledge (multiple choice)
Chess Move Validity: constraint satisfaction (legal moves)

Metrics:

Accuracy (%)
Stockfish Pawn Score (Chess)
Fact Recall (Biographies)
Statistical methodology: Standard deviation reported across runs (e.g., ±4.7)

Key Results

Reasoning tasks: Debate consistently outperforms single-agent and reflection baselines on math and strategy.
Benchmark	Metric	Single-agent baseline	Debate (this paper)	Δ
Arithmetic	Accuracy (%)	67.0	81.8	+14.8
GSM8K	Accuracy (%)	77.0	85.0	+8.0
Chess Move Prediction	Pawn Score Advantage (∆PS)	91.4	122.9	+31.5

Factuality tasks: Debate significantly reduces hallucinations compared to single agents.
Benchmark	Metric	Single-agent baseline	Debate (this paper)	Δ
Biographies	Fact Accuracy (%)	66.0	73.8	+7.8
MMLU	Accuracy (%)	63.9	71.1	+7.2
Chess Move Validity	Validity (%)	29.3	45.2	+15.9
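
The deltas in the results table are absolute differences between the debate score and the single-agent score. A small sketch that recomputes them from the tabulated values (the `rows` structure and variable names are illustrative):

```python
# (benchmark, single-agent score, debate score), copied from the results table.
rows = [
    ("Arithmetic", 67.0, 81.8),
    ("GSM8K", 77.0, 85.0),
    ("Chess Move Prediction", 91.4, 122.9),
    ("Biographies", 66.0, 73.8),
    ("MMLU", 63.9, 71.1),
    ("Chess Move Validity", 29.3, 45.2),
]

# Round to one decimal to match the table and absorb floating-point noise.
deltas = {name: round(debate - single, 1) for name, single, debate in rows}

for name, delta in deltas.items():
    print(f"{name}: +{delta}")
```

For the accuracy rows these are percentage-point gains, not relative percentage increases.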

Experiment Figures

Performance trends on Arithmetic task as a function of (a) Number of Agents and (b) Debate Rounds.

Qualitative example of uncertainty in Biographies. Agents give conflicting birthplaces for a scientist (Spain vs Cuba).

Main Takeaways

Multi-agent debate improves performance across all tested reasoning and factuality tasks, outperforming both single-agent generation and self-reflection.
Diversity in initial responses is key; even when all agents start with incorrect answers, the debate process allows them to critique flaws and converge on a correct solution.
Accuracy generally rises as more agents and more debate rounds are added, though gains diminish after about 4 rounds.
Factuality improves because agents are 'uncertain' about hallucinations in different ways; debate filters out facts on which agents disagree while preserving facts they state consistently.
The method is orthogonal to Chain-of-Thought; combining Debate with CoT yields even higher performance.

📚 Prerequisite Knowledge

Prerequisites

Prompt engineering (Chain of Thought)
Basic understanding of Large Language Models (LLMs)
Concept of multi-agent systems or ensemble methods

Key Terms

Society of Mind: A theory by Marvin Minsky proposing that intelligence emerges from the interaction of many simple, non-intelligent agents; here applied to LLM instances

Hallucination: A phenomenon where an LLM generates text that is factually incorrect or nonsensical but appears plausible

Chain-of-Thought: A prompting technique that encourages the model to generate intermediate reasoning steps before the final answer

Self-Reflection: A technique where a single model generates an output and then is prompted to critique and refine its own output

MMLU: Massive Multitask Language Understanding—a benchmark measuring knowledge across 57 subjects like math, history, and law

GSM8K: Grade School Math 8K—a dataset of 8.5k high quality linguistically diverse grade school math word problems

PGN notation: Portable Game Notation—a standard plain text format for recording chess games

Stockfish: A strong open-source chess engine used here to evaluate the quality (pawn score advantage) of predicted moves