
Improving Factuality and Reasoning in Language Models through Multiagent Debate

Yilun Du, Shuang Li, A. Torralba, J. Tenenbaum, Igor Mordatch
Massachusetts Institute of Technology Computer Science and Artificial Intelligence Laboratory, Google Brain
International Conference on Machine Learning (2023)
Agent · Factuality · Reasoning · Benchmark

📝 Paper Summary

Multi-agent collaboration · Reasoning enhancement · Hallucination mitigation
Multiple instances of a language model improve reasoning and factuality by proposing independent answers and iteratively critiquing each other to reach a consensus, mimicking a 'society of minds'.
Core Problem
Large language models often hallucinate facts and make invalid reasoning jumps because they generate text in a single pass without a mechanism to cross-examine or verify their own outputs against alternative viewpoints.
Why it matters:
  • Current models confidently state incorrect facts (hallucinations) or commit logical errors in complex chains of thought, limiting their reliability in high-stakes domains
  • Single-agent prompting techniques like Chain-of-Thought or Self-Reflection rely on a single model's internal consistency, which can still be flawed or get stuck in local optima
Concrete Example: Given the arithmetic problem '10+20*23+3-11*18', individual agents might initially answer 269 or 369. During debate, Agent 2 critiques Agent 1's calculation, both re-check the intermediate steps, and they converge on the correct answer, 275, which neither proposed initially.
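As a quick check on the target answer in this example, the expression can be evaluated step by step with standard operator precedence. This is only a worked calculation; the variable names are illustrative, and it does not attempt to reproduce how the agents arrived at 269 or 369.

```python
# Evaluate 10+20*23+3-11*18 step by step.
# Multiplication binds tighter than addition and subtraction,
# so both products are computed first.
first_product = 20 * 23    # 460
second_product = 11 * 18   # 198

# Then the remaining terms are combined left to right.
result = 10 + first_product + 3 - second_product

print(result)  # 275
```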
Key Novelty
Multi-agent Debate / Society of Minds
  • Instantiate multiple copies of an LLM as separate agents that first generate independent answers to a query
  • Execute iterative rounds where each agent reads the responses of all other agents from the previous round and updates its own answer based on their critiques and insights
  • Mimics a human group discussion in which differing initial views are debated until the group reaches a consensus, correcting individual errors along the way
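The procedure above can be sketched as a short loop. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Agent` callable interface, the prompt wording, the fixed number of rounds, and the majority vote used to pick a final answer are all hypothetical choices made here. `ToyAgent` is a stand-in that does no reasoning; it only changes its answer when every other agent agrees on a different one, which is enough to exercise the loop without calling a real model.

```python
from collections import Counter
from typing import Callable, List

# Hypothetical interface: an agent maps a prompt string to an answer string.
# A real setup would wrap a language model call here.
Agent = Callable[[str], str]

MARKER = "- Another agent answered: "


def build_debate_prompt(question: str, other_answers: List[str]) -> str:
    """Compose the prompt one agent sees in a debate round (assumed wording)."""
    listed = "\n".join(MARKER + answer for answer in other_answers)
    return (
        f"Question: {question}\n"
        f"{listed}\n"
        "Using these answers as additional information, give an updated answer."
    )


def run_debate(question: str, agents: List[Agent], rounds: int = 2) -> List[str]:
    """Return each agent's final answer after a fixed number of debate rounds."""
    # Initial round: every agent answers independently.
    answers = [agent(f"Question: {question}") for agent in agents]
    for _ in range(rounds):
        # Each agent reads the previous-round answers of all *other* agents.
        answers = [
            agent(build_debate_prompt(question, answers[:i] + answers[i + 1:]))
            for i, agent in enumerate(agents)
        ]
    return answers


def majority_answer(answers: List[str]) -> str:
    """Pick the most common final answer (assumed aggregation rule)."""
    return Counter(answers).most_common(1)[0][0]


class ToyAgent:
    """Stand-in for an LLM: keeps its guess unless all other agents agree on another."""

    def __init__(self, initial: str) -> None:
        self.answer = initial

    def __call__(self, prompt: str) -> str:
        seen = [
            line[len(MARKER):]
            for line in prompt.splitlines()
            if line.startswith(MARKER)
        ]
        if seen and len(set(seen)) == 1:
            self.answer = seen[0]
        return self.answer


agents = [ToyAgent("6"), ToyAgent("5"), ToyAgent("5")]
final = run_debate("What is 2 + 3?", agents, rounds=2)
print(final, majority_answer(final))
```

With a real model behind `Agent`, the number of agents and rounds become the main cost knobs, since each round issues one call per agent.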
Evaluation Highlights
  • +14.8 percentage points accuracy on Arithmetic tasks using debate compared to single-agent generation (67.0% -> 81.8%)
  • +8.0 percentage points accuracy on Grade School Math (GSM8K) over the single-agent baseline (77.0% -> 85.0%)
  • +31.5 gain in pawn score advantage on the Chess Move task compared to single-agent generation (91.4 -> 122.9), also outperforming standard reflection methods
Breakthrough Assessment
8/10
Simple yet highly effective method that requires no training, works with black-box models, and demonstrates that social dynamics (debate) can correct intrinsic model errors better than self-reflection alone.