
Corex: Pushing the Boundaries of Complex Reasoning through Multi-Model Collaboration

Qiushi Sun, Zhangyue Yin, Xiang Li, Zhiyong Wu, Xipeng Qiu, Lingpeng Kong
Shanghai AI Laboratory, The University of Hong Kong, Fudan University, East China Normal University
arXiv (2023)
Agent, Reasoning, QA

📝 Paper Summary

Multi-agent collaboration, LLM reasoning strategies
Corex improves LLM reasoning by treating models as collaborative agents that discuss, review, and retrieve solutions, mimicking human social problem-solving rather than relying on single-model generation.
Core Problem
Single LLMs often hallucinate or make cumulative errors in complex reasoning tasks, and self-correction mechanisms struggle to fix these mistakes without external feedback.
Why it matters:
  • Current reasoning strategies such as Chain-of-Thought (CoT) keep the model inside a static 'black box' that relies solely on its internal representations, which makes the resulting answers unreliable.
  • Majority voting methods (like Self-Consistency) are computationally expensive and can be overwhelmed if incorrect answers dominate the sample pool.
  • LLMs frequently fail to correct their own reasoning chains, so initial errors often propagate to the final answer.
Concrete Example: In a math word problem about bee populations, an LLM might correctly set up the equation '7x = 700' but then hallucinate the calculation 'x = 90' instead of 100. A single agent often fails to catch this arithmetic error, leading to a wrong final answer (360 instead of 400).
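The arithmetic in this example can be checked mechanically. The sketch below is only an illustration of what an external check adds and is not taken from the paper: `review_division` is a hypothetical helper, and the multiplier 4 is inferred from the summary's own numbers (360 = 4 × 90 and 400 = 4 × 100), since the full problem statement is not given here.

```python
from fractions import Fraction


def review_division(coefficient, total, claimed_x):
    """Hypothetical reviewer step: recompute x from `coefficient * x = total`
    and report whether the claimed value matches."""
    expected_x = Fraction(total, coefficient)
    return claimed_x == expected_x, expected_x


# The single agent sets up 7x = 700 correctly but then claims x = 90.
is_correct, corrected_x = review_division(7, 700, 90)

# Assumed final step (inferred from 360 vs 400 in the summary): answer = 4 * x.
wrong_answer = 4 * 90
final_answer = 4 * corrected_x

print(is_correct, corrected_x, wrong_answer, final_answer)
```

The point of the sketch is that the error is trivial to detect once a second party recomputes the step, which is the kind of feedback a lone agent does not receive.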
Key Novelty
Corex (Collaboration of Reasoning Experts)
  • Discuss Mode: Agents are split into 'Blue' and 'Green' teams to debate answers; a Judge agent evaluates their reasoning to find a consensus or select the best argument.
  • Review Mode: One agent generates a solution (text or code), and other agents sequentially scrutinize and refine it, similar to code review in software engineering.
  • Retrieve Mode: Instead of majority voting on answers alone, a Retriever agent scores the faithfulness between reasoning chains and their conclusions to select the most reliable pair.
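To make the contrast between Retrieve Mode and majority voting concrete, here is a minimal sketch. It is an assumption-laden illustration, not the paper's implementation: the `Candidate` class, the `retrieve` function, and the hard-coded `faithfulness` values are all hypothetical stand-ins for scores that a Retriever agent would produce.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Candidate:
    chain: str           # reasoning chain produced by one agent
    answer: str          # conclusion that agent reached
    faithfulness: float  # hypothetical Retriever score in [0, 1]


def majority_vote(candidates):
    """Self-Consistency style selection: the most frequent answer wins."""
    return Counter(c.answer for c in candidates).most_common(1)[0][0]


def retrieve(candidates):
    """Retrieve-style selection: pick the chain/answer pair scored as most
    faithful, regardless of how many agents agree."""
    return max(candidates, key=lambda c: c.faithfulness)


# Made-up pool in which the incorrect answer dominates the samples.
pool = [
    Candidate("7x = 700, so x = 90; 4x = 360", "360", 0.2),
    Candidate("7x = 700, so x = 90; 4x = 360", "360", 0.2),
    Candidate("7x = 700, so x = 100; 4x = 400", "400", 0.9),
]

voted = majority_vote(pool)
selected = retrieve(pool)
print(voted, selected.answer)
```

In this toy pool, voting follows the two agreeing but wrong chains, while scoring each chain against its own conclusion can still surface the single correct one. How well this works in practice depends entirely on the quality of the scores, which are fixed by hand here.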
Evaluation Highlights
  • Outperforms standard Chain-of-Thought (CoT) by +13.6% accuracy on the GSM-Hard mathematical reasoning benchmark (Review-NL mode).
  • Achieves 91.1% accuracy on Big-Bench Symbolic Reasoning (Review-Code mode), surpassing CoT-SC (CoT with Self-Consistency), which scored 80.5%.
  • Cuts inference cost: matches CoT-SC performance using only 5-10% of the tokens that majority-voting methods typically consume.
Breakthrough Assessment
7/10
Provides a robust framework for multi-agent reasoning that consistently beats strong baselines like CoT-SC and PAL. While the individual components (debate, review) are known concepts, their integration and cost-effectiveness analysis are strong contributions.