Corex: Pushing the Boundaries of Complex Reasoning through Multi-Model Collaboration

📝 Paper Summary

Multi-agent collaboration, LLM reasoning strategies

Corex improves LLM reasoning by treating models as collaborative agents that discuss, review, and retrieve solutions, mimicking human social problem-solving rather than relying on single-model generation.

Core Problem

Single LLMs often hallucinate or make cumulative errors in complex reasoning tasks, and self-correction mechanisms struggle to fix these mistakes without external feedback.

Why it matters:

Current reasoning strategies like Chain-of-Thought (CoT) are confined to a static 'black box' relying solely on internal representations, leading to unreliable answers.
Majority voting methods (like Self-Consistency) are computationally expensive and can be overwhelmed if incorrect answers dominate the sample pool.
LLMs frequently struggle to self-correct their own reasoning chains, meaning initial errors often propagate to the final answer.

Concrete Example: In a math word problem about bee populations, an LLM might correctly set up the equation '7x = 700' but then hallucinate the calculation 'x = 90' instead of 100. A single agent often fails to catch this arithmetic error, leading to a wrong final answer (360 instead of 400).
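The slip in this example can be caught by a mechanical check: substitute the claimed value back into the equation. The sketch below is illustrative only. The equation 7x = 700 and the answers 360 and 400 come from the example; the factor of 4 linking x to the final answer is inferred from those numbers (360 = 4 * 90, 400 = 4 * 100) and is an assumption, since the full problem text is not given here.

```python
# Equation from the example: 7x = 700.
coefficient, total = 7, 700

claimed_x = 90                      # value the model hallucinated
correct_x = total // coefficient    # exact integer division

# Substituting the claimed value back into the equation exposes the slip.
claimed_is_consistent = coefficient * claimed_x == total

# Assumption: the final answer is 4 * x, inferred from 360 = 4 * 90 and 400 = 4 * 100.
FINAL_FACTOR = 4
wrong_answer = FINAL_FACTOR * claimed_x
right_answer = FINAL_FACTOR * correct_x

print(correct_x, claimed_is_consistent, wrong_answer, right_answer)
```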

Key Novelty

Corex (Collaboration of Reasoning Experts)

Discuss Mode: Agents are split into 'Blue' and 'Green' teams to debate answers; a Judge agent evaluates their reasoning to find a consensus or select the best argument.
Review Mode: One agent generates a solution (text or code), and other agents sequentially scrutinize and refine it, similar to code review in software engineering.
Retrieve Mode: Instead of majority voting on answers alone, a Retriever agent scores the faithfulness between reasoning chains and their conclusions to select the most reliable pair.

Evaluation Highlights

Outperforms standard Chain-of-Thought (CoT) by +13.6% accuracy on the GSM-Hard mathematical reasoning benchmark (Review-NL mode).
Achieves 91.1% accuracy on Big-Bench Symbolic Reasoning (Review-Code mode), surpassing CoT-SC (Self-Consistency) which scored 80.5%.
Cuts inference cost: matches CoT-SC performance while using only 5-10% of the tokens that majority-voting methods typically consume.

Breakthrough Assessment

7/10

Provides a robust framework for multi-agent reasoning that consistently beats strong baselines like CoT-SC and PAL. While the individual components (debate, review) are known concepts, their integration and cost-effectiveness analysis are strong contributions.

⚙️ Technical Details

Problem Definition

Setting: Multi-agent collaborative reasoning for diverse tasks (math, symbolic, commonsense)

Inputs: Natural language query q

Outputs: Final reasoning chain c and prediction p (or executed code output)

Pipeline Flow

Mode Selection: Choose collaboration paradigm (Discuss, Review, or Retrieve)
Agent Assignment: Assign roles (e.g., Solver, Reviewer, Judge, Retriever) to LLMs
Execution: Agents generate, critique, or rank solutions
Final Selection: Judge or Retriever determines the final output
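The four steps above can be sketched as a small dispatcher. This is a sketch under stated assumptions, not the released implementation: `ROLES_BY_MODE`, `ModeHandler`, `run_pipeline`, and `stub_handler` are hypothetical names, and the pairing of roles with modes is a guess based on the role names listed above.

```python
from typing import Callable, Dict, List

# Roles are named in the summary; which roles each mode uses is an assumption.
ROLES_BY_MODE: Dict[str, List[str]] = {
    "discuss": ["Solver", "Judge"],
    "review": ["Solver", "Reviewer"],
    "retrieve": ["Solver", "Retriever"],
}

# Hypothetical handler type: takes the query and assigned roles, returns the final output.
ModeHandler = Callable[[str, List[str]], str]


def run_pipeline(query: str, mode: str, handlers: Dict[str, ModeHandler]) -> str:
    # Step 1: mode selection.
    if mode not in ROLES_BY_MODE:
        raise ValueError(f"unknown mode {mode!r}; expected one of {sorted(ROLES_BY_MODE)}")
    # Step 2: agent assignment.
    roles = ROLES_BY_MODE[mode]
    # Steps 3 and 4: execution and final selection are delegated to the mode's handler.
    return handlers[mode](query, roles)


def stub_handler(query: str, roles: List[str]) -> str:
    # Placeholder for real agent calls; it only reports which role closes the run.
    return f"{roles[-1]} produced the final output for: {query}"


handlers = {mode: stub_handler for mode in ROLES_BY_MODE}
output = run_pipeline("What is 12 * 12?", "retrieve", handlers)
print(output)
```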

System Modules

Discuss Agents (Discussion)

Generate initial reasoning/answers and iteratively refine them based on group interaction

Model or implementation: GPT-3.5-Turbo (default)

Judge Agent (Discussion)

Evaluate the reasoning quality of conflicting teams and decide the final answer

Model or implementation: GPT-3.5-Turbo (or GPT-4/Claude in analysis)
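A minimal sketch of the Discuss flow, with plain Python callables standing in for LLM agents. The number of rounds, the in-team majority rule, and the rule that the judge is consulted only when the teams disagree are assumptions made for illustration; all names (`Opinion`, `team_discussion`, `discuss`, the stub agents) are hypothetical.

```python
from collections import Counter
from typing import Callable, List, Tuple

Opinion = Tuple[str, str]  # (reasoning chain, answer)
Debater = Callable[[str, List[Opinion]], Opinion]
Judge = Callable[[str, Opinion, Opinion], Opinion]


def team_discussion(query: str, team: List[Debater], rounds: int = 2) -> Opinion:
    """Let a team refine its answers for a few rounds, then return its most common one."""
    opinions: List[Opinion] = []
    for _ in range(rounds):
        # Each agent sees the opinions produced in the previous round.
        opinions = [agent(query, opinions) for agent in team]
        if len({answer for _, answer in opinions}) == 1:
            break  # the team already agrees
    top_answer = Counter(answer for _, answer in opinions).most_common(1)[0][0]
    return next(op for op in opinions if op[1] == top_answer)


def discuss(query: str, blue: List[Debater], green: List[Debater], judge: Judge) -> Opinion:
    blue_opinion = team_discussion(query, blue)
    green_opinion = team_discussion(query, green)
    if blue_opinion[1] == green_opinion[1]:
        return blue_opinion  # consensus, no judge needed
    return judge(query, blue_opinion, green_opinion)


# Stub agents; a real system would prompt an LLM in each of these.
def careful(query: str, peers: List[Opinion]) -> Opinion:
    return ("12 * 12 = 144", "144")


def sloppy(query: str, peers: List[Opinion]) -> Opinion:
    return ("12 * 12 = 124", "124")


def checking_judge(query: str, blue: Opinion, green: Opinion) -> Opinion:
    # Toy stand-in for an LLM judge: it recomputes the product itself.
    return blue if blue[1] == str(12 * 12) else green


final = discuss("What is 12 * 12?", [careful, careful], [sloppy, sloppy], checking_judge)
print(final)
```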

Primary Agent (Review)

Generate the initial solution (reasoning chain or code)

Model or implementation: GPT-3.5-Turbo

Reviewer Agents (Review)

Scrutinize and modify the previous agent's solution

Model or implementation: GPT-3.5-Turbo
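The Review flow is a sequential chain: each reviewer receives the previous solution and returns a possibly revised one. The sketch below assumes text solutions and uses stub functions in place of LLM calls; `review`, `primary`, `arithmetic_reviewer`, and `approving_reviewer` are hypothetical names.

```python
from typing import Callable, List, Tuple

Solver = Callable[[str], str]
Reviewer = Callable[[str, str], str]  # (query, current solution) -> revised solution


def review(query: str, primary: Solver, reviewers: List[Reviewer]) -> Tuple[str, List[str]]:
    """Pass the primary agent's solution through each reviewer in order."""
    solution = primary(query)
    history = [solution]
    for reviewer in reviewers:
        solution = reviewer(query, solution)
        history.append(solution)
    return solution, history


# Stub agents; a real system would prompt an LLM in each of these.
def primary(query: str) -> str:
    return "12 * 12 = 124"  # initial solution containing an arithmetic slip


def arithmetic_reviewer(query: str, solution: str) -> str:
    return solution.replace("124", "144")  # corrects the slip


def approving_reviewer(query: str, solution: str) -> str:
    return solution  # finds nothing to change


final_solution, history = review(
    "What is 12 * 12?", primary, [arithmetic_reviewer, approving_reviewer]
)
print(final_solution)
print(history)
```

Keeping the full history is a design choice for this sketch: it makes it easy to see which reviewer changed what.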

Retriever Agent

Score candidate reasoning chains based on faithfulness to the prediction

Model or implementation: GPT-3.5-Turbo
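A minimal sketch of Retrieve-style selection: score every (reasoning chain, prediction) pair and keep the highest-scoring one. In Corex the score comes from an LLM retriever; the `toy_faithfulness` function here is a crude string check used purely as a placeholder, and all names are hypothetical.

```python
from typing import Callable, List, Tuple

Candidate = Tuple[str, str]  # (reasoning chain, prediction)
Scorer = Callable[[str, str, str], float]  # (query, chain, prediction) -> score


def retrieve(query: str, candidates: List[Candidate], score: Scorer) -> Candidate:
    """Return the candidate whose chain best supports its own prediction."""
    scores = [score(query, chain, prediction) for chain, prediction in candidates]
    best_index = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best_index]


def toy_faithfulness(query: str, chain: str, prediction: str) -> float:
    # Placeholder heuristic: does the chain end with the stated prediction?
    return 1.0 if chain.endswith(prediction) else 0.0


candidates = [
    ("12 * 12 = 144", "124"),  # chain and prediction disagree
    ("12 * 12 = 144", "124"),  # same mismatch, so "124" holds the majority
    ("12 * 12 = 144", "144"),  # chain supports the prediction
]
best = retrieve("What is 12 * 12?", candidates, toy_faithfulness)
print(best)
```

In this toy pool a majority vote over predictions would pick "124", while the faithfulness score picks the one pair whose chain agrees with its conclusion.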

Novel Architectural Elements

Sequential multi-agent code review pipeline where valid code is refined iteratively rather than just regenerated
Faithfulness-based retrieval mechanism that ranks reasoning chains by internal consistency rather than just answer frequency

Modeling

Base Model: GPT-3.5-Turbo-0613 (Main experiments)

Compute: Inference-only (no training), with at most 5 agents per task.

Reproducibility

Code: https://github.com/QiushiSun/Corex

📚 Prerequisite Knowledge

Prerequisites

Large Language Models (LLMs) and prompting
Chain-of-Thought (CoT) reasoning
Program-aided reasoning: Program-Aided Language models (PAL) and Program-of-Thoughts (PoT)
Basic concepts of multi-agent systems

Key Terms

CoT: Chain-of-Thought, a prompting technique in which the model generates intermediate reasoning steps before the final answer

PAL: Program-Aided Language models, a method in which the LLM generates code to solve reasoning problems, offloading computation to an interpreter
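A minimal sketch of the program-aided idea: the model writes a short program and the interpreter, not the model, does the arithmetic. The `generated_code` string is a hand-written stand-in for model output, and `run_generated_program` is a hypothetical helper. Emptying `__builtins__` only limits what the snippet can call; it is not a security sandbox.

```python
def run_generated_program(code: str) -> object:
    """Execute model-written code and return the value it binds to `answer`."""
    namespace: dict = {}
    # Assumption of this sketch: the program stores its result in a variable named `answer`.
    exec(code, {"__builtins__": {}}, namespace)
    return namespace["answer"]


# Hand-written stand-in for what a model might generate for the equation 7x = 700.
generated_code = """
total = 700
coefficient = 7
x = total // coefficient
answer = x
"""

result = run_generated_program(generated_code)
print(result)
```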

Self-Consistency (CoT-SC): A decoding strategy that samples multiple reasoning paths and selects the most frequent answer (majority voting)
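Majority voting over sampled answers can be sketched in a few lines. The sample pool below is made up for illustration; it shows the failure case noted earlier, where an incorrect answer wins because it dominates the samples.

```python
from collections import Counter
from typing import List


def self_consistency(answers: List[str]) -> str:
    """Return the most frequent answer among the sampled reasoning paths."""
    return Counter(answers).most_common(1)[0][0]


# Hypothetical final answers from five sampled reasoning paths.
samples = ["360", "400", "360", "360", "400"]
voted = self_consistency(samples)
print(voted)
```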

Hallucination: When an LLM generates plausible-sounding but factually incorrect or nonsensical information

Faithfulness: The degree to which the generated reasoning explanation accurately reflects the process used to derive the answer