Problem Definition
Setting: Multi-agent collaborative reasoning for diverse tasks (math, symbolic, commonsense)
Inputs: Natural language query q
Outputs: Final reasoning chain c and prediction p (or executed code output)
Pipeline Flow
- Mode Selection: Choose collaboration paradigm (Discuss, Review, or Retrieve)
- Agent Assignment: Assign roles (e.g., Solver, Reviewer, Judge, Retriever) to LLMs
- Execution: Agents generate, critique, or rank solutions
- Final Selection: Judge or Retriever determines the final output
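The four steps above can be sketched as a simple dispatcher. This is a minimal illustration, not the authors' implementation: `run_pipeline`, `ModeHandler`, and the placeholder handlers are hypothetical names, and each handler stands in for a group of LLM agents that would produce the reasoning chain c and prediction p.

```python
from typing import Callable, Dict, Tuple

# Hypothetical type: a mode handler maps a query to (reasoning chain, prediction).
ModeHandler = Callable[[str], Tuple[str, str]]


def run_pipeline(query: str, mode: str, handlers: Dict[str, ModeHandler]) -> Tuple[str, str]:
    """Route the query to the chosen collaboration mode and return (chain, prediction)."""
    if mode not in handlers:
        raise ValueError(f"unknown mode: {mode!r}")
    return handlers[mode](query)


# Placeholder handlers (assumption); real ones would assign roles and call LLM agents.
HANDLERS: Dict[str, ModeHandler] = {
    "discuss": lambda q: (f"[discuss] reasoning for: {q}", "answer-A"),
    "review": lambda q: (f"[review] reasoning for: {q}", "answer-B"),
    "retrieve": lambda q: (f"[retrieve] reasoning for: {q}", "answer-C"),
}

chain, prediction = run_pipeline("What is 2 + 2?", "review", HANDLERS)
```

Keeping the mode as a plain key makes it easy to swap one paradigm for another without touching the agent code.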
System Modules
Discuss Agents (Discussion)
Generate initial reasoning/answers and iteratively refine them based on group interaction
Model or implementation: GPT-3.5-Turbo (default)
Judge Agent (Discussion)
Evaluate the reasoning quality of conflicting teams and decide the final answer
Model or implementation: GPT-3.5-Turbo by default; GPT-4 or Claude is substituted as the judge in the analysis experiments
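One way the Discuss agents and the Judge could fit together is sketched below. Several details are assumptions made for illustration: the split into exactly two teams, the fixed number of rounds, the within-team majority rule, and all names (`team_discuss`, `discuss`, `constant_agent`, `longest_chain_judge`). Agents are plain callables standing in for LLM calls.

```python
from typing import Callable, List, Tuple

Answer = Tuple[str, str]                        # (reasoning chain, prediction)
Agent = Callable[[str, List[Answer]], Answer]   # sees the query and peers' latest answers
Judge = Callable[[str, Answer, Answer], Answer]


def team_discuss(query: str, team: List[Agent], rounds: int) -> Answer:
    """Each agent answers, then refines its answer after seeing its teammates' answers."""
    answers = [agent(query, []) for agent in team]
    for _ in range(rounds):
        answers = [
            agent(query, answers[:i] + answers[i + 1:])
            for i, agent in enumerate(team)
        ]
    # Team answer (assumption): most common prediction, ties go to the earliest agent.
    preds = [pred for _, pred in answers]
    winner = max(preds, key=preds.count)
    return next(ans for ans in answers if ans[1] == winner)


def discuss(query: str, team_a: List[Agent], team_b: List[Agent],
            judge: Judge, rounds: int = 2) -> Answer:
    """Return the shared answer if the teams agree, otherwise defer to the judge."""
    a = team_discuss(query, team_a, rounds)
    b = team_discuss(query, team_b, rounds)
    return a if a[1] == b[1] else judge(query, a, b)


# Toy stand-ins for LLM agents and the judge (hypothetical).
def constant_agent(chain: str, pred: str) -> Agent:
    return lambda query, peers: (chain, pred)


def longest_chain_judge(query: str, a: Answer, b: Answer) -> Answer:
    return a if len(a[0]) >= len(b[0]) else b


team_a = [constant_agent("short", "7"), constant_agent("short", "7")]
team_b = [constant_agent("a longer chain", "9"), constant_agent("a longer chain", "9")]
final = discuss("toy query", team_a, team_b, longest_chain_judge)
```

The judge is only consulted when the two teams disagree, which matches its role of resolving conflicts rather than re-solving the task.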
Primary Agent (Review)
Generate the initial solution (reasoning chain or code)
Model or implementation: GPT-3.5-Turbo
Reviewer Agents (Review)
Scrutinize and modify the previous agent's solution
Model or implementation: GPT-3.5-Turbo
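The Review mode can be read as a sequential hand-off: the primary agent drafts a solution and each reviewer edits the previous version. The sketch below assumes that reading; `review`, `buggy_primary`, `fix_operator`, and `approve` are hypothetical names, and the string edits stand in for LLM-generated revisions.

```python
from typing import Callable, List, Tuple

Primary = Callable[[str], str]          # query -> initial solution
Reviewer = Callable[[str, str], str]    # (query, current solution) -> revised solution


def review(query: str, primary: Primary, reviewers: List[Reviewer]) -> Tuple[str, List[str]]:
    """Pass the solution through the reviewers in order, keeping every intermediate version."""
    solution = primary(query)
    history = [solution]
    for reviewer in reviewers:
        solution = reviewer(query, solution)
        history.append(solution)
    return solution, history


# Toy stand-ins for LLM agents (hypothetical).
def buggy_primary(query: str) -> str:
    return "def add(a, b): return a - b"


def fix_operator(query: str, solution: str) -> str:
    return solution.replace("a - b", "a + b")


def approve(query: str, solution: str) -> str:
    return solution  # this reviewer finds nothing to change


final_code, versions = review("Write add(a, b).", buggy_primary, [fix_operator, approve])
```

Because each reviewer receives the current solution rather than only the query, correct parts of the code survive from one step to the next.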
Retriever Agent
Score each candidate reasoning chain by how faithfully it supports its own prediction, then select the highest-scoring candidate
Model or implementation: GPT-3.5-Turbo
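A minimal sketch of the retrieval step, assuming the Retriever assigns one numeric score per (chain, prediction) pair and the top-scoring pair is returned. The names `retrieve` and `toy_scorer` are hypothetical, and the toy scorer is a crude placeholder for an LLM-based faithfulness judgment.

```python
from typing import Callable, List, Tuple

Candidate = Tuple[str, str]                   # (reasoning chain, prediction)
Scorer = Callable[[str, str, str], float]     # (query, chain, prediction) -> score


def retrieve(query: str, candidates: List[Candidate], scorer: Scorer) -> Candidate:
    """Return the candidate with the highest score; ties go to the earliest candidate."""
    if not candidates:
        raise ValueError("no candidates to score")
    scored = [
        (scorer(query, chain, pred), -index)
        for index, (chain, pred) in enumerate(candidates)
    ]
    _, neg_index = max(scored)
    return candidates[-neg_index]


def toy_scorer(query: str, chain: str, prediction: str) -> float:
    # Toy stand-in (assumption): a chain is "faithful" if its last token is the prediction.
    tokens = chain.split()
    return 1.0 if tokens and tokens[-1] == prediction else 0.0


candidates = [("3 + 4 = 7", "8"), ("3 + 4 = 7", "7")]
best = retrieve("What is 3 + 4?", candidates, toy_scorer)
```

Here the first candidate is rejected because its chain concludes 7 while its prediction says 8.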
Novel Architectural Elements
- Sequential multi-agent code review pipeline in which each reviewer refines the existing code rather than regenerating it from scratch
- Faithfulness-based retrieval mechanism that ranks reasoning chains by internal consistency rather than just answer frequency
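To illustrate why ranking by a per-chain score can differ from counting answers, the sketch below compares the two selection rules on the same candidates. The predictions and scores are invented for the example, and `majority_vote` and `top_scored` are hypothetical helper names.

```python
from collections import Counter
from typing import List


def majority_vote(predictions: List[str]) -> str:
    """Frequency-based selection: the most common prediction wins."""
    return Counter(predictions).most_common(1)[0][0]


def top_scored(predictions: List[str], scores: List[float]) -> str:
    """Score-based selection: the prediction attached to the best-scored chain wins."""
    if len(predictions) != len(scores):
        raise ValueError("predictions and scores must have the same length")
    best_index = max(range(len(predictions)), key=lambda i: scores[i])
    return predictions[best_index]


predictions = ["12", "12", "15"]
scores = [0.2, 0.3, 0.9]  # made-up faithfulness scores, one per reasoning chain

by_frequency = majority_vote(predictions)
by_score = top_scored(predictions, scores)
```

In this toy case the two rules disagree: voting picks "12" because it appears twice, while score-based ranking picks "15" because its chain received the highest score.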
Modeling
Base Model: GPT-3.5-Turbo-0613 (main experiments)
Compute: Inference only (no training), with at most 5 agents per task