Problem Definition
Setting: Multi-round multi-agent debate for arithmetic reasoning
Inputs: Arithmetic questions of the form 'a + b * c + d'
Outputs: Final numerical answer derived from consensus
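As a quick illustration of the input format, the sketch below builds one question of the form 'a + b * c + d' and computes its reference answer under standard operator precedence (multiplication before addition). The function names and the operand range are assumptions made for illustration; the source does not state how operands were sampled.

```python
import random

def reference_answer(a: int, b: int, c: int, d: int) -> int:
    # Standard precedence: b * c is evaluated before the two additions.
    return a + b * c + d

def make_question(rng: random.Random, low: int = 0, high: int = 30):
    # The operand range [low, high] is an assumption, not taken from the source.
    a, b, c, d = (rng.randint(low, high) for _ in range(4))
    return f"{a} + {b} * {c} + {d}", reference_answer(a, b, c, d)

print(reference_answer(3, 4, 5, 2))  # 3 + 20 + 2 = 25
question, answer = make_question(random.Random(0))
print(question, "=", answer)
```

A common failure mode on this format is left-to-right evaluation, which for 3 + 4 * 5 + 2 would give 37 instead of 25.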
Pipeline Flow
- Input Question
- Multi-Agent Debate (Rounds 1-3)
- Uncertainty Estimation (via Third-Party)
- Attention Weight Update
- Final Consensus Generation
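The flow above can be sketched as a short control loop. This is a minimal sketch under stated assumptions, not the authors' implementation: the Agent and Observer interfaces, the stub agents, the agreement-based observer, and the confidence-weighted vote used for the final consensus are all hypothetical stand-ins for the Llama3 agents, the ERNIE observer, and whatever consensus rule the source actually uses.

```python
from collections import Counter, defaultdict
from typing import Callable, Dict, List

# Hypothetical interfaces: an agent maps (question, peer answers, peer weights)
# to an answer string; the observer returns one confidence score per answer.
Agent = Callable[[str, List[str], List[float]], str]
Observer = Callable[[str, List[str]], List[float]]

def run_debate(question: str, agents: List[Agent], observer: Observer,
               rounds: int = 3) -> str:
    answers: List[str] = []
    weights: List[float] = []
    for _ in range(rounds):
        # Each agent sees the previous round's answers and their weights.
        answers = [agent(question, answers, weights) for agent in agents]
        # The third-party observer scores the new answers.
        weights = observer(question, answers)
    # Assumed consensus rule: confidence-weighted vote over final answers.
    totals: Dict[str, float] = defaultdict(float)
    for answer, weight in zip(answers, weights):
        totals[answer] += weight
    return max(totals, key=totals.get)

def stubborn(answer: str) -> Agent:
    # Stub agent that never changes its answer.
    return lambda question, peer_answers, peer_weights: answer

def follower(initial: str) -> Agent:
    # Stub agent that adopts the highest-weighted peer answer after round 1.
    def agent(question, peer_answers, peer_weights):
        if not peer_answers:
            return initial
        best = max(range(len(peer_answers)), key=lambda i: peer_weights[i])
        return peer_answers[best]
    return agent

def agreement_observer(question: str, answers: List[str]) -> List[float]:
    # Stand-in scorer: confidence is the share of agents giving that answer.
    counts = Counter(answers)
    return [counts[a] / len(answers) for a in answers]

agents = [stubborn("25"), follower("45"), follower("25")]
result = run_debate("3 + 4 * 5 + 2", agents, agreement_observer)
print(result)
```

In this toy run the one agent that starts with a wrong answer switches to the higher-weighted answer in round 2, so all three agree by the final round.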
System Modules
Agent Cohort
Generate initial answers and reasoning steps
Model or implementation: Llama3 (Agents 1-3)
Third-Party Observer
Provide an external viewpoint and estimate confidence/uncertainty of the debate state
Model or implementation: ERNIE (Agent 4)
Attention Scaler
Adjust the attention weights of the primary model based on calculated confidence
Model or implementation: Custom Algorithm (modifies Llama Attention)
Novel Architectural Elements
- Integration of a heterogeneous third-party model (ERNIE) specifically for confidence estimation within a Llama-based multi-agent loop
- Dynamic modification of Transformer attention weights (Attention Scaling Range Weights) based on external agent confidence scores during inference
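One way to picture range-based attention scaling is shown below. It is a sketch under assumptions, since the source does not give the exact update rule: each agent's response is assumed to occupy a contiguous range of key positions in the context, the attention a query pays to that range is multiplied by the agent's confidence score, and the row is then renormalized to sum to 1. The function name and the span representation are hypothetical.

```python
from typing import List, Sequence, Tuple

def scale_attention_row(attn_row: Sequence[float],
                        spans: Sequence[Tuple[int, int]],
                        confidences: Sequence[float]) -> List[float]:
    """Rescale one row of attention weights by per-span confidence.

    spans holds half-open (start, end) key ranges, one per agent response.
    Positions outside every span keep a scale of 1.0.
    """
    scale = [1.0] * len(attn_row)
    for (start, end), conf in zip(spans, confidences):
        for i in range(start, end):
            scale[i] = conf
    scaled = [a * s for a, s in zip(attn_row, scale)]
    total = sum(scaled)
    if total == 0:
        # Nothing left to normalize; fall back to the original weights.
        return list(attn_row)
    return [x / total for x in scaled]

# Six key positions, two per agent, with uniform attention before scaling.
attn_row = [1 / 6] * 6
spans = [(0, 2), (2, 4), (4, 6)]
confidences = [0.9, 0.1, 0.5]  # illustrative scores, not from the source
out = scale_attention_row(attn_row, spans, confidences)
print([round(x, 3) for x in out])
```

After scaling, the tokens of the high-confidence agent receive the largest share of attention and those of the low-confidence agent the smallest, while the row still sums to 1.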
Modeling
Base Model: Llama3 (Agents 1-3) and ERNIE (Agent 4)
Training Method: None; the method is an inference-time intervention (prompting plus attention manipulation)
Adaptation: None (In-context learning and architectural intervention during inference)
Compute: Inference runs four agent instances concurrently (3 Llama3, 1 ERNIE) for 3 rounds. GPU requirements are not reported.
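A back-of-the-envelope count of model calls follows from these figures. The sketch assumes every agent, including the observer, generates exactly once per round, which the source does not confirm; the helper name is hypothetical.

```python
def calls_per_question(llama_agents: int = 3, observers: int = 1,
                       rounds: int = 3) -> int:
    # Assumption: each agent instance produces one generation per round.
    return (llama_agents + observers) * rounds

per_question = calls_per_question()
print(per_question)        # (3 + 1) * 3 = 12
print(per_question * 100)  # 1200 calls for a 100-problem evaluation set
```

Under the same assumption, a single-model baseline answering each question once would need 100 calls for the same 100 problems.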
Comparison to Prior Work
- vs. Standard Debate: Introduces a third-party heterogeneous model (ERNIE) to break homogeneous consensus
- vs. TokenSAR: Modifies attention weights based on agent-level confidence ranges rather than just token-level relevance
- vs. ReConcile: Adjusts internal attention mechanism during generation rather than just voting on final outputs
Limitations
- Experiments limited to a small dataset of 100 arithmetic problems
- Running four model instances (3 Llama3, 1 ERNIE) over 3 rounds adds significant computational overhead
- Generalization to non-arithmetic domains (e.g., logical reasoning, creative writing) is untested
- Relies on the availability and latency of a third-party model (ERNIE)
Reproducibility
No code release is reported. The method relies on the ERNIE API (Baidu) and Llama3 weights. Prompts are partially described in figures, but no full prompt files are mentioned.