Multi-agent collaboration, Uncertainty quantification in LLMs
The paper improves multi-agent reasoning by integrating a third-party LLM to estimate agent confidence and dynamically adjusting attention weights based on uncertainty, reducing hallucinations in complex tasks.
Core Problem
Standard multi-agent debate systems often use identical models for all agents, leading to monolithic viewpoints and 'hallucination consensus' where agents agree on wrong answers due to restricted knowledge scopes.
Why it matters:
Homogeneous multi-agent systems lack external feedback, limiting the depth and breadth of debate needed for complex reasoning
Without distinct viewpoints or confidence calibration, agents may reinforce each other's errors rather than correcting them
Current methods often treat all agent contributions equally, failing to prioritize more confident or reliable reasoning paths
Concrete Example: In an arithmetic problem like '3+27*3+7' (correct answer: 91), three identical agents might all agree on the incorrect answer '97' because they share the same biases. The proposed method introduces an external agent (ERNIE) to evaluate confidence; if ERNIE signals low confidence in the group's consensus, the system lowers attention to those answers, preventing the error from propagating.
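To make the example concrete, the short Python sketch below evaluates the expression with standard operator precedence and shows how a plain majority vote locks in the wrong answer when all agents share the same mistake. The agent answers and the `majority_vote` helper are illustrative assumptions, not taken from the paper.

```python
from collections import Counter

# Ground truth: multiplication binds tighter than addition.
correct = 3 + 27 * 3 + 7  # 3 + 81 + 7

# Hypothetical answers from three identical agents that share the same error.
agent_answers = [97, 97, 97]

def majority_vote(answers):
    """Return the most common answer and the fraction of agents backing it."""
    value, count = Counter(answers).most_common(1)[0]
    return value, count / len(answers)

consensus, agreement = majority_vote(agent_answers)
print(correct)               # 91
print(consensus, agreement)  # 97 1.0
```

Full agreement (1.0) here says nothing about correctness, which is the gap an external confidence signal is meant to fill.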
Key Novelty
Uncertainty-Driven Third-Party Integration
Introduces a heterogeneous 'third-party' agent (ERNIE) into a homogeneous multi-agent group (Llama) to break monolithic consensus
Calculates a confidence score for each agent's response based on logits and consistency
Dynamically scales the attention weights of the primary model (Llama) to focus more on agents with higher confidence scores during the debate process
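The summary states only that the confidence score is based on logits and consistency; the exact formula is not given here. As a minimal sketch under that caveat, the code below combines two assumed signals: the geometric-mean probability of the generated tokens (from per-step logits) and the fraction of agents that gave the same answer. The function names and the product rule are hypothetical choices for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def logit_confidence(step_logits, chosen_ids):
    """Geometric-mean probability of the tokens that were actually generated."""
    log_probs = [
        math.log(softmax(logits)[tok])
        for logits, tok in zip(step_logits, chosen_ids)
    ]
    return math.exp(sum(log_probs) / len(log_probs))

def consistency(answer, all_answers):
    """Fraction of answers in the debate that match this agent's answer."""
    return sum(a == answer for a in all_answers) / len(all_answers)

def confidence(step_logits, chosen_ids, answer, all_answers):
    # Assumed combination rule: product of the two signals, so the result stays in [0, 1].
    return logit_confidence(step_logits, chosen_ids) * consistency(answer, all_answers)

# Toy usage: two generation steps over a two-token vocabulary.
score = confidence([[2.0, 0.0], [0.0, 0.0]], [0, 0], 91, [91, 97, 97, 91])
print(round(score, 3))
```

A product means either signal can veto the other; a weighted sum would be an equally plausible design if one signal should dominate.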
Architecture
The workflow of the proposed fine-grained reasoning method. It shows the interaction between the user question, the dialogue agents, and the attention weight update mechanism.
Evaluation Highlights
Achieved 94.0% accuracy on an arithmetic dataset, outperforming the standard multi-agent baseline (47.8%)
Surpassed previous uncertainty-based methods like TokenSAR (50.0%) and Entropy-based attention (51.8%) by more than 40 percentage points
Demonstrated that attention scaling based on third-party confidence (Attn-All) yields higher accuracy than standard oracle methods (73.2%)
Breakthrough Assessment
4/10
Shows a very large improvement on a specific arithmetic task, but the dataset is small (100 samples) and the scope is limited to arithmetic. The mechanism of attention scaling based on external confidence is interesting but needs broader validation.
Technical Details
Problem Definition
Setting: Multi-round multi-agent debate for arithmetic reasoning
Inputs: Arithmetic questions in the form 'a + b * c + d'
Outputs: Final numerical answer derived from consensus
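A problem set of this form is easy to generate together with its ground truth. The sketch below is one way to do so; the operand range, the seed, and the `make_problem` helper are assumptions, since the summary reports only the question template and a set size of 100.

```python
import random

def make_problem(rng, low=0, high=30):
    """Sample one question of the form 'a + b * c + d' with its ground truth."""
    a, b, c, d = (rng.randint(low, high) for _ in range(4))
    question = f"{a} + {b} * {c} + {d}"
    answer = a + b * c + d  # multiplication is applied before addition
    return question, answer

rng = random.Random(0)  # fixed seed so the set is reproducible
dataset = [make_problem(rng) for _ in range(100)]

question, answer = dataset[0]
print(question, "=", answer)
```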
Pipeline Flow
Input Question
Multi-Agent Debate (Rounds 1-3)
Uncertainty Estimation (via Third-Party)
Attention Weight Update
Final Consensus Generation
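The control flow of these steps can be sketched as a simple loop. This is a simplified stand-in, not the paper's implementation: real agents would be Llama3 and ERNIE calls, and the paper feeds confidence into attention weights, whereas this sketch swaps in a confidence-weighted vote for the final step. All names (`debate`, the stub agents, `stub_observer`) are hypothetical.

```python
from typing import Callable, List

# Hypothetical interfaces: an agent maps (question, previous answers) to an answer;
# the observer maps (question, current answers) to one confidence per answer.
Agent = Callable[[str, List[str]], str]
Observer = Callable[[str, List[str]], List[float]]

def debate(question, agents, observer, rounds=3):
    answers = []
    confidences = []
    for _ in range(rounds):
        # Each agent answers after seeing the previous round's answers.
        answers = [agent(question, answers) for agent in agents]
        # The third-party observer scores every answer in the current round.
        confidences = observer(question, answers)
    # Stand-in for the attention update: a confidence-weighted vote.
    totals = {}
    for answer, conf in zip(answers, confidences):
        totals[answer] = totals.get(answer, 0.0) + conf
    return max(totals, key=totals.get)

# Stub agents and observer, for illustration only.
def wrong_agent(question, previous):
    return "97"

def right_agent(question, previous):
    return "91"

def stub_observer(question, answers):
    return [0.9 if a == "91" else 0.2 for a in answers]

result = debate("3+27*3+7", [wrong_agent, wrong_agent, right_agent], stub_observer)
print(result)  # 91
```

With these stub confidences the two low-confidence votes for 97 total 0.4, so the single high-confidence answer wins despite being in the minority.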
System Modules
Agent Cohort
Generate initial answers and reasoning steps
Model or implementation: Llama3 (Agents 1-3)
Third-Party Observer
Provide an external viewpoint and estimate confidence/uncertainty of the debate state
Model or implementation: ERNIE (Agent 4)
Attention Scaler
Adjust the attention weights of the primary model based on calculated confidence
Model or implementation: Custom Algorithm (modifies Llama Attention)
Novel Architectural Elements
Integration of a heterogeneous third-party model (ERNIE) specifically for confidence estimation within a Llama-based multi-agent loop
Dynamic modification of Transformer attention weights (Attention Scaling Range Weights) based on external agent confidence scores during inference
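A minimal sketch of range-based attention scaling follows. It assumes the weights are applied to post-softmax attention probabilities and then renormalized; this summary does not say whether the paper scales before or after the softmax, so treat that choice, and the `scale_attention` helper itself, as illustrative.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scale_attention(scores, ranges, weights):
    """Reweight one query's attention by per-range confidence, then renormalize.

    scores  : raw attention scores over all key positions
    ranges  : (start, end) index pairs, end exclusive, one per agent response
    weights : one non-negative weight per range (e.g. that agent's confidence)
    """
    probs = softmax(scores)
    factors = [1.0] * len(probs)  # positions outside every range keep factor 1
    for (start, end), w in zip(ranges, weights):
        for i in range(start, end):
            factors[i] = w
    scaled = [p * f for p, f in zip(probs, factors)]
    total = sum(scaled)
    if total == 0:
        raise ValueError("all attention mass was scaled to zero")
    return [x / total for x in scaled]

# Three agent responses of two tokens each; the first agent is the most trusted.
attn = scale_attention([0.0] * 6, [(0, 2), (2, 4), (4, 6)], [0.9, 0.3, 0.3])
print([round(a, 3) for a in attn])  # [0.3, 0.3, 0.1, 0.1, 0.1, 0.1]
```

Renormalizing keeps the result a valid probability distribution, so only the relative weights between agents matter.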
Modeling
Base Model: Llama3 (Agents 1-3) and ERNIE (Agent 4)
Training Method: Inference-time intervention (Prompting + Attention Manipulation)
Adaptation: None (In-context learning and architectural intervention during inference)
Compute: Inference requires running 4 concurrent agent instances (3 Llama, 1 ERNIE) for 3 rounds. Specific GPU requirements not reported.
Comparison to Prior Work
vs. Standard Debate: Introduces a third-party heterogeneous model (ERNIE) to break homogeneous consensus
vs. TokenSAR: Modifies attention weights based on agent-level confidence ranges rather than just token-level relevance
vs. ReConcile: Adjusts internal attention mechanism during generation rather than just voting on final outputs
Limitations
Experiments limited to a small dataset of 100 arithmetic problems
Computational overhead increases significantly due to running multiple large models (4 agents)
Generalization to non-arithmetic domains (e.g., logical reasoning, creative writing) is untested
Relies on the availability and latency of a third-party model (ERNIE)
Reproducibility
No code is provided. The method relies on the ERNIE API (Baidu) and Llama3 weights. The prompts are partially described in figures, but no full prompt files are mentioned.
Attention Scaling: Multiplying the attention scores by a scalar weight derived from confidence metrics to prioritize reliable information sources
Logits: The raw, unnormalized prediction scores generated by the final layer of a neural network before the softmax activation
TokenSAR: The token-level variant of SAR (Shifting Attention to Relevance), a method that quantifies uncertainty by focusing on relevant tokens and sentences
Hallucination: When an LLM generates content that is nonsensical or unfaithful to the provided source content or real-world facts
Oracle: A baseline method that assumes access to the ground truth or an optimal selection strategy to establish an upper bound on performance
Range Weights: Weights assigned to specific segments of the input sequence (e.g., a specific agent's response) to modulate how much attention the model pays to that segment