Enhancing Multi-Agent Consensus through Third-Party LLM Integration: Analyzing Uncertainty and Mitigating Hallucinations in Large Language Models

📝 Paper Summary

Multi-agent collaboration
Uncertainty quantification in LLMs

The paper improves multi-agent reasoning by integrating a third-party LLM to estimate agent confidence and dynamically adjusting attention weights based on uncertainty, reducing hallucinations in complex tasks.

Core Problem

Standard multi-agent debate systems often use identical models for all agents, leading to monolithic viewpoints and 'hallucination consensus' where agents agree on wrong answers due to restricted knowledge scopes.

Why it matters:

Homogeneous multi-agent systems lack external feedback, limiting the depth and breadth of debate needed for complex reasoning
Without distinct viewpoints or confidence calibration, agents may reinforce each other's errors rather than correcting them
Current methods often treat all agent contributions equally, failing to prioritize more confident or reliable reasoning paths

Concrete Example: In an arithmetic problem like '3+27*3+7' (correct answer: 91), three identical agents might incorrectly agree on '97' due to shared biases. The proposed method introduces an external agent (ERNIE) to evaluate confidence; if ERNIE signals low confidence in the group's consensus, the system lowers attention to those answers, preventing the error from propagating.
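The arithmetic in this example can be checked directly. The short Python check below relies only on standard operator precedence (multiplication before addition); the variable names are illustrative and not from the paper.

```python
# Multiplication binds tighter than addition, so 27*3 is evaluated first.
expression_value = 3 + 27 * 3 + 7   # 3 + 81 + 7
wrong_consensus = 97                # the answer the identical agents agree on in the example

print(expression_value)                      # 91
print(expression_value == wrong_consensus)   # False
```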

Key Novelty

Uncertainty-Driven Third-Party Integration

Introduces a heterogeneous 'third-party' agent (ERNIE) into a homogeneous multi-agent group (Llama) to break monolithic consensus
Calculates a confidence score for each agent's response based on logits and consistency
Dynamically scales the attention weights of the primary model (Llama) to focus more on agents with higher confidence scores during the debate process
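The summary states that the confidence score draws on logits and consistency but does not give the formula. The sketch below is one plausible reading, not the paper's method: it averages the top softmax probability per generated token and mixes it with the fraction of agents that gave the same answer. The function names and the mixing weight `alpha` are assumptions made for illustration.

```python
import math
from collections import Counter

def token_confidence(logits_per_token):
    """Mean probability of the most likely token at each generation step."""
    top_probs = []
    for logits in logits_per_token:
        m = max(logits)                                  # subtract max for numerical stability
        exps = [math.exp(x - m) for x in logits]
        top_probs.append(max(exps) / sum(exps))
    return sum(top_probs) / len(top_probs)

def consistency(answer, all_answers):
    """Fraction of agents whose final answer matches this one."""
    return Counter(all_answers)[answer] / len(all_answers)

def confidence_score(logits_per_token, answer, all_answers, alpha=0.5):
    # alpha is a hypothetical mixing weight, not a value reported in the paper.
    return (alpha * token_confidence(logits_per_token)
            + (1 - alpha) * consistency(answer, all_answers))

answers = ["97", "97", "91"]
toy_logits = [[2.0, 0.0], [0.0, 0.0]]   # made-up logits for two generated tokens
print(confidence_score(toy_logits, "97", answers))
```

Note that consistency alone would reward the wrong majority answer here, which is why the summary stresses an external, heterogeneous source of confidence.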

Architecture

The workflow of the proposed fine-grained reasoning method. It shows the interaction between the user question, the dialogue agents, and the attention weight update mechanism.

Evaluation Highlights

Achieved 94.0% accuracy on an arithmetic dataset, outperforming the standard multi-agent baseline (47.8%)
Surpassed previous uncertainty-based methods like TokenSAR (50.0%) and Entropy-based attention (51.8%) by 42 to 44 percentage points
Demonstrated that attention scaling based on third-party confidence (Attn-All) yields higher accuracy than standard oracle methods (73.2%)

Breakthrough Assessment

4/10

Shows a very large improvement on a specific arithmetic task, but the dataset is small (100 samples) and the scope is limited to arithmetic. The mechanism of attention scaling based on external confidence is interesting but needs broader validation.

⚙️ Technical Details

Problem Definition

Setting: Multi-round multi-agent debate for arithmetic reasoning

Inputs: Arithmetic questions in the form 'a + b * c + d'

Outputs: Final numerical answer derived from consensus
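A dataset of this shape is easy to reproduce in spirit. The sketch below generates 100 questions of the form 'a+b*c+d' with ground-truth answers; the operand range (0 to 30) and the fixed seed are assumptions, since the summary does not say how the operands were sampled.

```python
import random

def make_problem(rng, low=0, high=30):
    """Return one question string of the form 'a+b*c+d' and its correct answer."""
    a, b, c, d = (rng.randint(low, high) for _ in range(4))
    question = f"{a}+{b}*{c}+{d}"
    answer = a + b * c + d          # multiplication is applied before addition
    return question, answer

rng = random.Random(0)              # fixed seed so the toy dataset is reproducible
dataset = [make_problem(rng) for _ in range(100)]

question, answer = dataset[0]
print(question, "=", answer)
```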

Pipeline Flow

Input Question
Multi-Agent Debate (Round 1-3)
Uncertainty Estimation (via Third-Party)
Attention Weight Update
Final Consensus Generation
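The pipeline steps above can be sketched as a small loop. This is a toy stand-in under stated assumptions: the agents and the observer are plain Python functions rather than Llama3 or ERNIE calls, the observer here simply recomputes the answer, and the final consensus is a confidence-weighted vote rather than a generated response. All names are hypothetical.

```python
from collections import Counter

def solve(question):
    """Evaluate a question of the form 'a+b*c+d' (the only form handled here)."""
    left, middle, right = question.split("+")
    b, c = middle.split("*")
    return int(left) + int(b) * int(c) + int(right)

def biased_agent(question, previous_answers, weights):
    return "97"                                   # always repeats the shared wrong answer

def careful_agent(question, previous_answers, weights):
    return str(solve(question))

def toy_observer(question, answers):
    """Stand-in for the third-party model: one confidence value per answer."""
    truth = str(solve(question))
    return [0.9 if a == truth else 0.1 for a in answers]

def run_debate(question, agents, observer, rounds=3):
    weights = [1.0] * len(agents)                 # start with uniform weights
    answers = []
    for _ in range(rounds):
        # Each agent sees the previous round's answers and the current weights.
        answers = [agent(question, answers, weights) for agent in agents]
        confidences = observer(question, answers)
        total = sum(confidences)
        weights = [c * len(agents) / total for c in confidences]
    tally = Counter()                             # confidence-weighted vote
    for answer, weight in zip(answers, weights):
        tally[answer] += weight
    return tally.most_common(1)[0][0]

final = run_debate("3+27*3+7", [biased_agent, biased_agent, careful_agent], toy_observer)
print(final)   # 91
```

Even though two of the three toy agents agree on '97', the low confidence assigned to them lets the minority answer win, which mirrors the 'hallucination consensus' case described in the summary.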

System Modules

Agent Cohort

Generate initial answers and reasoning steps

Model or implementation: Llama3 (Agents 1-3)

Third-Party Observer

Provide an external viewpoint and estimate confidence/uncertainty of the debate state

Model or implementation: ERNIE (Agent 4)

Attention Scaler

Adjust the attention weights of the primary model based on calculated confidence

Model or implementation: Custom Algorithm (modifies Llama Attention)

Novel Architectural Elements

Integration of a heterogeneous third-party model (ERNIE) specifically for confidence estimation within a Llama-based multi-agent loop
Dynamic modification of Transformer attention weights (Attention Scaling Range Weights) based on external agent confidence scores during inference
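To make the attention scaling idea concrete, here is a minimal sketch for a single query position. It assumes the scaling is applied to the softmax-normalized attention and then renormalized; the summary does not specify whether the paper scales before or after the softmax, and the function and parameter names are illustrative.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scale_attention(scores, ranges, confidences):
    """Reweight one query's attention over key positions.

    scores:      raw attention scores, one per key position
    ranges:      (start, end) token spans, end exclusive, one per agent response
    confidences: one scalar per span; positions outside all spans keep weight 1.0
    """
    probs = softmax(scores)
    weights = [1.0] * len(probs)
    for (start, end), conf in zip(ranges, confidences):
        for i in range(start, end):
            weights[i] = conf
    scaled = [p * w for p, w in zip(probs, weights)]
    total = sum(scaled)
    return [s / total for s in scaled]            # renormalize so the row sums to 1

# Two agent responses of two tokens each; the second agent is trusted more.
attn = scale_attention([0.0, 0.0, 0.0, 0.0], [(0, 2), (2, 4)], [0.2, 0.8])
print(attn)   # approximately [0.1, 0.1, 0.4, 0.4]
```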

Modeling

Base Model: Llama3 (Agents 1-3) and ERNIE (Agent 4)

Training Method: Inference-time intervention (Prompting + Attention Manipulation)

Adaptation: None (In-context learning and architectural intervention during inference)

Compute: Inference requires running 4 concurrent agent instances (3 Llama, 1 ERNIE) for 3 rounds. Specific GPU requirements not reported.

Comparison to Prior Work

vs. Standard Debate: Introduces a third-party heterogeneous model (ERNIE) to break homogeneous consensus
vs. TokenSAR: Modifies attention weights based on agent-level confidence ranges rather than just token-level relevance
vs. ReConcile: Adjusts internal attention mechanism during generation rather than just voting on final outputs

Limitations

Experiments limited to a small dataset of 100 arithmetic problems
Computational overhead increases significantly due to running multiple large models (4 agents)
Generalization to non-arithmetic domains (e.g., logical reasoning, creative writing) is untested
Relies on the availability and latency of a third-party model (ERNIE)

Reproducibility

No code release is mentioned. The method relies on the ERNIE API (Baidu) and Llama3 weights. Prompts are partially shown in figures, but no full prompt files are mentioned.

📊 Experiments & Results

Evaluation Setup

Arithmetic reasoning tasks

Benchmarks:

Arithmetic Dataset (mathematical calculation of the form a+b*c+d) [New]

Metrics:

Accuracy
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline Method	Baseline	This Paper	Δ
Arithmetic Dataset	Accuracy	Standard multi-agent debate	0.478	0.940	+0.462
Arithmetic Dataset	Accuracy	TokenSAR	0.500	0.940	+0.440
Arithmetic Dataset	Accuracy	Entropy-based attention	0.518	0.940	+0.422
Arithmetic Dataset	Accuracy	Oracle	0.732	0.940	+0.208
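The Δ column can be re-derived from the reported accuracies. In the check below, the baseline labels are taken from the Evaluation Highlights section; the dictionary layout is just a convenient way to pair them with the numbers.

```python
proposed = 0.940
baselines = {
    "Standard multi-agent debate": 0.478,
    "TokenSAR": 0.500,
    "Entropy-based attention": 0.518,
    "Oracle": 0.732,
}

# Absolute accuracy gain of the proposed method over each baseline.
deltas = {name: round(proposed - acc, 3) for name, acc in baselines.items()}
for name, delta in deltas.items():
    print(f"{name}: +{delta:.3f}")
```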

Experiment Figures

The multi-agent debate setup involving 4 agents over 3 rounds.

Main Takeaways

Integrating a third-party model (ERNIE) significantly boosts accuracy in arithmetic tasks compared to homogeneous Llama-only teams.
Dynamically scaling attention weights based on confidence is more effective than simple prompting or voting mechanisms.
The method reaches 94.0% accuracy on the constructed arithmetic dataset, roughly double the 47.8% of the standard multi-agent baseline.

📚 Prerequisite Knowledge

Prerequisites

Transformer attention mechanisms (Query, Key, Value)
Multi-agent debate frameworks
Uncertainty quantification (Logits, Entropy)

Key Terms

Attention Scaling: Multiplying the attention scores by a scalar weight derived from confidence metrics to prioritize reliable information sources

Logits: The raw, unnormalized prediction scores generated by the final layer of a neural network before the softmax activation
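As a small illustration of how logits relate to uncertainty, the sketch below converts logits to probabilities with a softmax and computes the Shannon entropy of the result. The example logits are made up; higher entropy means the model spreads probability over more options and is therefore less certain.

```python
import math

def softmax(logits):
    m = max(logits)                               # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)

confident = softmax([8.0, 0.0, 0.0])              # one option clearly dominates
uncertain = softmax([0.0, 0.0, 0.0])              # all options equally likely

print(entropy(confident))
print(entropy(uncertain))                         # equals ln(3) for three equal options
```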

TokenSAR: The token-level variant of SAR (Shifting Attention to Relevance), a method that quantifies uncertainty by focusing on the more relevant tokens and sentences

Hallucination: When an LLM generates content that is nonsensical or unfaithful to the provided source content or real-world facts

Oracle: A baseline method that assumes access to the ground truth or an optimal selection strategy to establish an upper bound on performance

Range Weights: Weights assigned to specific segments of the input sequence (e.g., a specific agent's response) to modulate how much attention the model pays to that segment