Multi-expert Prompting Improves Reliability, Safety and Usefulness of Large Language Models

📝 Paper Summary

Multi-agent Prompt Engineering Safety and Alignment

Multi-expert Prompting simulates multiple experts within a single LLM to generate diverse perspectives, then aggregates them using the Nominal Group Technique to improve truthfulness and safety without fine-tuning.

Core Problem

Single-expert prompting strategies (like ExpertPrompting) bias the model toward a narrow, potentially incorrect viewpoint and struggle with open-ended questions requiring multifaceted answers.

Why it matters:

Single perspectives often fail to address the complexity of open-ended queries (e.g., ethical dilemmas), leading to dismissive or biased responses.
Blind reliance on a single generated expert identity can amplify hallucinations or falsehoods if that specific persona is ill-suited or biased.

Concrete Example: When asked 'Is it ethical to eat meat?', a single-expert prompt might adopt a strict Ethicist persona and simply say 'No, it is unethical.' Multi-expert Prompting generates a Doctor, Physiotherapist, and Surgeon to discuss nutrition and health, offering a nuanced answer that acknowledges multiple valid viewpoints.

Key Novelty

Nominal Group Technique (NGT) for LLM Aggregation

Simulates multiple diverse experts (identities + short descriptions) in parallel to answer an instruction.
Aggregates these expert responses in a single turn using a 7-step process derived from the Nominal Group Technique (NGT), a human decision-making framework.
Explicitly identifies agreed, conflicted, and isolated viewpoints before synthesizing a final answer and selecting the best option.
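
The agreed/isolated split can be illustrated with a toy sketch. It assumes each expert's answer has already been reduced to a set of short viewpoint strings, and it treats "agreed" as held by more than half of the experts and "isolated" as held by exactly one. The function name bucket_viewpoints, the thresholds, and the sample viewpoints are all assumptions for illustration. In the method itself the LLM makes these judgments in natural language, and detecting conflicts needs semantic comparison that this sketch does not attempt.

```python
from collections import Counter

def bucket_viewpoints(expert_viewpoints):
    """Split viewpoints into agreed / isolated / other buckets.

    expert_viewpoints: list of sets, one set of viewpoint strings per expert.
    The thresholds (majority = agreed, single holder = isolated) are
    illustrative assumptions, not values taken from the paper.
    """
    n = len(expert_viewpoints)
    # Count how many experts hold each viewpoint.
    counts = Counter(v for views in expert_viewpoints for v in views)
    agreed = sorted(v for v, c in counts.items() if c > n / 2)
    # A viewpoint held by one expert is isolated unless one expert is a majority.
    isolated = sorted(v for v, c in counts.items() if c == 1 and c <= n / 2)
    other = sorted(v for v in counts if v not in agreed and v not in isolated)
    return {"agreed": agreed, "isolated": isolated, "other": other}

# Hypothetical viewpoints for the meat-eating question.
views = [
    {"protein source", "animal welfare concerns"},
    {"protein source", "environmental cost"},
    {"protein source", "animal welfare concerns"},
]
buckets = bucket_viewpoints(views)
```

The "other" bucket holds viewpoints shared by some experts but not a majority; with three experts it is always empty, since any viewpoint held by two experts is already a majority.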

Architecture

The two-step workflow of Multi-expert Prompting: (1) Expert & Response Generation and (2) Expert Response Aggregation.

Evaluation Highlights

Achieves state-of-the-art 89.35% truthfulness on TruthfulQA with ChatGPT, outperforming the best baseline by 8.69 percentage points.
Reduces toxicity to 0.00% on the BOLD dataset using Mistral-7B, so that no toxic content is detected, whereas the baselines retain some.
Wins 76.5% of usefulness comparisons against baselines on ExpertQA open-ended questions.

Breakthrough Assessment

8/10

Significant improvement in truthfulness and safety via a training-free prompting strategy. The adaptation of a formal management science technique (NGT) to LLM reasoning is a novel and effective mechanism.

⚙️ Technical Details

Problem Definition

Setting: Long-form text generation under constraints of truthfulness, safety, and usefulness.

Inputs: Natural language instruction I

Outputs: Final response A, selected from among the individual expert responses {A_1, ..., A_n} and the aggregated response A_comb

Pipeline Flow

Expert Generation: LLM generates n expert identities (Role + 1-sentence description).
Expert Response: LLM generates n independent responses, one for each expert identity.
Aggregation (7 Subtasks): LLM performs 7 steps in a single chain-of-thought (CoT) pass to merge responses (Agreed -> Conflicted -> Resolved -> Isolated -> Collected -> Aggregated Response -> Selection).
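
The three stages above can be sketched as a small pipeline. This is a minimal illustration, not the authors' implementation: the LLM type is a hypothetical "prompt in, text out" callable, and the prompt wording is invented for the sketch (the actual prompts are in the paper's appendix and repository). Only the call structure and the order of the seven subtasks follow the description above.

```python
from typing import Callable, List

LLM = Callable[[str], str]  # hypothetical interface: prompt in, text out

# Subtask names follow the order listed in the pipeline description.
SUBTASKS = [
    "Agreed viewpoints",
    "Conflicted viewpoints",
    "Resolved conflicts",
    "Isolated viewpoints",
    "Collected viewpoints",
    "Aggregated response",
    "Selection of the best response",
]

def generate_experts(llm: LLM, instruction: str, n: int) -> List[str]:
    """Stage 1a (1 call): ask for n identities, one 'Role: description' per line."""
    prompt = (
        f"List {n} experts for the instruction below, one per line, "
        f"as 'Role: one-sentence description'.\nInstruction: {instruction}"
    )
    lines = [ln.strip() for ln in llm(prompt).splitlines() if ln.strip()]
    return lines[:n]

def answer_as_experts(llm: LLM, instruction: str, experts: List[str]) -> List[str]:
    """Stage 1b (n calls): each expert answers independently of the others."""
    return [
        llm(f"You are this expert: {e}\nAnswer the instruction: {instruction}")
        for e in experts
    ]

def aggregate(llm: LLM, instruction: str, answers: List[str]) -> str:
    """Stage 2 (1 call): one prompt that walks through all seven subtasks."""
    numbered = "\n".join(f"Expert {i}: {a}" for i, a in enumerate(answers, 1))
    steps = "\n".join(f"Step {i}: {s}" for i, s in enumerate(SUBTASKS, 1))
    return llm(
        f"Instruction: {instruction}\n{numbered}\n"
        f"Work through these steps in order:\n{steps}"
    )

def multi_expert_answer(llm: LLM, instruction: str, n: int = 3) -> str:
    """Run expert generation, expert answering, then single-turn aggregation."""
    experts = generate_experts(llm, instruction, n)
    answers = answer_as_experts(llm, instruction, experts)
    return aggregate(llm, instruction, answers)
```

Passing any function that maps a prompt string to a completion string as llm is enough to try the flow; a stub that returns canned text works for testing the call structure.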

System Modules

Expert Generator (Expert Generation)

Generate diverse expert identities tailored to the instruction.

Model or implementation: Same LLM as backbone (ChatGPT or Mistral)

Expert Responder (Expert & Response Generation)

Generate answers from the perspective of each specific expert.

Model or implementation: Same LLM as backbone

Aggregator (Aggregation)

Synthesize expert responses using NGT-inspired subtasks.

Model or implementation: Same LLM as backbone

Selector (Aggregation)

Select the single best response among the aggregated one and individual expert ones.

Model or implementation: Same LLM as backbone

Novel Architectural Elements

7-step single-turn CoT aggregation prompt based on Nominal Group Technique (NGT)
Selection mechanism that chooses between the aggregated response vs. individual expert responses (rather than just refining one)
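
The selection step can be pictured as choosing one candidate from the n expert responses plus the aggregated one. The sketch below assumes a hypothetical output convention in which the aggregation text contains a marker such as "Selected: k", where k = 0 means the aggregated response and k = 1..n means expert k. This marker format and the function name pick_final are assumptions for illustration, not the paper's output format.

```python
import re
from typing import List

def pick_final(selection_text: str, expert_answers: List[str], aggregated: str) -> str:
    """Return the candidate named by a 'Selected: k' marker.

    k = 0 (or a missing or out-of-range marker) falls back to the aggregated
    response; k in 1..n returns the k-th expert's answer.
    """
    match = re.search(r"Selected:\s*(\d+)", selection_text)
    if match is None:
        return aggregated
    k = int(match.group(1))
    if 1 <= k <= len(expert_answers):
        return expert_answers[k - 1]
    return aggregated

final = pick_final("Step 7. Selected: 2", ["answer A", "answer B"], "combined answer")
```

Falling back to the aggregated response on a parse failure is a design choice of this sketch; it keeps the pipeline usable when the model does not follow the output format exactly.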

Modeling

Base Model: Evaluated on gpt-3.5-turbo-0613 (ChatGPT) and Mistral-7B-Instruct-v0.2

Compute: Inference-only. Requires n+2 calls per instruction (1 for expert generation, n for expert answers, 1 for aggregation).
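
The call count can be written out as simple arithmetic. The helper names below are hypothetical, and the instruction count in the example is an arbitrary number chosen for illustration, not a dataset size from the paper.

```python
def calls_per_instruction(n_experts: int) -> int:
    """1 call for expert generation + n expert answers + 1 aggregation call."""
    if n_experts < 1:
        raise ValueError("need at least one expert")
    return 1 + n_experts + 1

def total_calls(n_experts: int, n_instructions: int) -> int:
    """Total LLM calls needed to process a batch of instructions."""
    return calls_per_instruction(n_experts) * n_instructions

per_item = calls_per_instruction(3)   # 3 experts -> 5 calls
batch = total_calls(3, 100)           # 100 instructions -> 500 calls
```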

Comparison to Prior Work

vs. ExpertPrompting: Uses multiple experts and aggregates viewpoints rather than relying on a single persona.
vs. Multi-agent Debate: Aggregates in a single turn using structured NGT steps rather than iterative multi-turn debate.
vs. Universal Self-Consistency (USC): Synthesizes a new response from components (agreed/conflicted/unique points) rather than just selecting one existing response.

Limitations

Not ideal for short-form tasks (e.g., True/False, numerical reasoning) where complex aggregation is unnecessary overhead.
Requires strong instruction-following capabilities; weaker models may fail the 7-step aggregation prompt.
Treats all expert opinions equally (unweighted), which might not reflect real-world expertise disparity.
Risk of hallucinated expert identities or roles for obscure domains.

Reproducibility

Code: https://github.com/doxuanlong/multi-expert-prompting

Code and data are publicly available in the repository above. Prompts are detailed in Appendix C of the paper. No training is required (inference-only method).

📊 Experiments & Results

Evaluation Setup

Zero-shot generation on standard reliability and safety benchmarks.

Benchmarks:

TruthfulQA (Truthfulness/Hallucination)
FactualityPrompt (Factuality)
BOLD (Toxicity/Bias)
HONEST (Hurtfulness)
ExpertQA (Open-ended QA: Informativeness/Usefulness)

Metrics:

True percentage (TruthfulQA)
Hallucinated NE Error (FactualityPrompt)
Toxicity percentage (BOLD)
HurtLex score (HONEST)
Win-rate (vs baselines on ExpertQA)
Statistical methodology: t-test for statistical significance (p < 0.01 reported)

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
TruthfulQA	True (%)	80.66	89.35	+8.69
TruthfulQA	True (%)	80.34	87.15	+6.81
FactualityPrompt	Non-factual Error	15.66	9.45	-6.21
BOLD	Toxicity	0.129	0.000	-0.129
HONEST	Hurtfulness (Queer)	0.038	0.004	-0.034
TruthfulQA	True (%)	82.37	89.35	+6.98
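
As a quick consistency check, the Δ column can be recomputed from the Baseline and This Paper columns. The snippet only re-derives the differences from the values in the table above; the variable and function names are illustrative.

```python
# (benchmark, baseline, this_paper, reported_delta), copied from the table.
rows = [
    ("TruthfulQA", 80.66, 89.35, 8.69),
    ("TruthfulQA", 80.34, 87.15, 6.81),
    ("FactualityPrompt", 15.66, 9.45, -6.21),
    ("BOLD", 0.129, 0.000, -0.129),
    ("HONEST", 0.038, 0.004, -0.034),
    ("TruthfulQA", 82.37, 89.35, 6.98),
]

def delta(baseline: float, ours: float) -> float:
    """Difference 'this paper minus baseline', rounded to the table's precision."""
    return round(ours - baseline, 3)

# Rows whose recomputed delta disagrees with the reported one (tolerance for floats).
mismatches = [r for r in rows if abs(delta(r[1], r[2]) - r[3]) > 1e-6]
```

An empty mismatches list means every reported Δ equals the difference of the two value columns. For TruthfulQA a positive Δ is an improvement, while for the error, toxicity, and hurtfulness metrics a negative Δ is.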

Experiment Figures

Win-rate comparison for Informativeness and Usefulness on ExpertQA.

Bar chart comparing Multi-expert Prompting against a baseline that is explicitly instructed to be truthful/factual/safe.

Main Takeaways

Multi-expert Prompting consistently outperforms single-expert and debate baselines across truthfulness, factuality, and safety metrics.
The aggregation step (NGT) is critical; naïve aggregation performs significantly worse.
Optimal performance is achieved with 3 experts; adding more experts (5, 10) yields diminishing returns or degrades performance due to noise.
The model selects the aggregated response over the individual expert responses more than 90% of the time, which supports the quality of the synthesis.

📚 Prerequisite Knowledge

Prerequisites

Prompt Engineering (Chain-of-Thought, Role-playing)
Large Language Models (LLMs)
Multi-agent Debate concepts

Key Terms

Nominal Group Technique (NGT): A structured decision-making process for groups that encourages independent idea generation followed by structured voting/aggregation to minimize social bias.

ExpertPrompting: A baseline method where the LLM generates a specific expert identity and then answers the query adopting that persona.

Chain-of-Thought (CoT): A prompting technique where the model generates intermediate reasoning steps before the final answer.

Zero-shot: Evaluating the model without providing any task-specific examples in the prompt.

TruthfulQA: A benchmark dataset designed to measure whether language models mimic human falsehoods or generate truthful answers.

BOLD: Bias in Open-Ended Language Generation Dataset—a benchmark for measuring toxicity and bias.

HONEST: A benchmark for measuring hurtful sentence completions in language models.

Krippendorff's alpha: A statistical measure of the agreement achieved when coding a set of units of analysis.