Diversify-verify-adapt: Efficient and Robust Retrieval-Augmented Ambiguous Question Answering

📝 Paper Summary

Modularized RAG pipeline
Ambiguous Question Answering

Diva improves ambiguous question answering by diversifying retrieval via pseudo-interpretations and adaptively choosing between RAG and closed-book generation based on retrieval quality verification.

Core Problem

Single-step RAG often fails to retrieve passages covering all interpretations of an ambiguous question (low recall), while Iterative RAG (like ToC) is computationally expensive and slow.

Why it matters:

Over 50% of search queries are ambiguous, requiring systems to cover multiple valid user intents rather than a single answer
Current iterative methods require about 5.5 exploration steps per query, drastically increasing latency and API costs
RAG systems suffer significant factual accuracy degradation when retrieved passages contain noise or irrelevant information

Concrete Example: For the question 'Who played the Weasley brothers in Harry Potter?', a standard retriever might only find passages about Ron Weasley and miss other brothers such as Percy. Iterative approaches eventually find them, but only after many retrieval steps. Diva instead infers the sub-questions 'Who played Ron?' and 'Who played Percy?' upfront and retrieves passages for all of them in a single step.

Key Novelty

Diversify-Verify-Adapt (Diva)

**Retrieval Diversification (RD):** mimics human reasoning to infer 'pseudo-interpretations' of an ambiguous question upfront, using them to retrieve a diverse set of passages in a single step rather than iteratively.
**Retrieval Verification (RV):** defines a new quality criterion (Useful, PartialUseful, Useless) for ambiguous QA and uses an LLM to grade whether retrieved passages cover the inferred interpretations.
**Adaptive Generation (AG):** dynamically selects the best strategy: use RAG for useful/partial passages, or fall back to the LLM's internal knowledge (closed-book) if retrieval is deemed 'Useless'.

Architecture

Comparison of Vanilla RAG, Iterative RAG (ToC), and the proposed Diva framework architectures.

Evaluation Highlights

Outperforms state-of-the-art Iterative RAG (ToC) by +1.9 D-F1 on ASQA (Ambiguous QA benchmark) using GPT-3.5.
Achieves ~3x faster inference speed compared to Iterative RAG (ToC) while maintaining superior accuracy.
Reduces input token consumption by >50% compared to Iterative RAG methods.

Breakthrough Assessment

7/10

Strong practical contribution addressing the latency/cost bottleneck of Iterative RAG while improving accuracy. The 'verify and adapt' mechanism effectively handles retrieval failure, a common RAG pain point.

⚙️ Technical Details

Problem Definition

Setting: Ambiguous Question Answering where a question q_i has multiple interpretations Q_i and corresponding answers A_i.

Inputs: Ambiguous question q_i

Outputs: Comprehensive response r_i covering all plausible answers A_i

Pipeline Flow

Group 1: Retrieval Diversification (RD) → Infer Pseudo-Interpretations → Retrieve & Prune
Group 2: Adaptive Generation (AG) → Retrieval Verification (RV) → Conditional Generation
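
The two groups above can be sketched as a single orchestration function. This is a minimal illustration, not the authors' implementation: `diva_answer` and its four injected callables (`infer_interpretations`, `retrieve`, `verify`, `generate`) are hypothetical names, and falling back to the original question when no interpretation is inferred is an assumption.

```python
from typing import Callable, List


def diva_answer(
    question: str,
    infer_interpretations: Callable[[str], List[str]],
    retrieve: Callable[[str], List[str]],
    verify: Callable[[str, List[str], List[str]], str],
    generate: Callable[[str, List[str]], str],
) -> str:
    """Hypothetical sketch of the Diva flow; every callable is an assumption."""
    # Group 1: Retrieval Diversification.
    interpretations = infer_interpretations(question)
    # Assumption: query with the raw question if nothing was inferred.
    queries = interpretations or [question]
    passages: List[str] = []
    for query in queries:
        for passage in retrieve(query):
            if passage not in passages:  # de-duplicate, keep first-seen order
                passages.append(passage)

    # Group 2: Adaptive Generation.
    label = verify(question, interpretations, passages)
    if label == "Useless":
        # Closed-book: generate from the LLM's own knowledge, no passages.
        return generate(question, [])
    # "Useful" or "PartialUseful": generate with the retrieved passages.
    return generate(question, passages)
```

Passing the components in as callables keeps the sketch independent of any particular LLM or retriever client; each one is called once per query rather than inside an iterative loop.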

System Modules

Pseudo-Interpretation Generator (Retrieval Diversification)

Identify ambiguous parts of the question and infer potential distinct interpretations

Model or implementation: LLM (GPT-3.5 or GPT-4)

Diverse Retriever (Retrieval Diversification)

Retrieve passages relevant to each pseudo-interpretation to ensure coverage

Model or implementation: ColBERT (retriever) + SentenceBERT (pruning)
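
A rough sketch of similarity-based pruning follows. It makes several assumptions: `embed` is a toy bag-of-words stand-in for a SentenceBERT encoder, the candidate passages stand in for ColBERT output, and keeping the top-`k` passages per pseudo-interpretation is an illustrative rule rather than the paper's exact criterion.

```python
import math
import re
from collections import Counter
from typing import List


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a stand-in for a SentenceBERT embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def prune(interpretations: List[str], candidates: List[str], k: int = 1) -> List[str]:
    """Keep the top-k candidates per interpretation, merged without duplicates."""
    cand_vecs = [(p, embed(p)) for p in candidates]
    kept: List[str] = []
    for interp in interpretations:
        q = embed(interp)
        ranked = sorted(cand_vecs, key=lambda pv: cosine(q, pv[1]), reverse=True)
        for passage, _ in ranked[:k]:
            if passage not in kept:
                kept.append(passage)
    return kept


interpretations = [
    "Who played Ron Weasley in Harry Potter?",
    "Who played Percy Weasley in Harry Potter?",
]
candidates = [
    "Rupert Grint played Ron Weasley in the Harry Potter films.",
    "Chris Rankin played Percy Weasley in the Harry Potter films.",
    "The Weasley family lives at the Burrow.",
]
print(prune(interpretations, candidates, k=1))
```

With `k=1`, each pseudo-interpretation keeps its closest passage, so the off-topic third candidate is dropped while both brothers stay covered.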

Retrieval Verifier (Adaptive Generation)

Classify the retrieved passages as Useful, PartialUseful, or Useless

Model or implementation: LLM (GPT-3.5 or GPT-4)
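
One way to picture the three-way grade is as a function of per-interpretation coverage flags. The sketch below is an assumption-laden illustration: in the paper an LLM produces the grade, whereas here the hypothetical `grade_retrieval` takes a list of booleans (one per pseudo-interpretation, presumed to come from such an LLM judgment) and applies an all/some/none rule.

```python
from enum import Enum
from typing import Sequence


class RetrievalGrade(Enum):
    USEFUL = "Useful"
    PARTIAL_USEFUL = "PartialUseful"
    USELESS = "Useless"


def grade_retrieval(covered: Sequence[bool]) -> RetrievalGrade:
    """Assumed rule: all covered -> Useful, some -> PartialUseful, none -> Useless."""
    if covered and all(covered):
        return RetrievalGrade.USEFUL
    if any(covered):
        return RetrievalGrade.PARTIAL_USEFUL
    # Also reached when no interpretations were inferred at all.
    return RetrievalGrade.USELESS


def should_use_rag(grade: RetrievalGrade) -> bool:
    """RAG is used unless retrieval was graded Useless."""
    return grade is not RetrievalGrade.USELESS
```

For example, `grade_retrieval([True, False])` yields `RetrievalGrade.PARTIAL_USEFUL`, which still routes to RAG generation.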

Adaptive Generator (Adaptive Generation)

Generate the final answer using either RAG or closed-book LLM based on verification

Model or implementation: LLM (GPT-3.5 or GPT-4)

Novel Architectural Elements

Pre-retrieval disambiguation logic (Pseudo-interpretations) to parallelize coverage instead of iterative loops
Conditional routing architecture (Adaptive Generation) that bypasses RAG generation if retrieval is verified as 'Useless'

Modeling

Base Model: GPT-3.5-Turbo-Instruct and GPT-4

Comparison to Prior Work

vs. ToC: Diva generates interpretations upfront (non-iterative) and selectively uses RAG, whereas ToC iteratively retrieves for every branch.
vs. Self-RAG: Diva focuses specifically on covering the interpretations of an ambiguous question using pseudo-interpretations, while Self-RAG relies on general-purpose reflection tokens to judge retrieval and generation quality.
vs. Vanilla RAG: Diva explicitly models multiple interpretations and verifies retrieval quality before generation.

Limitations

Relies on the LLM's ability to infer pseudo-interpretations without external context; if the LLM doesn't know the ambiguity exists, it cannot diversify.
Pruning mechanism depends on SentenceBERT similarity, which might miss subtle semantic connections.
Effectiveness of the 'Useless' fallback depends entirely on the LLM's parametric knowledge (closed-book performance).

Reproducibility

No code URL provided in the paper. Prompts are provided in Appendix D. Uses public datasets (ASQA, SituatedQA) and standard metrics (D-F1, EM).

📊 Experiments & Results

Evaluation Setup

Ambiguous QA using few-shot prompting on frozen LLMs.

Benchmarks:

ASQA (Ambiguous Question Answering)
SituatedQA (Context-dependent Question Answering)

Metrics:

Disambiguation F1 (D-F1)
Exact Match (EM)
Rouge-L
Statistical methodology: Not explicitly reported in the paper
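
For orientation, the answer-level metrics above build on SQuAD-style string matching. The sketch below shows only normalized exact match and token-level F1 between one predicted and one reference short answer; the helper names are hypothetical, and the step D-F1 needs to obtain a short answer per interpretation from the long response is omitted.

```python
import re
from collections import Counter
from typing import List


def normalize(text: str) -> List[str]:
    """Lowercase, drop punctuation and English articles, split on whitespace."""
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return [t for t in text.split() if t not in {"a", "an", "the"}]


def exact_match(prediction: str, reference: str) -> bool:
    """True if the two answers are identical after normalization."""
    return normalize(prediction) == normalize(reference)


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        return float(pred == ref)
    common = sum((Counter(pred) & Counter(ref)).values())
    if common == 0:
        return 0.0
    precision = common / len(pred)
    recall = common / len(ref)
    return 2 * precision * recall / (precision + recall)
```

As a worked case, `token_f1("the actor Rupert Grint", "Rupert Grint")` has precision 2/3 and recall 1, giving an F1 of 0.8.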

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Main comparison on ASQA dataset showing Diva's superiority over baselines.
ASQA	D-F1	46.3	48.2	+1.9
ASQA	D-F1	51.8	53.6	+1.8
Efficiency comparison demonstrating substantial speedups over the iterative ToC approach.
ASQA	Inference Time (sec/query)	15.5	5.5	-10.0
Ablation studies validating the contributions of RD and AG components.
ASQA	D-F1	47.1	48.2	+1.1
ASQA	D-F1	45.1	48.2	+3.1
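
The Δ column and the efficiency claims can be checked with a few lines of arithmetic. The helper names below are illustrative; the numbers are taken directly from the rows above.

```python
def delta(baseline: float, ours: float) -> float:
    """Absolute change, rounded to one decimal as in the table."""
    return round(ours - baseline, 1)


def speedup(baseline_sec: float, ours_sec: float) -> float:
    """How many times faster the new system is."""
    return baseline_sec / ours_sec


def relative_reduction(baseline: float, ours: float) -> float:
    """Fraction of the baseline cost that was removed."""
    return (baseline - ours) / baseline


print(delta(46.3, 48.2))                             # 1.9
print(delta(51.8, 53.6))                             # 1.8
print(delta(15.5, 5.5))                              # -10.0
print(round(speedup(15.5, 5.5), 2))                  # 2.82
print(round(100 * relative_reduction(15.5, 5.5), 1)) # 64.5
```

The inference-time row therefore corresponds to roughly a 2.8x speedup, or about a 64.5% latency reduction per query.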

Experiment Figures

Analysis of retrieval quality and its impact on RAG performance.

Efficiency comparison (Token Usage) between Vanilla RAG, ToC, and Diva.

Main Takeaways

Diva consistently outperforms iterative baselines (ToC) and single-step RAG methods across varying LLM backbones (GPT-3.5, GPT-4).
The Adaptive Generation (AG) module effectively identifies 'Useless' retrieval scenarios, switching to closed-book generation which yields better results than using noisy passages.
Efficiency gains are significant: Diva reduces token consumption and latency by ~60-70% compared to ToC because it avoids multi-step iterative retrieval loops.

📚 Prerequisite Knowledge

Prerequisites

Retrieval-Augmented Generation (RAG)
Open-domain Question Answering
Sentence Embeddings (e.g., SentenceBERT)

Key Terms

D-F1: Disambiguation F1 score—a metric evaluating how well the generated answer covers all unique disambiguated answers for an ambiguous question

Pseudo-interpretations: Inferred potential meanings of an ambiguous question generated by an LLM to guide diverse retrieval

Iterative RAG: A RAG approach that repeatedly retrieves and generates to refine answers, often at high computational cost

ToC: Tree of Clarifications—a state-of-the-art iterative RAG method that builds a tree of disambiguations

ColBERT: A dense retrieval model that uses late interaction to match query and document tokens

SentenceBERT: A modification of the BERT network to derive semantically meaningful sentence embeddings that can be compared using cosine-similarity

ASQA: Answer Summaries for Questions which are Ambiguous, a long-form QA dataset in which each ambiguous question requires an answer covering its multiple interpretations

SituatedQA: A QA dataset where answers depend on context (time, location), introducing ambiguity