QPaug: Question and Passage Augmentation for Open-Domain Question Answering of LLMs

📝 Paper Summary

Modularized RAG pipeline
Query rewriting / query generation

QPaug improves Open-Domain QA by augmenting the original question with LLM-generated sub-questions to enhance retrieval, and augmenting retrieved contexts with LLM-generated factual passages to guide answer extraction.

Core Problem

Retrieval-Augmented Generation (RAG) often fails on complex/ambiguous questions because standard retrievers fetch irrelevant passages, and readers struggle when retrieved contexts are distracting or incomplete.

Why it matters:

Standard retrievers (like BM25 or dense retrieval) often miss relevant documents for complex multi-hop questions
LLMs have vast parametric knowledge that is often overridden or ignored when relying solely on potentially noisy retrieved contexts
Fine-tuning LLMs to handle retrieval noise is computationally expensive, and it is not possible at all for black-box APIs

Concrete Example: For the question 'Who is the spouse of the director of film Eden And After?', standard retrieval fetches irrelevant biographical info about actors. QPaug first decomposes this into 'Identify the Director' -> 'Research the Director's spouse', retrieves better documents, and self-generates a passage containing the correct spouse (Catherine Robbe-Grillet), enabling the correct answer.

Key Novelty

Dual-stage augmentation via In-Context Learning (QPaug)

Question Augmentation (Qaug): Uses Chain-of-Thought prompting to decompose a complex question into a plan of sub-questions, which are appended to the query to guide the retriever toward more relevant documents.
Passage Augmentation (Pgen): Explicitly prompts the LLM to generate a 'factual' passage from its own parametric knowledge (or output [NONE]), which is then treated as an additional context document alongside retrieved ones.

Architecture

The QPaug workflow: Question Augmentation (Step 2), Passage Retrieval (Step 3-1), Passage Self-Generation (Step 3-2), and Answer Prediction (Step 4).

Evaluation Highlights

Outperforms SuRE (previous SOTA) by 10.4% relative F1 on Natural Questions (40.4 → 44.6) and 34.2% relative F1 on HotpotQA (33.6 → 45.1) using GPT-3.5 and Contriever
Achieves large gains on multi-hop datasets: +34.2% Rouge improvement on 2WikiMultihopQA using GPT-4 with SBERT compared to no retrieval
Question augmentation alone improves retrieval Recall@10 by up to 30% with GPT-4 compared to standard retrieval

Breakthrough Assessment

8/10

Simple yet highly effective prompting strategy that harmonizes parametric and non-parametric knowledge. Significant gains on multi-hop benchmarks without requiring model fine-tuning.

⚙️ Technical Details

Problem Definition

Setting: Open-Domain Question Answering (ODQA) using a retrieve-and-read setup

Inputs: Natural language question q

Outputs: Predicted answer â

Pipeline Flow

Question Augmentation (Input → CoT decomposition → Augmented Query)
Passage Retrieval (Augmented Query → Vector Search → Top-K Passages)
Passage Self-Generation (Augmented Query → LLM Generation → Generated Passage)
Answer Prediction (Original Question + Retrieved Passages + Generated Passage → Final Answer)
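The four stages above can be sketched as a single function. This is a minimal illustration rather than the authors' implementation: `llm` and `retrieve` are hypothetical callables standing in for any chat model and any dense retriever, the prompt wording is invented for the example (the paper's own templates are in its appendix), and dropping a `[NONE]` passage before reading is an assumption.

```python
from typing import Callable, List

NONE_TOKEN = "[NONE]"  # marker the passage generator may emit instead of a passage


def qpaug_answer(
    question: str,
    llm: Callable[[str], str],                  # hypothetical: prompt -> completion
    retrieve: Callable[[str, int], List[str]],  # hypothetical: (query, k) -> passages
    top_k: int = 10,
) -> str:
    """Run the four pipeline stages in order and return the predicted answer."""
    # 1. Question augmentation: ask for a step-by-step plan and append it to the query.
    plan = llm(f"Break the question into step-by-step sub-questions.\nQuestion: {question}")
    augmented_query = f"{question}\n{plan}"

    # 2. Passage retrieval with the augmented query.
    passages = retrieve(augmented_query, top_k)

    # 3. Passage self-generation from parametric knowledge (may return [NONE]).
    generated = llm(
        f"Write a factual passage that helps answer the query, or output {NONE_TOKEN}.\n"
        f"Query: {augmented_query}"
    ).strip()
    if generated and generated != NONE_TOKEN:
        passages = passages + [generated]  # assumption: a [NONE] passage is simply dropped

    # 4. Answer prediction from the original question plus all passages.
    context = "\n\n".join(passages)
    return llm(f"{context}\n\nAnswer the question.\nQuestion: {question}")
```

Each call costs three LLM invocations (plan, passage, answer), which matches the higher inference cost listed under Limitations.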

System Modules

Question Augmenter

Decompose original question into sub-questions/plans using Zero-shot CoT

Model or implementation: GPT-3.5, GPT-4, or LLaMA-2-Chat (same as reader)

Retriever

Retrieve relevant documents using the augmented query

Model or implementation: Contriever, ANCE, or SBERT (dense retrievers)

Passage Generator (Generation)

Generate a synthetic passage from parametric knowledge to complement retrieval

Model or implementation: GPT-3.5, GPT-4, or LLaMA-2-Chat

Reader / Answer Predictor (Generation)

Generate the final answer using both retrieved and generated passages

Model or implementation: GPT-3.5, GPT-4, or LLaMA-2-Chat

Novel Architectural Elements

Dual-source context construction: The prompt for the final reader concatenates K retrieved passages with 1 self-generated passage (labeled 'Your Knowledge')
Repeated use of the same LLM for query planning (Qaug) and context creation (Pgen) before the final reading step
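The dual-source context can be illustrated with a small prompt builder. This is a sketch under assumptions: the 'Your Knowledge' label and the `[NONE]` marker come from the summary above, while the `Passage i:` numbering, the trailing `Question:` / `Answer:` layout, and the function name `build_reader_prompt` are hypothetical choices made for the example.

```python
from typing import List, Optional

NONE_TOKEN = "[NONE]"  # emitted by the passage generator when it has nothing to add


def build_reader_prompt(question: str, retrieved: List[str], generated: Optional[str]) -> str:
    """Concatenate K retrieved passages with at most one self-generated passage."""
    blocks = [f"Passage {i}: {p}" for i, p in enumerate(retrieved, start=1)]
    # Only add the parametric passage when the model actually produced one.
    if generated and generated.strip() != NONE_TOKEN:
        blocks.append(f"Your Knowledge: {generated.strip()}")
    context = "\n".join(blocks)
    return f"{context}\n\nQuestion: {question}\nAnswer:"
```

Keeping the generated passage under its own label lets the reader weigh retrieved evidence and parametric knowledge separately.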

Modeling

Base Model: Evaluated with LLaMA-2-7b-chat, GPT-3.5-turbo, and GPT-4

Comparison to Prior Work

vs. SuRE: QPaug augments the query (pre-retrieval) and adds parametric knowledge (post-retrieval) rather than summarizing retrieved text.
vs. Self-RAG: QPaug uses standard LLMs without fine-tuning specialized reflection tokens; it relies on explicit prompting for 'step-by-step' planning and 'factual' generation.
vs. RAPTOR: QPaug focuses on query decomposition and parametric augmentation rather than hierarchical summarization of the corpus [not cited in paper].

Limitations

Relies heavily on the LLM's capability to generate accurate sub-questions and factual passages; hallucinations in Pgen can still occur.
Inference cost is higher due to multiple LLM calls (decomposition, passage generation, final answer) compared to standard RAG.
The method was evaluated in a zero-shot setting; performance in few-shot or fine-tuned settings is not explored.

Reproducibility

Code: https://github.com/kmswin1/QPaug

Code is publicly available. Uses standard ODQA benchmarks (NQ, 2wiki, HotpotQA) and standard retrievers (SBERT, ANCE, Contriever). Prompt templates are provided in the appendix.

📊 Experiments & Results

Evaluation Setup

Zero-shot Open-Domain QA with retrieval from 21M Wikipedia passages

Benchmarks:

Natural Questions (NQ) (Single-hop QA)
2WikiMultihopQA (2wiki) (Multi-hop reasoning QA)
HotpotQA (Multi-hop reasoning QA)

Metrics:

Exact Match (EM) / Accuracy (not stated explicitly; implied by the Rouge-L and F1 reporting)
F1 score
Rouge-L
Statistical methodology: Not explicitly reported in the paper
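The summary does not spell out how EM and F1 are scored. A minimal sketch of the widely used SQuAD-style convention (lowercasing, stripping punctuation and articles, then comparing tokens) is shown below; whether the paper uses exactly this normalization is an assumption, and the function names are illustrative.

```python
import re
import string
from collections import Counter

_PUNCT = set(string.punctuation)


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in _PUNCT)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(pred: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(pred) == normalize(gold))


def f1_score(pred: str, gold: str) -> float:
    """Token-level F1 between a prediction and a single gold answer."""
    p, g = normalize(pred).split(), normalize(gold).split()
    if not p or not g:
        return float(p == g)  # both empty counts as a match
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)
```

With several gold answers per question, the usual practice is to take the maximum score over the gold set and then average over questions.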

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
QPaug outperforms the baseline methods (Chain-of-Thought, Rerank, Self-verification, SuRE) across all datasets using Contriever and GPT-3.5.
Natural Questions (NQ)	F1	40.4	44.6	+4.2
HotpotQA	F1	33.6	45.1	+11.5
2WikiMultihopQA	F1	32.6	35.5	+2.9
Ablation on Passage Generation (Pgen) shows that adding a self-generated passage consistently improves F1 scores, especially on multi-hop datasets where retrieval is difficult.
2WikiMultihopQA	F1	36.5	47.8	+11.3
Ablation on Question Augmentation (Qaug) shows substantial improvements in retrieval recall.
HotpotQA	Recall@10	47.47	62.08	+14.61
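The Δ column above reports absolute differences in points, whereas the percentages in Evaluation Highlights appear to correspond to relative gains over the baseline. A short check of that arithmetic on the first three F1 rows, with the helper name `gains` chosen only for illustration:

```python
# (baseline F1, QPaug F1) copied from the first three rows of the table
rows = {
    "Natural Questions (NQ)": (40.4, 44.6),
    "HotpotQA": (33.6, 45.1),
    "2WikiMultihopQA": (32.6, 35.5),
}


def gains(baseline: float, ours: float) -> tuple:
    """Return (absolute gain in points, relative gain in percent)."""
    absolute = round(ours - baseline, 2)
    relative = round(100 * (ours - baseline) / baseline, 1)
    return absolute, relative


for name, (baseline, ours) in rows.items():
    absolute, relative = gains(baseline, ours)
    print(f"{name}: +{absolute} points, +{relative}% relative")
```

The relative gains come out to 10.4% for NQ and 34.2% for HotpotQA, which match the headline figures, and 8.9% for 2WikiMultihopQA.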

Experiment Figures

Performance gains (Recall@K) of Question Augmentation (Qaug) using GPT-4 compared to a base Contriever.

F1-scores of QPaug vs Base RAG with varying numbers of top-K grounded passages on 2wiki.

Main Takeaways

QPaug consistently outperforms baselines across various retrievers (SBERT, ANCE, Contriever) and LLMs (GPT-3.5, GPT-4, LLaMA-2).
Question Augmentation (Qaug) is particularly effective for retrieval recall, decomposing complex questions to find evidence that standard queries miss.
Passage Generation (Pgen) acts as a crucial fallback; when retrieval fails (common in multi-hop), the LLM's parametric knowledge often recovers the correct answer.
The approach is most beneficial for multi-hop datasets (HotpotQA, 2wiki) where standard retrieval struggles the most.

📚 Prerequisite Knowledge

Prerequisites

Retrieval-Augmented Generation (RAG) pipeline basics
Chain-of-Thought (CoT) prompting
Dense Passage Retrieval (DPR) concepts

Key Terms

RAG: Retrieval-Augmented Generation—combining a retriever to find documents and a generator to answer questions based on them

ODQA: Open-Domain Question Answering—answering questions using a large collection of documents without a specific context provided upfront

Chain-of-Thought (CoT): Prompting technique that encourages LLMs to generate intermediate reasoning steps before the final answer

Parametric knowledge: Knowledge stored within the weights of the pre-trained Large Language Model itself

Non-parametric knowledge: External knowledge retrieved from a database (e.g., Wikipedia) during inference

MIPS: Maximum Inner Product Search—algorithm used to efficiently find the most similar vectors (documents) to a query vector

Recall@K: Evaluation metric measuring the proportion of relevant documents found in the top-K retrieved results
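The last two terms can be made concrete with a toy example: an exhaustive inner-product search followed by Recall@K as defined above. This is a sketch only; the function names are hypothetical, the vectors are made up, and real retrievers over millions of passages typically rely on an optimized or approximate index rather than a full scan.

```python
from typing import List, Sequence, Set


def inner_product(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def mips_top_k(query: Sequence[float], doc_vectors: List[Sequence[float]], k: int) -> List[int]:
    """Brute-force MIPS: score every document, return the indices of the k best."""
    scored = [(inner_product(query, d), i) for i, d in enumerate(doc_vectors)]
    scored.sort(key=lambda s: (-s[0], s[1]))  # highest score first, ties by index
    return [i for _, i in scored[:k]]


def recall_at_k(retrieved_ids: List[int], relevant_ids: Set[int], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(retrieved_ids[:k]) & relevant_ids)
    return hits / len(relevant_ids)


docs = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]  # toy document embeddings
top = mips_top_k([1.0, 0.0], docs, k=2)
print(top, recall_at_k(top, {0, 1}, k=2))
```

Here documents 0 and 2 score highest, so with documents 0 and 1 marked relevant the recall at 2 is 0.5.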