Analyzing and Improving Coherence of Large Language Models in Question Answering

📝 Paper Summary

Modularized RAG pipeline
Query rewriting / query generation

The paper identifies that LLMs often fail to answer semantically equivalent questions consistently and proposes a Retrieval-Augmented Generation method using retrieved similar questions to improve both accuracy and coherence.

Core Problem

LLMs suffer from instability and lack of coherence, meaning they often generate different (and sometimes incorrect) outputs when given diverse but semantically equivalent lexical variations of the same question.

Why it matters:

Inconsistency erodes user trust; a model should answer 'What is Italy's capital?' and 'Name the capital of Italy' identically.
Previous black-box prompt-engineering attempts to fix stability are ad hoc; a principled approach is needed to help models access their parametric knowledge reliably.
Standard RAG retrieves documents, but this paper argues retrieving *similar questions* triggers different semantic patterns that help the model 'understand' the request better.

Concrete Example: A model might correctly answer 'How old was jacqueline wilson when her first book got published?' but fail on 'What was the age of Jacqueline Wilson when she experienced the publication of her initial book?', indicating a failure to access knowledge due to phrasing.

Key Novelty

Question-RAG (q-RAG) / Question Prompting

Instead of retrieving documents (standard RAG), retrieve semantically equivalent questions (Support Questions) from a large pre-computed index.
Feed these retrieved questions (and optionally their pre-computed answers) into the LLM context to help it 'disambiguate' the intent and trigger the correct parametric knowledge.
Use redundant information (question variations) to stabilize the model's understanding rather than just adding missing facts.

Architecture

The end-to-end QA pipeline illustrating the Question-RAG approach.

Evaluation Highlights

Question-RAG leads to a 4-8 percentage point improvement in end-to-end performance on factual QA tasks compared to standard prompting.
On PopQA-TP, coherence (semantic similarity of answers across variations) improved significantly: Mixtral-8x7B went from 53.21 to 81.21.
Llama2-70b accuracy on PopQA-TP increased from 54.20% (base) to 62.73% (question prompt).

Breakthrough Assessment

7/10

Simple yet effective insight: retrieving similar questions helps LLM 'understanding' more than just retrieving facts. Strong empirical gains on coherence, though the method relies on a massive pre-indexed question database.

⚙️ Technical Details

Problem Definition

Setting: Open-domain Question Answering focusing on robustness to lexical variation.

Inputs: Natural language question q

Outputs: Answer a

Pipeline Flow

Question Retrieval System (QRS) retrieves k similar questions
Prompt Construction (combines original query + retrieved questions)
LLM Generation (produces final answer)
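
The three stages above can be sketched as a small function that takes the retriever and the generator as plain callables. This is only an illustration: the prompt wording, the `SupportQuestion` type, and the `toy_retrieve` / `toy_generate` stubs are assumptions made for the example, not the paper's actual template or components.

```python
from typing import Callable, List, Optional, Tuple

# A support question paired with an optional pre-computed answer.
SupportQuestion = Tuple[str, Optional[str]]


def build_prompt(question: str, support: List[SupportQuestion]) -> str:
    """Combine the original query with retrieved support questions.

    The template text is hypothetical; the paper's exact prompt is not
    reproduced here.
    """
    lines = ["Here are questions similar to the one you must answer:"]
    for i, (sq, answer) in enumerate(support, start=1):
        lines.append(f"{i}. {sq}")
        if answer is not None:  # optional Q/A variant of the prompt
            lines.append(f"   Answer: {answer}")
    lines.append("")
    lines.append(f"Question: {question}")
    lines.append("Answer:")
    return "\n".join(lines)


def answer_question(
    question: str,
    retrieve: Callable[[str, int], List[SupportQuestion]],
    generate: Callable[[str], str],
    k: int = 3,
) -> str:
    """Run retrieval, prompt construction, and generation in sequence."""
    support = retrieve(question, k)           # stage 1: question retrieval
    prompt = build_prompt(question, support)  # stage 2: prompt construction
    return generate(prompt)                   # stage 3: LLM generation


def toy_retrieve(question: str, k: int) -> List[SupportQuestion]:
    """Stub retriever returning a fixed pool (assumption for the demo)."""
    pool: List[SupportQuestion] = [
        ("Name the capital of Italy", "Rome"),
        ("Which city is Italy's capital?", None),
    ]
    return pool[:k]


def toy_generate(prompt: str) -> str:
    """Stub standing in for an LLM call; always returns a fixed string."""
    return "Rome"


if __name__ == "__main__":
    print(build_prompt("What is Italy's capital?", toy_retrieve("", 2)))
    print(answer_question("What is Italy's capital?", toy_retrieve, toy_generate))
```

Keeping the retriever and generator as injected callables means the same skeleton can be exercised with stubs, as here, or wired to a real index and model.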

System Modules

Question Retrieval System (QRS)

Find Support Questions (SQs) semantically similar to the input

Model or implementation: Fine-tuned MiniLM-12L-v2 (bi-encoder)
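
A minimal sketch of bi-encoder style retrieval follows: index questions are embedded once up front, the incoming query is embedded independently, and candidates are ranked by cosine similarity. The `encode` function here is a toy bag-of-words stand-in and the `QuestionIndex` class is a hypothetical name; the actual system uses a fine-tuned MiniLM encoder, which this example does not load.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple

Vector = Dict[str, float]


def encode(text: str) -> Vector:
    """Toy stand-in for a sentence encoder: unit-length bag of words.

    A learned encoder would map paraphrases close together; this lexical
    version only rewards shared tokens.
    """
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values()))
    if norm == 0:
        return {}
    return {tok: c / norm for tok, c in counts.items()}


def cosine(u: Vector, v: Vector) -> float:
    """Dot product of two unit vectors, i.e. their cosine similarity."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the smaller vector
    return sum(w * v.get(tok, 0.0) for tok, w in u.items())


class QuestionIndex:
    """Holds questions and their embeddings, computed once at build time."""

    def __init__(self, questions: List[str]) -> None:
        self.questions = list(questions)
        self.vectors = [encode(q) for q in self.questions]

    def search(self, query: str, k: int = 3) -> List[Tuple[str, float]]:
        """Return the k questions most similar to the query, best first."""
        qv = encode(query)  # the query is encoded independently of the index
        scored = [(q, cosine(qv, v)) for q, v in zip(self.questions, self.vectors)]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:k]


if __name__ == "__main__":
    index = QuestionIndex([
        "Name the capital of Italy",
        "Who wrote Hamlet?",
        "What is the capital of France?",
    ])
    for question, score in index.search("capital of Italy", k=2):
        print(f"{score:.3f}  {question}")
```

At the scale of tens of millions of questions, the linear scan in `search` would normally be replaced by an approximate nearest-neighbour index; the scan is kept here only for clarity.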

LLM Generator

Generate answer using original question and retrieved SQs context

Model or implementation: Mixtral-8x7B / Llama2-70b / Smaug-72b / Phi-3

Novel Architectural Elements

Inclusion of retrieved *questions* (lexical variations) into the prompt context specifically to trigger parametric knowledge, rather than just factual documents.

Modeling

Base Model: Evaluated on multiple models: Mixtral-8x7B, Llama2-70b, Smaug-72b, Phi-3

Training Method: Zero-shot prompting with retrieved context (Inference only)

Compute: 8xV100 32GB GPUs used for running LLMs.

Comparison to Prior Work

vs. Standard RAG: Retrieves *questions* (SQs) to stabilize intent understanding, whereas standard RAG retrieves *documents* to fill knowledge gaps.
vs. CoT: Shows that external retrieval of SQs outperforms internal CoT-style generation of similar questions.
vs. Zero-shot: Adds redundancy via SQs to improve coherence.

Limitations

Relies on a massive pre-computed index of 38M questions; maintenance and coverage of this index are non-trivial.
Smaug-72b showed performance drops on the Question Ranking (QR) dataset with this method, suggesting that not all fine-tuned models benefit equally.
Coherence metric used (semantic similarity of answers) may not fully capture correctness if all answers are consistently wrong.
Analysis limited to models up to 72B parameters; did not test GPT-4 or Gemini.

📊 Experiments & Results

Evaluation Setup

Zero-shot QA with retrieval augmentation.

Benchmarks:

PopQA-TP (Entity-centric QA with paraphrases)
Question Ranking (QR) (Question similarity/ranking)
Open Domain QA Suite (General QA)

Metrics:

Accuracy (Exact Match / Human Annotated Correctness)
Coherence (average embedding similarity between answers to equivalent questions)
Naturalness (Human evaluation)
Statistical methodology: Human annotation via Amazon Mechanical Turk for Open Domain QA; Exact Match for PopQA-TP.
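
The two automatic metrics above can be sketched as follows. Both definitions are assumptions made for illustration: exact match is taken as string equality after simple normalization, and coherence as the mean pairwise cosine similarity of answer embeddings within one cluster of equivalent questions. The paper's exact normalization, embedding model, and averaging scheme may differ; `embed` is a caller-supplied function.

```python
import math
import re
from itertools import combinations
from typing import Callable, List, Sequence


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())


def exact_match(prediction: str, gold_answers: Sequence[str]) -> bool:
    """True if the normalized prediction equals any normalized gold answer."""
    pred = normalize_answer(prediction)
    return any(pred == normalize_answer(gold) for gold in gold_answers)


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two dense vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def cluster_coherence(
    answers: List[str],
    embed: Callable[[str], Sequence[float]],
) -> float:
    """Mean pairwise similarity of the answers given to equivalent questions."""
    if len(answers) < 2:
        return 1.0  # a single answer is trivially consistent with itself
    vectors = [embed(a) for a in answers]
    pairs = list(combinations(vectors, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


if __name__ == "__main__":
    # Hand-written two-dimensional embeddings, purely for the demo.
    toy_vectors = {"Rome": [1.0, 0.0], "Paris": [0.0, 1.0]}
    print(exact_match("Rome.", ["rome"]))
    print(cluster_coherence(["Rome", "Rome", "Paris"], toy_vectors.__getitem__))
```

Note that this coherence score measures agreement only: a cluster of identical wrong answers still scores 1.0, which is why it is reported alongside accuracy.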

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Impact of Question Prompting on PopQA-TP accuracy and coherence across multiple models.
PopQA-TP	EM (Exact Match)	40.29	46.95	+6.66
PopQA-TP (Mixtral-8x7B)	Coherence	53.21	81.21	+28.00
PopQA-TP (Llama2-70b)	EM (Exact Match)	54.20	62.73	+8.53
Comparison of retrieval sources (Questions vs Paragraphs) on Open Domain QA (Mixtral).
Open Domain QA (Avg across NQ, Quora, PAQ, TriviaQA)	Correctness	0.72	0.74	+0.02
Comparison of Support Question generation methods (Retrieval vs Generation).
Open Domain QA	Correctness	78.0	79.3	+1.3

Experiment Figures

Histograms showing the distribution of correct answers per cluster (size 5) for different models on PopQA-TP.
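
The per-cluster histogram described above could be computed along these lines. This is a sketch under the assumption that each cluster holds five paraphrases of one question and that each answer has already been marked correct or incorrect; the function name and input layout are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def correct_per_cluster_histogram(
    clusters: List[List[bool]],
    cluster_size: int = 5,
) -> Dict[int, int]:
    """Count how many clusters have 0, 1, ..., cluster_size correct answers.

    Each inner list holds one correctness flag per paraphrase in a cluster.
    """
    hist: Counter = Counter()
    for flags in clusters:
        if len(flags) != cluster_size:
            raise ValueError(
                f"expected {cluster_size} answers per cluster, got {len(flags)}"
            )
        hist[sum(flags)] += 1  # True counts as 1, False as 0
    # Include empty bins so every possible count appears in the result.
    return {n: hist.get(n, 0) for n in range(cluster_size + 1)}


if __name__ == "__main__":
    example = [
        [True, True, True, True, True],
        [False, False, False, False, False],
        [True, True, False, False, True],
    ]
    print(correct_per_cluster_histogram(example))
```

A model that answers every paraphrase in a cluster identically would place all clusters in the 0 or 5 bins, so mass in the middle bins is a direct sign of inconsistency.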

Accuracy of Mixtral on Open Domain QA as the number of retrieved items (k) increases.

Main Takeaways

Retrieving similar questions (q-RAG) improves coherence and accuracy more than standard paragraph retrieval for certain factual QA tasks.
LLMs generally struggle to be coherent (that is, to give the same answer to rephrased questions); providing question variations in the context mitigates this.
Retrieved questions are qualitatively different from LLM-generated questions; retrieved ones introduce new facets/vocabulary that trigger better recall, while generated ones are often too synonymous/redundant.
Adding the *answers* to the retrieved questions (Q/A prompt) provides a further boost, combining the benefits of intent disambiguation and factual grounding.

📚 Prerequisite Knowledge

Prerequisites

Retrieval-Augmented Generation (RAG)
Large Language Model (LLM) prompting
Semantic similarity / Embedding models

Key Terms

Coherence: The ability of a model to generate semantically equivalent outputs when receiving diverse yet semantically equivalent input variations.

Support Questions (SQs): Semantically equivalent or similar questions retrieved from an external index to aid the LLM in understanding the input query.

PopQA-TP: A dataset of 118K paraphrased questions used to benchmark semantic consistency across question variations.

Bi-encoder: A model architecture where two separate neural networks encode the input and candidate independently, allowing efficient pre-computation of embeddings for retrieval.

Parametric knowledge: Information stored implicitly in the weights of the pre-trained neural network, as opposed to external knowledge sources.