UniRAG: A Unified RAG Framework for Knowledge-Intensive Queries with Decomposition, Break-Down Reasoning, and Iterative Rewriting

📝 Paper Summary

Modularized RAG pipeline
Complex question answering

UniRAG unifies entity-based query decomposition, granular reasoning that verifies sub-facts independently, and iterative rewriting to handle complex knowledge-intensive queries better than standard RAG approaches.

Core Problem

Existing RAG methods struggle with complex queries because they often propagate early reasoning errors, fail to verify independent constraints simultaneously, or rely on incomplete retrieval without self-correction.

Why it matters:

Standard retrieval often misses nuanced information required for multi-hop or domain-specific questions (e.g., biomedical or legal)
Current reasoning methods (like Chain-of-Thought) can hallucinate steps or fail when early retrieval is noisy
Reliability is critical for knowledge-intensive tasks where answers must be fully grounded in evidence, not just plausible-sounding

Concrete Example: For a multi-hop question requiring two facts (e.g., 'Who is the director of the movie starring X?'), a standard retriever might find documents about X but miss the director. Without explicit verification of the missing 'director' constraint, the model might guess or hallucinate an answer based on partial context.

Key Novelty

Unified Decomposition-Reasoning-Rewriting Framework (UniRAG)

Decomposes queries based on extracted named entities (via FLERT) to ensure sub-queries focus on specific information needs
Uses a 'Let's Break It Down' prompting strategy that forces the LLM to verify each retrieved sub-fact independently before synthesizing an answer
Implements a self-correcting loop where, if evidence is insufficient, the query is rewritten based specifically on the identified reasoning gaps

Architecture

The complete UniRAG workflow, illustrating the three main phases: Entity-Grounded Decomposition, Break-Down Reasoning, and Iterative Rewriting.

Evaluation Highlights

+27.8 points Exact Match on HotPotQA (multi-hop) using LLaMA-3.1-8B over the best baseline (ITER-RETGEN), from 0.488 to 0.766
+9.2 points Accuracy on MedQA (biomedical) using LLaMA-3.1-8B over standard Chain-of-Thought prompting, from 0.628 to 0.720
Achieves 76.6% Exact Match on HotPotQA with LLaMA-3.1-8B, surpassing even GPT-3.5-Turbo's performance with standard RAG

Breakthrough Assessment

7/10

Strong empirical gains across diverse benchmarks (multi-hop, biomedical, commonsense). The individual components (decomposition, rewriting) are already known; the contribution is their unified integration and the 'break-down' verification strategy, which together account for the reported gains.

⚙️ Technical Details

Problem Definition

Setting: Retrieval-Augmented Generation for knowledge-intensive queries (multi-hop, fact-verification, biomedical)

Inputs: Natural language query q

Outputs: Final answer generated after retrieval, reasoning, and optional iterative refinement

Pipeline Flow

Input Processing: Entity Extraction → Sub-query Generation
Retrieval & Selection: Retrieval (Original + Sub-queries) → Semantic Reranking
Reasoning & Verification: Break-Down Reasoning → RAGAs Confidence Check
Refinement (Conditional): Query Rewriting → Loop back to Decomposition (if low confidence)
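
The four stages above can be read as a single control loop. The following is a minimal sketch under stated assumptions, not the authors' implementation: the Stages container, the unirag_answer function, and the min_confidence cutoff are hypothetical names introduced for illustration, and each stage is an injected callable standing in for a real model (NER, LLM, retriever, reranker, evaluator).

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Stages:
    """Hypothetical stage interfaces; each callable stands in for a real model."""
    extract_entities: Callable[[str], List[str]]        # NER step
    decompose: Callable[[str, List[str]], List[str]]    # query + entities -> sub-queries
    retrieve: Callable[[str], List[str]]                # query -> candidate documents
    rerank: Callable[[str, List[str]], List[str]]       # keep semantically relevant documents
    reason: Callable[[str, List[str], List[str]], str]  # break-down reasoning -> answer
    confidence: Callable[[str, str, List[str]], float]  # evaluator score, assumed in [0, 1]
    rewrite: Callable[[str, str, List[str]], str]       # rewrite the query from reasoning gaps


def unirag_answer(query: str, stages: Stages,
                  min_confidence: float = 0.5,  # assumed cutoff, not from the paper
                  max_iterations: int = 3) -> Tuple[str, int]:
    """Run decomposition, retrieval, reasoning, and conditional rewriting."""
    current, answer = query, ""
    for iteration in range(1, max_iterations + 1):
        entities = stages.extract_entities(current)
        sub_queries = stages.decompose(current, entities)
        # Retrieve for the query and every sub-query, dropping duplicates.
        docs: List[str] = []
        for q in [current, *sub_queries]:
            for doc in stages.retrieve(q):
                if doc not in docs:
                    docs.append(doc)
        docs = stages.rerank(current, docs)
        answer = stages.reason(current, sub_queries, docs)
        # Confidence gate: stop when the evaluator is satisfied.
        if stages.confidence(query, answer, docs) >= min_confidence:
            return answer, iteration
        current = stages.rewrite(current, answer, docs)
    return answer, max_iterations
```

Passing the stages as callables keeps the loop independent of any particular model, so each stage can be swapped or mocked separately.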

System Modules

Entity Extractor (Input Processing)

Identify core entities to ground the decomposition process

Model or implementation: FLERT (NER model)

Decomposer (Input Processing)

Generate focused sub-queries based on extracted entities

Model or implementation: Main LLM (e.g., LLaMA-3.1-8B)

Retriever & Reranker

Fetch documents for q and all sq_i, then filter by semantic relevance

Model or implementation: GTE (Retriever) + mGTE (Reranker)

Reasoning Module

Verify each sub-fact independently and synthesize answer

Model or implementation: Main LLM (prompted with 'Let's Break It Down')

Evaluator / Rewriter

Assess answer confidence; if low, rewrite query based on missing info

Model or implementation: RAGAs (Evaluator) + Main LLM (Rewriter)

Novel Architectural Elements

Entity-grounded decomposition pipeline using a specialized external NER model (FLERT) before LLM processing
Integration of RAGAs (automated evaluation metric) as a runtime decision gate for iterative rewriting loops
Break-down reasoning prompt structure specifically designed for parallel constraint verification rather than sequential reasoning
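
To make the parallel-verification idea concrete, here is an illustrative prompt builder. The wording of the template is an assumption; the paper's exact 'Let's Break It Down' prompt is not reproduced in this summary, and build_break_down_prompt is a hypothetical helper name.

```python
from typing import List


def build_break_down_prompt(question: str, sub_queries: List[str],
                            passages: List[str]) -> str:
    """Assemble a prompt that asks for one independent verdict per sub-query.

    Illustrative wording only; not the paper's actual prompt text.
    """
    # Number the passages so the model can cite them by id.
    context = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    # One check line per sub-query; none depends on the verdict of another.
    checks = "\n".join(
        f"{i}. {sq} -> SUPPORTED / NOT SUPPORTED, cite passage ids"
        for i, sq in enumerate(sub_queries, start=1)
    )
    return (
        f"Passages:\n{context}\n\n"
        f"Question: {question}\n\n"
        "Let's break it down. Check each point on its own, using only the passages:\n"
        f"{checks}\n\n"
        "Then give the final answer, or say which points lack evidence."
    )
```

Because every check is listed separately, an unsupported point stays visible in the output and can be used to decide what a rewritten query should ask for.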

Modeling

Base Model: Evaluated on multiple models: LLaMA-3.1-8B, Qwen-2.5-7B, Gemma-2-9B (White-box); GPT-3.5-Turbo, GPT-4o, Gemini-1.5-Flash (Black-box)

Key Hyperparameters:

reranking_threshold_theta: 0.8 to 0.9
sub_query_count: Optimal is 8 (for multi-hop), 2 (for fact verification)
max_iterations: Performance peaks around 2-3 iterations
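
As a sketch of how reranking_threshold_theta might act as a filter, the snippet below keeps only documents whose similarity to the query reaches theta. It assumes the relevance score is cosine similarity between embedding vectors; the paper's reranker score may be computed differently. The names cosine and filter_by_threshold are hypothetical, and the plain float lists stand in for real embedding outputs.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def filter_by_threshold(query_vec: Sequence[float],
                        doc_vecs: List[Tuple[str, Sequence[float]]],
                        theta: float = 0.8) -> List[Tuple[str, float]]:
    """Return (doc_id, score) pairs with score >= theta, best first."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs]
    kept = [(doc_id, score) for doc_id, score in scored if score >= theta]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)
```

Raising theta toward 0.9 trades recall for precision: fewer documents reach the reasoning step, but each is more closely related to the query.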

Compute: Inference-only framework; requires API access or local hosting of LLMs + GTE/FLERT models. No training compute reported.

Comparison to Prior Work

vs. ITER-RETGEN: UniRAG uses entity-grounded decomposition and explicit 'break-down' verification rather than just using generation history for retrieval
vs. BlendFilter: UniRAG employs a feedback loop with query rewriting based on reasoning gaps, whereas BlendFilter focuses on filtering noisy knowledge upfront
vs. Self-Ask: UniRAG verifies retrieved sub-facts independently using a specific 'break-down' prompt rather than strictly sequential follow-ups
vs. DSPy [not cited in paper]: UniRAG is a fixed prompting framework, whereas DSPy compiles and optimizes prompts automatically

Limitations

Relies on external RAGAs assessment for decision making, which may introduce biases or sensitivity to thresholds
Iterative process introduces latency compared to single-pass RAG
Performance depends on the quality of the external NER model (FLERT) and retriever

📊 Experiments & Results

Evaluation Setup

Open-domain QA and fact verification using retrieved knowledge (Wikipedia 2017 dump or PubMed abstracts)

Benchmarks:

HotPotQA (Multi-hop QA)
2WikiMultihopQA (Multi-hop QA)
StrategyQA (Commonsense reasoning)
MedMCQA (Biomedical QA)
MedQA (Biomedical QA)
SciFact (Fact verification)
FEVER (Fact verification)

Metrics:

Exact Match (EM)
F1 score
Accuracy (ACC)
Statistical methodology: Not explicitly reported in the paper
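
For reference, Exact Match and token-level F1 are commonly computed as below. This sketch assumes SQuAD-style answer normalization (lowercasing, removing punctuation and English articles); the summary does not state which normalization the paper applies, so treat the details as an assumption.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, truth: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(truth))


def f1_score(prediction: str, truth: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    pred = normalize(prediction).split()
    gold = normalize(truth).split()
    if not pred or not gold:
        return float(pred == gold)
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)
```

F1 gives partial credit when a prediction contains the gold answer plus extra tokens, which is why it is usually reported alongside EM.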

Key Results

Performance comparisons using LLaMA-3.1-8B show UniRAG significantly outperforming baselines on complex multi-hop datasets.
Benchmark	Metric	Baseline	This Paper	Δ
HotPotQA	Exact Match (EM)	0.488	0.766	+0.278
2WikiMultihopQA	Exact Match (EM)	0.412	0.778	+0.366

Performance comparisons using GPT-3.5-Turbo show consistent improvements, though margins are tighter than with LLaMA.
Benchmark	Metric	Baseline	This Paper	Δ
HotPotQA	Exact Match (EM)	0.498	0.584	+0.086
StrategyQA	Accuracy	0.700	0.730	+0.030

Prompting strategy ablation: 'Let's Break It Down' vs standard Chain-of-Thought (CoT) using LLaMA-3.1-8B.
Benchmark	Metric	Baseline	This Paper	Δ
MedQA	Accuracy	0.628	0.720	+0.092
FEVER	Accuracy	0.470	0.562	+0.092
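
The Δ column is an absolute difference on a 0 to 1 scale, so +0.278 means 27.8 percentage points, which is not the same as a 27.8% relative gain. The short check below recomputes both quantities from values copied out of the table; the gains helper is a hypothetical name introduced here.

```python
from typing import Tuple


def gains(baseline: float, ours: float) -> Tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute_points = (ours - baseline) * 100
    relative_percent = (ours - baseline) / baseline * 100
    return round(absolute_points, 1), round(relative_percent, 1)


# (benchmark, setting, baseline score, UniRAG score), copied from the table.
rows = [
    ("HotPotQA", "LLaMA-3.1-8B", 0.488, 0.766),
    ("2WikiMultihopQA", "LLaMA-3.1-8B", 0.412, 0.778),
    ("HotPotQA", "GPT-3.5-Turbo", 0.498, 0.584),
    ("StrategyQA", "GPT-3.5-Turbo", 0.700, 0.730),
    ("MedQA", "CoT ablation", 0.628, 0.720),
    ("FEVER", "CoT ablation", 0.470, 0.562),
]

for name, setting, baseline, ours in rows:
    points, percent = gains(baseline, ours)
    print(f"{name} ({setting}): +{points} points, +{percent}% relative")
```

Stating which of the two is meant avoids ambiguity when a headline figure is given with a bare percent sign.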

Experiment Figures

Performance (EM and F1) on HotPotQA and 2WikiMultihopQA as a function of the number of sub-queries (2 to 10).

Main Takeaways

UniRAG consistently outperforms strong baselines (ITER-RETGEN, BlendFilter) across both open-source (LLaMA-3.1, Qwen) and closed-source (GPT-3.5/4o) models.
The 'Let's Break It Down' prompting strategy is superior to standard CoT for knowledge-intensive and fact-verification tasks (MedQA, FEVER), likely due to independent constraint verification.
Module-wise ablation confirms that the full combination of Decomposition, Rewriting, and Reranking yields the highest performance, with Decomposition providing the largest single jump.
Optimal sub-query count varies by task: ~8 for multi-hop reasoning (HotPotQA) vs ~2 for fact verification (SciFact), aligning with task complexity.

📚 Prerequisite Knowledge

Prerequisites

Retrieval-Augmented Generation (RAG) architectures
Named Entity Recognition (NER)
Prompt engineering techniques (Chain-of-Thought, ReAct)
Semantic similarity and reranking

Key Terms

FLERT: Document-level features for Named Entity Recognition—a specific model used here to extract entities for query decomposition

mGTE: Generalized Text Embeddings (multilingual)—a model used here for semantic reranking of retrieved documents

RAGAs: Retrieval Augmented Generation Assessment—a framework used here as a gatekeeper to score answer confidence (faithfulness, relevancy) and trigger rewriting if scores are low

Chain-of-Thought (CoT): A prompting technique encouraging models to generate intermediate reasoning steps

Break-Down Reasoning: The paper's specific prompting strategy instructing the LLM to verify each query constraint independently rather than sequentially

Iterative retrieval: A process where the system repeatedly searches for information, refining the query based on previous results

Exact Match (EM): A metric that scores an answer as correct only if it is identical to the ground truth string, typically after light normalization such as lowercasing and stripping punctuation

Named Entity Recognition (NER): The task of identifying and classifying named entities (such as people, organizations, and locations) in text