RQ-RAG: Learning to Refine Queries for Retrieval Augmented Generation

📝 Paper Summary

Modularized RAG pipeline
Query rewriting / query generation

RQ-RAG enhances retrieval-augmented generation by training a 7B model to explicitly rewrite, decompose, or disambiguate queries before searching, and by choosing among the resulting refinement paths via tree decoding.

Core Problem

Standard RAG methods often fail on ambiguous or complex queries because they use the original query indiscriminately for retrieval, and existing datasets lack explicit training for query refinement strategies.

Why it matters:

Indiscriminate retrieval for simple queries (like greetings) adds noise and degrades response quality
Complex queries requiring multi-hop reasoning cannot be answered by a single search using the original text
Ambiguous user intents require clarification or disambiguation before retrieval to provide accurate answers

Concrete Example: For a complex query, simply searching with the original text often fails to retrieve adequate information. Instead, the model should break it down into sub-queries (e.g., 'What is the population of X?' then 'What is the population of Y?'), search for those components, and synthesize the answer.

Key Novelty

Learning to Refine Query (RQ-RAG)

Trains a single Llama-2-7B model to dynamically choose between rewriting, decomposing, or disambiguating a query (or skipping retrieval) using special control tokens
Constructs a training dataset by using ChatGPT to generate refined queries and, crucially, regenerating the target answers based on the actual retrieval results to ensure context alignment
Uses a tree-decoding strategy at inference time to explore different refinement paths, selecting the best one based on model perplexity or confidence
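
To make the control-token idea concrete, the following is a minimal sketch of how a generated string could be mapped to an action by its leading token. Only the idea of one special token per action comes from the summary; the token spellings other than `SPECIAL_rewrite`, the `Action` enum, and `parse_action` are hypothetical names introduced for illustration.

```python
from enum import Enum
from typing import Tuple


class Action(Enum):
    REWRITE = "rewrite"
    DECOMPOSE = "decompose"
    DISAMBIGUATE = "disambiguate"
    ANSWER = "answer"  # skip retrieval and answer directly


# Hypothetical token spellings (assumption): one control token per action.
CONTROL_TOKENS = {
    "SPECIAL_rewrite": Action.REWRITE,
    "SPECIAL_decompose": Action.DECOMPOSE,
    "SPECIAL_disambiguate": Action.DISAMBIGUATE,
    "SPECIAL_answer": Action.ANSWER,
}


def parse_action(generated: str) -> Tuple[Action, str]:
    """Split generated text into (action, payload) using its leading control token."""
    text = generated.lstrip()
    for token, action in CONTROL_TOKENS.items():
        if text.startswith(token):
            return action, text[len(token):].strip()
    raise ValueError("generated text does not start with a known control token")


action, payload = parse_action("SPECIAL_decompose What is the population of X?")
```

In this sketch the payload after a refinement token would be sent to the retriever, while the payload after the answer token would be returned to the user.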

Evaluation Highlights

+1.9 percentage points average accuracy improvement over Self-RAG (previous SOTA) on three single-hop QA datasets (Arc-Challenge, PopQA, OpenbookQA) using a 7B model
Outperforms Self-RAG on multi-hop datasets as well; e.g., +4.3 EM points on HotpotQA (27.6 vs. 23.3)
Shows a high potential upper bound: if an oracle selected the best trajectory, accuracy would reach up to 63.6% on Arc-Challenge, versus 52.7% with the current selection strategies
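
The oracle upper bound can be read as "a question counts as correct if any explored trajectory answers it correctly". The sketch below contrasts that with the accuracy of a concrete selector on toy data. The function names and the toy answers are assumptions for illustration and are not taken from the paper.

```python
from typing import List, Sequence


def selected_accuracy(
    answers_per_question: Sequence[Sequence[str]],
    gold: Sequence[str],
    chosen_index: Sequence[int],
) -> float:
    """Accuracy when a selector picks exactly one trajectory per question."""
    hits = sum(
        answers[i] == g
        for answers, g, i in zip(answers_per_question, gold, chosen_index)
    )
    return hits / len(gold)


def oracle_accuracy(
    answers_per_question: Sequence[Sequence[str]],
    gold: Sequence[str],
) -> float:
    """Upper bound: correct if ANY trajectory produced the gold answer."""
    hits = sum(g in answers for answers, g in zip(answers_per_question, gold))
    return hits / len(gold)


# Toy data (made up): three trajectories for each of four multiple-choice questions.
toy_answers: List[List[str]] = [
    ["A", "B", "A"],
    ["C", "D", "D"],
    ["B", "B", "C"],
    ["A", "C", "D"],
]
toy_gold = ["B", "D", "A", "D"]
toy_choice = [0, 1, 0, 2]  # index of the trajectory a selector happened to pick

current = selected_accuracy(toy_answers, toy_gold, toy_choice)
upper = oracle_accuracy(toy_answers, toy_gold)
```

The gap between `upper` and `current` is the headroom a better selection strategy could recover without retraining the generator.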

Breakthrough Assessment

7/10

Strong methodological contribution in unifying different query refinement strategies (rewrite/decompose/disambiguate) into one model with a novel data construction pipeline. Improvements over Self-RAG are consistent.

⚙️ Technical Details

Problem Definition

Setting: Open-domain Question Answering (single-hop and multi-hop)

Inputs: Input query X

Outputs: Final answer Y obtained via a sequence of refinement actions and retrieval

Pipeline Flow

Input Query → Action Selection (Rewrite/Decompose/Disambiguate/Answer)
Refinement → Retrieval (DuckDuckGo) → Context Integration
Tree Decoding (exploring multiple paths) → Selection Strategy (PPL/Confidence/Ensemble) → Final Answer
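
A single path through the flow above can be sketched as a loop that alternates refinement and retrieval until the model decides to answer. This is a simplified illustration under stated assumptions: the three callables stand in for the fine-tuned model and the search engine, `max_steps` is a hypothetical budget, and tree decoding (running several such paths and selecting one) is left out.

```python
from typing import Callable, List, Tuple

# Assumed interfaces (hypothetical): the real system uses one fine-tuned model
# for both action choice and answering, plus an external search engine.
ChooseAction = Callable[[str, List[str]], Tuple[str, str]]  # -> (action, refined query)
Retrieve = Callable[[str], List[str]]
Answer = Callable[[str, List[str]], str]


def run_single_path(
    query: str,
    choose_action: ChooseAction,
    retrieve: Retrieve,
    answer: Answer,
    max_steps: int = 3,
) -> str:
    """Refine and retrieve until the model chooses to answer or the budget runs out."""
    context: List[str] = []
    for _ in range(max_steps):
        action, refined_query = choose_action(query, context)
        if action == "answer":
            break
        # rewrite / decompose / disambiguate all yield a query to search with
        context.extend(retrieve(refined_query))
    return answer(query, context)


# Toy stand-ins so the sketch runs without a model or network access.
SUB_QUERIES = ["What is the population of X?", "What is the population of Y?"]


def toy_choose(query: str, context: List[str]) -> Tuple[str, str]:
    if len(context) < len(SUB_QUERIES):
        return "decompose", SUB_QUERIES[len(context)]
    return "answer", ""


def toy_retrieve(refined_query: str) -> List[str]:
    return [f"doc for: {refined_query}"]


def toy_answer(query: str, context: List[str]) -> str:
    return f"answer from {len(context)} docs"


result = run_single_path("Which is larger, X or Y?", toy_choose, toy_retrieve, toy_answer)
```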

System Modules

Generator / Refiner

Decides whether to refine the query (and how) or answer directly; generates the refined query and final answer

Model or implementation: Llama-2-7B

Retriever

Fetches relevant documents based on the refined query

Model or implementation: DuckDuckGo API (black box)

Selector

Selects the best trajectory from the tree decoding options

Model or implementation: Deterministic algorithm (PPL, Confidence, or Ensemble)
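
The summary names three selection strategies but does not define them, so the sketch below fixes one plausible reading as an assumption: PPL selection takes the trajectory with the lowest perplexity over all generated tokens, confidence selection takes the one whose final-answer tokens have the highest geometric-mean probability, and ensemble selection sums that confidence over trajectories that agree on the same answer. The `Trajectory` class and its fields are hypothetical.

```python
import math
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Trajectory:
    answer: str
    all_logprobs: List[float]     # log-probs of every generated token on the path
    answer_logprobs: List[float]  # log-probs of the final-answer tokens only


def perplexity(logprobs: List[float]) -> float:
    """exp of the negative mean log-probability; lower means less 'surprised'."""
    return math.exp(-sum(logprobs) / len(logprobs))


def confidence(logprobs: List[float]) -> float:
    """Geometric-mean token probability (an assumed confidence measure)."""
    return math.exp(sum(logprobs) / len(logprobs))


def select_by_ppl(trajectories: List[Trajectory]) -> str:
    return min(trajectories, key=lambda t: perplexity(t.all_logprobs)).answer


def select_by_confidence(trajectories: List[Trajectory]) -> str:
    return max(trajectories, key=lambda t: confidence(t.answer_logprobs)).answer


def select_by_ensemble(trajectories: List[Trajectory]) -> str:
    totals: Dict[str, float] = defaultdict(float)
    for t in trajectories:
        totals[t.answer] += confidence(t.answer_logprobs)
    return max(totals, key=lambda ans: totals[ans])


# Toy log-probabilities (made up) chosen so the strategies can disagree.
toy = [
    Trajectory("Paris", [-0.1, -0.1], [-0.5]),
    Trajectory("Lyon", [-0.4, -0.4], [-0.1]),
    Trajectory("Paris", [-0.3, -0.3], [-0.6]),
]
```

On this toy input, PPL and ensemble selection both return "Paris" while confidence selection returns "Lyon", which illustrates why the choice of selector can change the final answer.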

Novel Architectural Elements

Integrated refinement tokens: The model learns to output special tokens (rewrite, decompose, disambiguate) as the first step of generation to branch into different retrieval strategies
Tree-decoding selection mechanism: Selecting the final answer based on intrinsic model metrics (PPL/Confidence) across different refinement types rather than just beam search on tokens
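
To illustrate how the tree of refinement paths grows, the sketch below enumerates every action sequence up to a fixed depth. It assumes that all three refinement actions are available at every step and that each path ends with an answer step; the depth limit and the function name are hypothetical and the paper's actual exploration settings are not given in this summary.

```python
from itertools import product
from typing import List, Tuple

REFINE_ACTIONS = ("rewrite", "decompose", "disambiguate")


def enumerate_paths(max_depth: int) -> List[Tuple[str, ...]]:
    """All refinement sequences of length 0..max_depth, each ending in 'answer'."""
    paths: List[Tuple[str, ...]] = []
    for depth in range(max_depth + 1):
        for seq in product(REFINE_ACTIONS, repeat=depth):
            paths.append(seq + ("answer",))
    return paths


paths = enumerate_paths(2)  # 1 + 3 + 9 = 13 candidate trajectories
```

Because the number of paths grows geometrically with depth, each extra refinement step multiplies the number of generation and retrieval calls, which is the source of the added inference latency.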

Modeling

Base Model: Llama-2-7B

Training Method: Supervised Fine-Tuning (Auto-regressive)

Objective Functions:

Purpose: Maximize likelihood of generating the correct sequence of actions, refined queries, and answers.

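Formally: L = E[log p_M(y | q_1, d_1, ..., x)], where x is the input query, q_i is the i-th refined query, d_i is the set of documents retrieved for q_i, y is the target answer, and p_M is the distribution defined by the model M.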
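
Since training is auto-regressive, the log-likelihood inside the expectation can be expanded token by token with the chain rule. The symbols y_t (the t-th target token), y_{<t} (the tokens before it), and |y| (the target length) are introduced here for illustration and do not appear in the summary.

```latex
\log p_M\bigl(y \mid q_1, d_1, \ldots, x\bigr)
  = \sum_{t=1}^{|y|} \log p_M\bigl(y_t \mid y_{<t},\, q_1, d_1, \ldots, x\bigr)
```

Maximizing L is therefore equivalent to minimizing the usual next-token cross-entropy over the target tokens.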

Training Data:

Constructed ~40k instances using ChatGPT to generate refined queries and regenerate answers based on retrieved contexts
Source tasks: Multi-turn dialogue, decomposition tasks, disambiguation tasks
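
The data construction steps described above (refine the query, retrieve with the refined query, then regenerate the answer from what was actually retrieved) can be sketched as follows. The three callables are hypothetical stand-ins for the ChatGPT and search calls, and the dictionary keys are made up for illustration; the real prompts and data schema are not given in this summary.

```python
from typing import Callable, Dict, List


def build_instance(
    query: str,
    refine: Callable[[str], str],
    retrieve: Callable[[str], List[str]],
    regenerate: Callable[[str, List[str]], str],
) -> Dict[str, object]:
    """Build one training instance whose answer is grounded in retrieved text."""
    refined_query = refine(query)            # stand-in for an LLM refinement call
    documents = retrieve(refined_query)      # stand-in for a search call
    answer = regenerate(query, documents)    # answer rewritten from the retrieved context
    return {
        "query": query,
        "refined_query": refined_query,
        "documents": documents,
        "answer": answer,
    }


# Toy stand-ins so the sketch runs offline.
example = build_instance(
    "Who wrote it?",
    refine=lambda q: "Who wrote the novel Dune?",
    retrieve=lambda q: [f"snippet about: {q}"],
    regenerate=lambda q, docs: f"Answer grounded in {len(docs)} snippet(s)",
)
```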

Key Hyperparameters:

model_size: 7B parameters

Compute: Not reported in the paper

Comparison to Prior Work

vs. Self-RAG: RQ-RAG explicitly modifies the query (rewrite/decompose) before search, whereas Self-RAG focuses on filtering/critiquing results from the original query
vs. SAIL: RQ-RAG regenerates training answers based on actual retrieval results to ensure grounding, rather than just appending search results to original targets
vs. query expansion methods [not cited in paper]: RQ-RAG integrates the decision to expand/refine into the generation model itself via control tokens, rather than as a separate pipeline step

Limitations

Dependency on external black-box search engine (DuckDuckGo) and proprietary models (ChatGPT) for data creation
Inference latency increases due to tree decoding (exploring multiple refinement branches)
Performance upper bound analysis suggests current selection strategies (PPL/Confidence) are suboptimal compared to an oracle

Reproducibility

Code: https://github.com/chanchimin/RQ-RAG

Retrieval relies on DuckDuckGo (external API), and data construction uses ChatGPT (GPT-3.5/GPT-4).

📊 Experiments & Results

Evaluation Setup

Zero-shot and fine-tuned settings on single-hop and multi-hop QA datasets

Benchmarks:

Arc-Challenge (Single-hop QA)
PopQA (Single-hop QA)
OpenbookQA (Single-hop QA)
HotpotQA (Multi-hop QA)
2WikiMultiHopQA (Multi-hop QA)
Musique (Multi-hop QA)

Metrics:

Accuracy (Acc)
Exact Match (EM)
F1 Score
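
For reference, Exact Match and token-level F1 are commonly computed as in the sketch below. The normalization shown (lowercasing, stripping punctuation and English articles) follows the widespread SQuAD-style convention; whether the paper uses exactly this normalization is an assumption.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def f1_score(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

EM rewards only fully correct answers, while F1 gives partial credit for overlapping tokens, so the two can diverge on long or verbose predictions.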

Key Results

RQ-RAG demonstrates superior performance on Single-hop QA tasks compared to the strong baseline Self-RAG.

Benchmark	Metric	Baseline	This Paper	Δ
Arc-Challenge	Acc	50.5	52.7	+2.2
PopQA	Acc	28.5	30.3	+1.8
OpenbookQA	Acc	44.6	46.2	+1.6

RQ-RAG excels in Multi-hop QA tasks where query decomposition is critical.

Benchmark	Metric	Baseline	This Paper	Δ
HotpotQA	EM	23.3	27.6	+4.3
2WikiMultiHopQA	EM	22.2	23.7	+1.5
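
As a quick arithmetic check, the per-benchmark gains and the single-hop average quoted in the highlights can be recomputed from the scores in the table above. The variable names are illustrative only.

```python
# (benchmark, baseline score, RQ-RAG score) copied from the results table
single_hop = [
    ("Arc-Challenge", 50.5, 52.7),
    ("PopQA", 28.5, 30.3),
    ("OpenbookQA", 44.6, 46.2),
]
multi_hop = [
    ("HotpotQA", 23.3, 27.6),
    ("2WikiMultiHopQA", 22.2, 23.7),
]

# Gain in points for each benchmark, rounded to one decimal like the table
deltas = {
    name: round(ours - base, 1)
    for name, base, ours in single_hop + multi_hop
}

# Mean gain over the three single-hop benchmarks
single_hop_avg = round(
    sum(ours - base for _, base, ours in single_hop) / len(single_hop), 1
)
```

The single-hop gains of 2.2, 1.8, and 1.6 points average to about 1.87, which rounds to the 1.9 reported in the highlights.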

Main Takeaways

Query refinement (rewrite/decompose) is more effective than passive retrieval (standard RAG) for complex tasks
Contextualized data construction (regenerating answers based on search results) improves model grounding compared to using original static dataset answers
The 'Upper Bound' analysis shows significant room for improvement if the trajectory selection mechanism can be optimized further

📚 Prerequisite Knowledge

Prerequisites

Retrieval-Augmented Generation (RAG) concepts
Language Model fine-tuning
Tree decoding / Beam search strategies

Key Terms

RAG: Retrieval-Augmented Generation, which enhances LLMs by retrieving external documents to answer queries

Self-RAG: A baseline method that trains models to retrieve, generate, and critique their own outputs using special reflection tokens

PPL: Perplexity, a measure of how well a probability model predicts a sample; lower PPL indicates the model is less 'surprised' by the text

Control tokens: Special tokens added to the vocabulary (e.g., SPECIAL_rewrite) to trigger specific model behaviors like rewriting or decomposing a query

Tree decoding: An inference strategy where the model explores multiple possible action sequences (branches) before selecting the best final output

DuckDuckGo: An internet search engine used here as the retrieval source

Single-hop QA: Questions that can be answered with a single piece of evidence

Multi-hop QA: Questions requiring reasoning across multiple documents or steps to derive an answer