Aligned Query Expansion: Efficient Query Expansion for Information Retrieval through LLM Alignment

📝 Paper Summary

Modularized RAG pipeline

Aligned Query Expansion (AQE) fine-tunes large language models to generate query expansions that directly maximize retrieval effectiveness, eliminating the need for costly post-generation filtering steps.

Core Problem

Generative query expansion using LLMs often produces hallucinations or suboptimal queries, and current solutions rely on a computationally expensive 'generate-then-filter' paradigm.

Why it matters:

Current filtering methods require generating dozens of queries and running a relevance model on each, increasing latency and cost
Standard LLMs are not inherently aligned to prioritize terms that improve downstream retrieval, for example the rank that BM25 assigns to the relevant document
Vocabulary mismatch remains a critical bottleneck in sparse retrieval systems where user queries do not match document terms

Concrete Example: A user asks about 'symptoms of flu'. A standard LLM might generate 50 expansions, some irrelevant or hallucinated. Current methods (like EAR) must generate all 50 and use a separate ranker to filter them, wasting compute. AQE's model is trained to generate only the effective terms in one shot.
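The difference in work between the two paradigms in this example can be sketched as follows. This is a minimal illustration, not the paper's or EAR's implementation: `toy_sample`, `toy_score`, and `toy_generate` are hypothetical stand-ins for the LLM sampler, the separate ranker, and the aligned model, and the call counter exists only to make the cost visible.

```python
from typing import Callable, Dict

def generate_then_filter(
    query: str,
    sample: Callable[[str, int], str],
    score: Callable[[str, str], float],
    n: int = 50,
) -> str:
    """Sample n candidate expansions, score each one, and keep the best."""
    candidates = [sample(query, seed) for seed in range(n)]
    return max(candidates, key=lambda e: score(query, e))

def aligned_one_shot(query: str, generate: Callable[[str], str]) -> str:
    """A single decoding pass with no filtering step."""
    return generate(query)

# Hypothetical stand-ins that only count how often each component is called.
calls: Dict[str, int] = {"generate": 0, "score": 0}

def toy_sample(query: str, seed: int) -> str:
    calls["generate"] += 1
    return f"expansion-{seed}"

def toy_score(query: str, expansion: str) -> float:
    calls["score"] += 1
    return float(expansion.endswith("-7"))  # pretend candidate 7 is the useful one

def toy_generate(query: str) -> str:
    calls["generate"] += 1
    return "expansion-7"

filtered = generate_then_filter("symptoms of flu", toy_sample, toy_score)
filter_calls = dict(calls)

calls.update(generate=0, score=0)
direct = aligned_one_shot("symptoms of flu", toy_generate)
direct_calls = dict(calls)
```

In this toy setup both paths return the same expansion, but the filtered path spends 50 generation calls and 50 scoring calls where the one-shot path spends a single generation call.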

Key Novelty

Direct Alignment for Query Expansion (AQE)

Instead of filtering outputs after generation, AQE fine-tunes the generator itself using alignment techniques (RSFT or DPO) to prefer expansions that result in better retrieval rankings.
It treats the retrieval rank of the ground-truth document as the reward signal, aligning the generation probability with retrieval success.

Architecture

Figure: The training pipeline for Aligned Query Expansion (AQE).

Evaluation Highlights

Reduces inference latency by approximately 70% compared to generate-then-filter approaches like EAR
Outperforms baseline methods (GAR, EAR) in retrieval effectiveness across both in-domain and out-of-domain datasets
Demonstrates significant gains in Recall@1000 and MRR@10 compared to standard zero-shot prompting

Breakthrough Assessment

7/10

Offers a strong efficiency improvement (roughly 70% lower inference latency than generate-then-filter approaches) while maintaining or improving accuracy. Applying alignment techniques such as DPO directly to the retrieval objective is a logical and effective step forward.

⚙️ Technical Details

Problem Definition

Setting: Open-domain passage retrieval and question answering

Inputs: User query q

Outputs: Expanded query e (which is then used to retrieve document d)

Pipeline Flow

Training: Generate Expansions → Rank by Retrieval Metric → Fine-tune (RSFT or DPO)
Inference: Input Query → Aligned LLM (Greedy Decoding) → Expanded Query → Sparse Retrieval (BM25)
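The inference flow above might look like the following sketch. Two things are assumptions rather than details from the paper: that the expansion is appended to the original query before retrieval, and the toy generator and term-overlap retriever, which are hypothetical placeholders for the aligned LLM and BM25.

```python
from typing import Callable, List

def expand_and_retrieve(
    query: str,
    generate_expansion: Callable[[str], str],
    retrieve: Callable[[str, int], List[str]],
    top_k: int = 10,
) -> List[str]:
    """Generate one expansion, append it to the query, and retrieve."""
    expansion = generate_expansion(query)  # single greedy pass, no filtering
    expanded_query = f"{query} {expansion}".strip()  # assumption: append to query
    return retrieve(expanded_query, top_k)

# Hypothetical stand-ins so the sketch runs end to end.
def toy_generator(query: str) -> str:
    return "influenza fever cough" if "flu" in query else ""

def toy_retriever(expanded_query: str, top_k: int) -> List[str]:
    docs = {
        "d1": "influenza causes fever cough and fatigue",
        "d2": "chimney flue cleaning guide",
    }
    terms = set(expanded_query.lower().split())
    # Rank by raw term overlap; a real system would use BM25 here.
    ranked = sorted(docs, key=lambda d: -len(terms & set(docs[d].split())))
    return ranked[:top_k]

results = expand_and_retrieve("symptoms of flu", toy_generator, toy_retriever, top_k=1)
```

Passing the generator and retriever as callables keeps the flow independent of any particular model or search library.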

System Modules

Expansion Generator

Generate query expansions optimized for retrieval

Model or implementation: LLM (the architecture is not named in the paper text; T5 or Llama is plausible given similar works, but this is a guess)

Retriever

Retrieve documents using the expanded query

Model or implementation: BM25 (Sparse Retrieval)
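For reference, a compact BM25 scorer is sketched below. It uses a common formulation (the smoothed IDF with `log(1 + ...)`) and typical defaults `k1=1.2`, `b=0.75`; the paper's actual retrieval toolkit and parameters are not stated in this summary, so treat these as assumptions.

```python
import math
from collections import Counter
from typing import Dict

def bm25_scores(query: str, docs: Dict[str, str],
                k1: float = 1.2, b: float = 0.75) -> Dict[str, float]:
    """Score every document against the query with whitespace tokenization."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized.values()) / n_docs

    # Document frequency: in how many documents does each term occur?
    df: Counter = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))

    scores: Dict[str, float] = {}
    for doc_id, tokens in tokenized.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue  # terms absent from the document contribute nothing
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores[doc_id] = score
    return scores

docs = {
    "d1": "influenza causes fever and cough",
    "d2": "chimney flue cleaning guide",
}
plain = bm25_scores("flu symptoms", docs)
expanded = bm25_scores("flu symptoms influenza fever", docs)
```

The toy corpus also shows the vocabulary mismatch problem: the unexpanded query shares no terms with the relevant document and scores zero, while the expanded query matches it.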

Novel Architectural Elements

Integration of retrieval rank directly into the preference optimization loop (DPO/RSFT) for query expansion

Modeling

Base Model: Not explicitly reported in the paper

Training Method: Direct Preference Optimization (DPO) and Rejection Sampling Fine-Tuning (RSFT)

Objective Functions:

Purpose: Maximize likelihood of best expansions (RSFT).

Formally: L_RSFT = -E[log P_phi(e_best | q)]

Purpose: Optimize preference for better expansions over worse ones (DPO).

Formally: L_DPO = -E[log sigma(beta * log(P_theta(e_best | q) / P_ref(e_best | q)) - beta * log(P_theta(e_worst | q) / P_ref(e_worst | q)))]
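To make the two objectives concrete, the sketch below evaluates each loss for a single (query, expansion) example from sequence log-probabilities. The function names and the default `beta=0.1` are illustrative assumptions; the paper's value of beta is not given in this summary, and a real training run would average these terms over a batch and backpropagate through the policy model.

```python
import math

def rsft_loss(logp_best: float) -> float:
    """Negative log-likelihood of the best expansion under the model."""
    return -logp_best

def dpo_loss(logp_best: float, logp_worst: float,
             ref_logp_best: float, ref_logp_worst: float,
             beta: float = 0.1) -> float:
    """-log sigmoid of the scaled log-ratio margin between best and worst."""
    margin = beta * ((logp_best - ref_logp_best) - (logp_worst - ref_logp_worst))
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch to avoid overflow in exp.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# When the policy equals the reference, the margin is 0 and the loss is log 2.
at_reference = dpo_loss(-1.0, -2.0, -1.0, -2.0)
# Raising the best expansion's log-probability lowers the loss.
after_update = dpo_loss(-0.5, -2.0, -1.0, -2.0)
```

Note that the DPO term depends only on how far the policy has moved from the reference on each expansion, not on the absolute log-probabilities.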

Training Data:

Zero-shot generation of N=50 expansions per query using base LLM
Evaluation of each expansion via BM25 retrieval rank of the ground truth document
Selection of the best expansion (ground-truth document ranked closest to the top) and the worst expansion (ground-truth document ranked farthest from the top) to form preference pairs
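The data construction steps above can be sketched as a small helper. Here `gold_rank` is a hypothetical callable that returns the 1-based BM25 rank of the ground-truth document for a given expansion, and skipping queries where all expansions tie is an assumption, since the summary does not say how ties are handled.

```python
from typing import Callable, List, Optional, Tuple

def build_preference_pair(
    query: str,
    expansions: List[str],
    gold_rank: Callable[[str, str], int],
) -> Optional[Tuple[str, str]]:
    """Return (best, worst) expansions, where rank 1 is the best outcome."""
    if not expansions:
        return None
    ranked = [(gold_rank(query, e), e) for e in expansions]
    best_rank, best = min(ranked, key=lambda pair: pair[0])
    worst_rank, worst = max(ranked, key=lambda pair: pair[0])
    if best_rank == worst_rank:
        return None  # assumption: no preference signal, so skip this query
    return best, worst

# Hypothetical ranks of the ground-truth document for three sampled expansions.
toy_ranks = {
    "influenza fever cough": 2,
    "seasonal illness": 40,
    "weather forecast": 900,
}
pair = build_preference_pair(
    "symptoms of flu",
    list(toy_ranks),
    lambda query, expansion: toy_ranks[expansion],
)
```

For RSFT only the first element of the pair would be used as the fine-tuning target; DPO uses both.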

Key Hyperparameters:

num_generations_N: 50

Compute: Not reported in the paper

Comparison to Prior Work

vs. Doc2Query: Expands the query at inference time rather than documents at indexing time
vs. GAR: Uses aligned generation to optimize retrieval metrics directly rather than standard seq2seq training
vs. EAR: Eliminates the filtering step by fine-tuning the generator, reducing latency

Limitations

Relies on sparse retrieval (BM25) for reward signal; impact on dense retrieval not explicitly detailed
Requires relevant documents (q, d pairs) for training the alignment
Base LLM architecture and size not specified in the text provided

Reproducibility

No code URL or specific model weights provided in the text. Paper mentions concurrent work deploying similar methods in industry but does not provide open artifacts.

📊 Experiments & Results

Evaluation Setup

Open-domain passage retrieval

Benchmarks:

Not explicitly named in text (Passage Retrieval)

Metrics:

Retrieval effectiveness (Recall@1000 and MRR@10 are named in the highlights, but specific values are not in the text)
Latency
Statistical methodology: Not explicitly reported in the paper
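For readers less familiar with the retrieval metrics mentioned in this summary, the sketch below gives the standard definitions of Recall@k and MRR@k. It is a generic illustration, not the paper's evaluation code, and the document identifiers are made up.

```python
from typing import List, Sequence, Set, Tuple

def recall_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of relevant documents that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & relevant_ids)
    return hits / len(relevant_ids)

def reciprocal_rank_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """1 / position of the first relevant document in the top k, else 0."""
    for position, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / position
    return 0.0

def mrr_at_k(runs: List[Tuple[Sequence[str], Set[str]]], k: int) -> float:
    """Mean reciprocal rank over a list of (ranking, relevant set) pairs."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank_at_k(r, rel, k) for r, rel in runs) / len(runs)

ranking = ["d3", "d1", "d2"]
relevant = {"d1"}
recall_top1 = recall_at_k(ranking, relevant, k=1)
recall_top2 = recall_at_k(ranking, relevant, k=2)
rr = reciprocal_rank_at_k(ranking, relevant, k=10)
mrr = mrr_at_k([(ranking, relevant), (["d9"], {"d1"})], k=10)
```

Recall at a large cutoff measures whether the relevant passage is retrieved at all, while MRR at a small cutoff rewards placing it near the top.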

Key Results

The paper claims significant improvements, but the specific numeric tables are not included in the provided text. The only explicit numeric claim available is the latency reduction below.

Benchmark	Metric	Baseline	This Paper	Δ
Not specified	Relative inference latency (%, baseline = 100)	100	30	-70
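Reading the latency row as a baseline normalized to 100 against a value of 30 for AQE, the reported reduction follows from a one-line calculation. The helper name is illustrative, and the inputs are the normalized figures from the table rather than measured timings.

```python
def relative_reduction(baseline: float, new: float) -> float:
    """Fractional reduction of `new` relative to `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - new) / baseline

# (100 - 30) / 100, which corresponds to the roughly 70% reduction reported.
reduction = relative_reduction(100.0, 30.0)
```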

Main Takeaways

AQE eliminates the need for expensive generate-then-filter pipelines by aligning the model to generate effective queries directly
The method leverages recent alignment techniques (DPO, RSFT) to optimize for retrieval rank
Significant latency reduction (~70%) makes it suitable for real-time systems
Performance improvements are observed in both in-domain and out-of-domain settings

📚 Prerequisite Knowledge

Prerequisites

Understanding of sparse retrieval (BM25)
Familiarity with Large Language Models (LLMs) and prompting
Knowledge of LLM alignment techniques (RLHF, DPO)

Key Terms

AQE: Aligned Query Expansion—the proposed method of fine-tuning LLMs to generate retrieval-optimized query expansions

BM25: Best Matching 25—a probabilistic information retrieval function used to rank documents based on query terms

DPO: Direct Preference Optimization—an alignment method that optimizes a model to prefer one response over another without an explicit reward model

RSFT: Rejection Sampling Fine-Tuning—a method where the model is fine-tuned only on the best outputs sampled from its own previous generations

RLHF: Reinforcement Learning from Human Feedback—a technique to align LLMs using a reward model trained on human preferences

GAR: Generation-Augmented Retrieval—a baseline method that expands queries by generating relevant contexts like answers or titles

EAR: Expand, Rerank, and Retrieve. A baseline method that generates multiple query expansions and uses a reranker to select the best one

BoN: Best-of-N—a decoding strategy that samples N responses and selects the best one based on a reward model

zero-shot prompting: Asking an LLM to perform a task without providing any specific training examples in the prompt

hallucination: When an LLM generates content that is factually incorrect or irrelevant to the source material

vocabulary mismatch: The problem where terms in a user's query do not literally match the terms in the relevant documents