Not All Terms Matter: Recall-Oriented Adaptive Learning for PLM-aided Query Expansion in Open-Domain Question Answering

📝 Paper Summary

Query Expansion (QE), Sparse Retrieval

ReAL enhances retrieval accuracy by using a relevance classifier to iteratively learn and assign importance weights to expanded query terms, separating helpful terms from noise.

Core Problem

PLM-aided query expansion methods treat all generated terms uniformly, but many expanded terms are irrelevant or noisy, leading to suboptimal retrieval when used with sparse retrievers.

Why it matters:

Word mismatches in sparse retrieval lead to poor recall, which downstream readers cannot recover from
Current LLM-based expansions generate many common or weakly relevant words that dilute the impact of critical terms if not weighted properly
Existing term weighting methods (like SPLADE) are not designed to dynamically adapt to the specific relevance signals of PLM-generated expansions

Concrete Example: For the query 'who played jason in friday the 13th part 1', standard expansion adds terms like 'Friday', '13th', and 'killer'. A uniform-weight retriever retrieves documents about the movie franchise generally rather than the specific actor 'Ari Lehman', causing the reader to extract the wrong answer.

Key Novelty

Recall-oriented Adaptive Learning (ReAL)

Classify initial retrieved documents into pseudo-relevant and pseudo-irrelevant sets using a strong relevance model (like a cross-encoder)
Iteratively optimize a term weight vector to maximize the score gap between these two sets, effectively learning which expansion terms drive true relevance

Architecture

The iterative workflow of ReAL. It shows how the expanded query retrieves an initial list, which is then classified and used to optimize term weights.

Evaluation Highlights

+2.6% Hit@20 improvement on Natural Questions when applying ReAL to standard BM25 retrieval without query expansion
+1.4% Hit@20 gain on Natural Questions when adding ReAL to the state-of-the-art Query2Doc method
Consistent improvements across four ODQA datasets (NQ, TriviaQA, WebQuestions, CuratedTREC) and five different query expansion baselines

Breakthrough Assessment

7/10

Solid methodological improvement for sparse retrieval. It effectively bridges the gap between generative query expansion and lexical retrieval constraints, showing consistent gains across multiple baselines.

⚙️ Technical Details

Problem Definition

Setting: Open-Domain Question Answering using a Retriever-Reader architecture

Inputs: Natural language question q and an initial expanded query q_e

Outputs: A retrieved set of documents D_q and subsequently an extracted answer

Pipeline Flow

Initial Retrieval: Use expanded query to get top-N documents via sparse retriever
Relevance Classification: Use Cross-Encoder to split documents into pseudo-relevant and pseudo-irrelevant
Adaptive Learning: Iteratively optimize query term weights to separate the scores of these two sets
Final Retrieval: Re-run sparse retrieval with optimized weights
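
The four steps above can be sketched as a single function that wires the stages together. This is a minimal illustration rather than the authors' implementation: `real_pipeline`, the `toy_*` helpers, and the three-document corpus are hypothetical names and data, and `toy_learn` is a crude heuristic standing in for the gradient-based weight learning.

```python
from typing import Callable, Dict, List, Tuple

Weights = Dict[str, float]


def real_pipeline(
    query_terms: List[str],
    retrieve: Callable[[Weights, int], List[str]],
    classify: Callable[[List[str]], Tuple[List[str], List[str]]],
    learn_weights: Callable[[Weights, List[str], List[str]], Weights],
    top_n: int = 100,
) -> List[str]:
    # Step 1: initial retrieval with uniform weights on the expanded query terms.
    weights = {t: 1.0 for t in query_terms}
    initial = retrieve(weights, top_n)
    # Step 2: split the initial list into pseudo-relevant / pseudo-irrelevant.
    pseudo_rel, pseudo_irr = classify(initial)
    # Step 3: adapt the term weights using the two sets.
    weights = learn_weights(weights, pseudo_rel, pseudo_irr)
    # Step 4: re-run retrieval with the learned weights.
    return retrieve(weights, top_n)


# --- Toy stand-ins (assumptions, for illustration only) ---
DOCS = {
    "d1": "ari lehman played jason in the first film",
    "d2": "friday the 13th is a horror franchise",
    "d3": "the franchise has many sequels set on friday",
}


def toy_retrieve(weights: Weights, top_n: int) -> List[str]:
    # Weighted term-frequency scoring as a placeholder for BM25.
    def score(text: str) -> float:
        tokens = text.split()
        return sum(w * tokens.count(t) for t, w in weights.items())
    return sorted(DOCS, key=lambda d: score(DOCS[d]), reverse=True)[:top_n]


def toy_classify(doc_ids: List[str]) -> Tuple[List[str], List[str]]:
    # Oracle stand-in for the relevance classifier.
    rel = [d for d in doc_ids if "lehman" in DOCS[d]]
    return rel, [d for d in doc_ids if d not in rel]


def toy_learn(weights: Weights, rel: List[str], irr: List[str]) -> Weights:
    # Heuristic stand-in: upweight terms seen only in pseudo-relevant docs.
    rel_tokens = {t for d in rel for t in DOCS[d].split()}
    irr_tokens = {t for d in irr for t in DOCS[d].split()}
    return {t: 2.0 if t in rel_tokens and t not in irr_tokens else 0.5
            for t in weights}


ranking = real_pipeline(["jason", "friday", "13th", "franchise"],
                        toy_retrieve, toy_classify, toy_learn, top_n=3)
```

Passing the stages in as callables keeps the control flow visible and lets each stage be swapped for a real retriever, classifier, or optimizer.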

System Modules

Sparse Retriever (Retrieval & Selection)

Retrieve initial document set and provide token-level scores

Model or implementation: BM25
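
A sketch of what "token-level scores" could look like: each query term's BM25 contribution is computed separately, so a per-term weight can be applied before summing. The function names, the IDF variant, and the defaults `k1=1.2`, `b=0.75` are assumptions chosen for illustration; the summary does not state the paper's exact BM25 configuration.

```python
import math
from collections import Counter
from typing import Dict, List


def bm25_term_scores(
    query_terms: List[str],
    doc_tokens: List[str],
    corpus: List[List[str]],
    k1: float = 1.2,
    b: float = 0.75,
) -> Dict[str, float]:
    """Return each query term's BM25 contribution to one document's score."""
    n_docs = len(corpus)
    avg_len = sum(len(d) for d in corpus) / n_docs
    tf = Counter(doc_tokens)
    scores: Dict[str, float] = {}
    for term in set(query_terms):
        df = sum(1 for d in corpus if term in d)  # document frequency
        idf = math.log(1.0 + (n_docs - df + 0.5) / (df + 0.5))
        f = tf[term]
        length_norm = k1 * (1.0 - b + b * len(doc_tokens) / avg_len)
        scores[term] = idf * f * (k1 + 1.0) / (f + length_norm)
    return scores


def weighted_score(term_scores: Dict[str, float], weights: Dict[str, float]) -> float:
    # Terms without an explicit weight default to 1.0 (uniform weighting).
    return sum(weights.get(t, 1.0) * s for t, s in term_scores.items())
```

With all weights equal to 1.0, `weighted_score` reduces to the ordinary BM25 score, which is why uniform weighting is the natural starting point for learning.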

Relevance Classifier (Retrieval & Selection)

Classify retrieved documents as pseudo-relevant or pseudo-irrelevant

Model or implementation: cross-encoder/ms-marco-MiniLM-L-12
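
The classification step might be sketched as a threshold over relevance scores. The fixed threshold and the `overlap_scorer` stand-in are assumptions: the summary does not give the paper's decision rule. In practice `score_fn` could be the `predict` method of a Sentence Transformers `CrossEncoder`, which accepts a list of (query, document) pairs.

```python
from typing import Callable, List, Sequence, Tuple

Pair = Tuple[str, str]


def split_by_relevance(
    query: str,
    docs: List[str],
    score_fn: Callable[[List[Pair]], Sequence[float]],
    threshold: float = 0.0,
) -> Tuple[List[str], List[str]]:
    """Split docs into (pseudo-relevant, pseudo-irrelevant) by score."""
    scores = score_fn([(query, d) for d in docs])
    pseudo_rel = [d for d, s in zip(docs, scores) if s > threshold]
    pseudo_irr = [d for d, s in zip(docs, scores) if s <= threshold]
    return pseudo_rel, pseudo_irr


def overlap_scorer(pairs: List[Pair]) -> List[float]:
    # Hypothetical stand-in for a cross-encoder: shared-word count, shifted
    # so that fewer than two shared words gives a negative score.
    return [
        len(set(q.lower().split()) & set(d.lower().split())) - 1.5
        for q, d in pairs
    ]


rel, irr = split_by_relevance(
    "who played jason",
    ["ari lehman played jason", "friday the 13th franchise"],
    overlap_scorer,
)
```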

Adaptive Learner

Optimize term weight vector W_q to maximize separation between D_pr and D_pi

Model or implementation: Gradient Descent Optimizer (Adam)
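
To make the idea of learning W_q concrete, here is a deliberately simplified sketch. It uses a pairwise hinge loss and plain gradient descent, both of which are assumptions made for illustration; the paper's Distinction and Separation losses and its Adam optimizer are not reproduced here. Each document is represented by its per-term retrieval scores.

```python
from typing import Dict, List

TermScores = Dict[str, float]


def doc_score(term_scores: TermScores, weights: Dict[str, float]) -> float:
    # Weighted sum of a document's per-term scores.
    return sum(w * term_scores.get(t, 0.0) for t, w in weights.items())


def learn_term_weights(
    rel: List[TermScores],   # per-term scores of pseudo-relevant docs (D_pr)
    irr: List[TermScores],   # per-term scores of pseudo-irrelevant docs (D_pi)
    terms: List[str],
    lr: float = 0.05,
    steps: int = 100,
    margin: float = 1.0,
) -> Dict[str, float]:
    weights = {t: 1.0 for t in terms}  # start from uniform weighting
    n_pairs = max(1, len(rel) * len(irr))
    for _ in range(steps):
        grad = {t: 0.0 for t in terms}
        for r in rel:
            for i in irr:
                # Hinge loss max(0, margin - (s_r - s_i)) is active only
                # while the pair is not yet separated by the margin.
                if doc_score(r, weights) - doc_score(i, weights) < margin:
                    for t in terms:
                        grad[t] -= r.get(t, 0.0) - i.get(t, 0.0)
        for t in terms:
            # Descent step; weights are kept non-negative.
            weights[t] = max(0.0, weights[t] - lr * grad[t] / n_pairs)
    return weights


rel_docs = [{"jason": 1.0, "friday": 0.2}]
irr_docs = [{"friday": 1.0}]
learned = learn_term_weights(rel_docs, irr_docs, ["jason", "friday"])
```

In this toy run the weight on "jason" rises and the weight on "friday" falls until the pseudo-relevant document outscores the pseudo-irrelevant one by the margin, after which the updates stop.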

Novel Architectural Elements

Integration of a differentiable term weighting layer into the sparse retrieval loop that is optimized at query-time via pseudo-relevance feedback
Dual-loss objective targeting both broad separation (Distinction) and top-tier separation (Separation) of relevant documents

Modeling

Base Model: BM25 (Retriever), cross-encoder/ms-marco-MiniLM-L-12 (Classifier)

Comparison to Prior Work

vs. Query2Doc: ReAL optimizes the *weights* of the generated terms dynamically, whereas Query2Doc treats them as static text
vs. SPLADE: ReAL optimizes weights at inference time based on query-specific retrieval feedback, rather than learning global sparse representations (SPLADE is discussed in the paper's related work but is not used as a direct baseline)

Limitations

Depends on the quality of the initial retrieval; 'Bad' initial sets yield smaller improvements
Increases computational latency due to the iterative optimization process (e.g., +1.67s for long queries)
Currently limited to sparse retrieval (BM25) and has not been fully explored with dense retrieval integration

Reproducibility

Code: https://github.com/process-cxr/ReAL

Code is publicly available. Relies on standard libraries (Sentence Transformers) and public datasets. Hyperparameters (learning rate, alpha, s, c) are explicitly detailed.

📊 Experiments & Results

Evaluation Setup

Open-domain QA using standard retriever-reader pipeline

Benchmarks:

Natural Questions (NQ) (Open-domain QA)
TriviaQA (Open-domain QA)
WebQuestions (Open-domain QA)
CuratedTREC (Open-domain QA)

Metrics:

Hit@20
Hit@100
EM@20 (Exact Match)
LLM@20 (LLM-based evaluation)

Statistical methodology: Paired t-test (p < 0.01)

Key Results

ReAL improves retrieval accuracy (Hit@20) and end-to-end QA metrics (EM@20, LLM@20) when applied to the Query2Doc baseline.
Benchmark	Metric	Baseline	This Paper	Δ
Natural Questions	Hit@20	71.77	73.43	+1.66
TriviaQA	Hit@20	79.26	80.11	+0.85
WebQuestions	Hit@20	75.39	76.62	+1.23
Natural Questions	EM@20	43.57	45.10	+1.53
Natural Questions	LLM@20	63.49	64.79	+1.30

Experiment Figures

A motivating example showing how traditional QE retrieves irrelevant docs due to weak terms like 'Friday', while ReAL upweights 'Jason' and 'Ari' to find the correct answer.

Main Takeaways

ReAL consistently enhances retrieval accuracy and end-to-end QA performance across all tested datasets and baselines.
The method is robust to the choice of relevance classifier (Cross-Encoder, Bi-Encoder, LLM), though Cross-Encoders offer a good balance of accuracy and speed.
Ablation studies show that both the Distinction and Separation loss functions, as well as the scaling post-processing step, are necessary for optimal performance.
The method improves results even when starting from 'Good' initial retrievals, but relies on having some relevant documents in the initial pool to learn effective weights.

📚 Prerequisite Knowledge

Prerequisites

Sparse retrieval (BM25)
Query Expansion techniques
Relevance feedback mechanisms
Cross-encoder vs. Bi-encoder architectures

Key Terms

Query Expansion (QE): The process of adding related terms to a user's query to improve the chances of matching relevant documents

PLM: Pre-trained Language Model, a large neural network trained on vast amounts of text; used here to generate expansion terms

Sparse Retrieval: Retrieval methods like BM25 that match documents based on exact word overlap, as opposed to dense vector similarity

Pseudo-relevant: Documents retrieved by a first pass that are assumed to be relevant for the purpose of feedback or optimization; in ReAL, a relevance classifier makes this assignment

Cross-encoder: A model that processes the query and document together to output a relevance score, typically more accurate but slower than bi-encoders

Hit@k: The percentage of queries for which at least one of the top-k retrieved documents contains a correct answer
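
A minimal sketch of how Hit@k might be computed. Treating a document as a hit when it contains a gold answer as a case-insensitive substring is an assumption; evaluation scripts differ in how they normalize and match answers, and `hit_at_k` is a hypothetical helper name.

```python
from typing import List


def hit_at_k(
    ranked_docs_per_query: List[List[str]],
    answers_per_query: List[List[str]],
    k: int,
) -> float:
    """Percentage of queries with a gold answer in any of the top-k docs."""
    hits = 0
    for docs, answers in zip(ranked_docs_per_query, answers_per_query):
        top_k = docs[:k]
        if any(a.lower() in d.lower() for d in top_k for a in answers):
            hits += 1
    return 100.0 * hits / len(ranked_docs_per_query)


rankings = [
    ["friday the 13th franchise", "Ari Lehman played Jason"],
    ["paris is in france"],
]
gold = [["Ari Lehman"], ["Berlin"]]
```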

EM (Exact Match): A metric measuring if the predicted answer string exactly matches the ground truth
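
A sketch of an Exact Match check. The normalization used here (lowercasing, removing punctuation and English articles, collapsing whitespace) follows a common QA-evaluation convention and is an assumption, not a detail taken from the paper.

```python
import re
import string
from typing import List


def normalize_answer(text: str) -> str:
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)  # drop English articles
    return " ".join(text.split())               # collapse whitespace


def exact_match(prediction: str, gold_answers: List[str]) -> bool:
    """True if the normalized prediction equals any normalized gold answer."""
    pred = normalize_answer(prediction)
    return any(pred == normalize_answer(g) for g in gold_answers)
```

Under this normalization, `exact_match("The Ari Lehman.", ["ari lehman"])` counts as a match, while a partial answer such as "Lehman" does not.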

ODQA: Open-Domain Question Answering, the task of answering questions using a large collection of documents (like Wikipedia) without a pre-specified context

SPLADE: Sparse Lexical and Expansion Model—a neural retrieval method that learns sparse representations for queries and documents