Harnessing the Power of Reinforcement Learning for Language-Model-Based Information Retriever via Query-Document Co-Augmentation

📝 Paper Summary

Modularized RAG pipeline Retrieval

CoAugRetriever uses reinforcement learning to jointly optimize an LLM that augments both user queries and corpus documents, aligning their semantic representations for better retrieval performance.

Core Problem

Current LLM-based retrieval methods focus only on query rewriting, which is insufficient for challenging corpora where documents themselves are semantically distant from queries or lack sufficient context.

Why it matters:

Enhancing queries alone hits a bottleneck in challenging knowledge domains where accurately retrieving information from a compact corpus is crucial
Simply allowing an LLM to modify documents without coordination yields little benefit and can even degrade performance due to misalignment
Jointly training query and document augmentation is difficult because the reward depends on the interaction of both, creating an intractable action space

Concrete Example: For a query 'carcinogens' targeting a document about '(210)Po', standard models generate related but mismatched terms (query: 'DNA', 'genetic mutation'; document: 'radioactivity'). CoAugRetriever aligns them by generating the shared terms 'radiation' and 'risk' in both the query and document expansions, enabling a successful match.
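
The alignment effect in this example can be illustrated with a toy term-overlap check. This is only a sketch: `overlap_score` is a hypothetical stand-in for a lexical retriever, and the token lists are simplified from the example above rather than taken from actual model outputs.

```python
def overlap_score(query_terms, doc_terms):
    """Count distinct terms shared by query and document (crude lexical match)."""
    return len(set(query_terms) & set(doc_terms))

query = ["carcinogens"]
document = ["(210)po"]

# Uncoordinated expansions: topically related, but no shared vocabulary.
q_uncoordinated = query + ["dna", "genetic", "mutation"]
d_uncoordinated = document + ["radioactivity"]

# Coordinated expansions: both sides add 'radiation' and 'risk'.
q_aligned = query + ["radiation", "risk"]
d_aligned = document + ["radiation", "risk"]

score_uncoordinated = overlap_score(q_uncoordinated, d_uncoordinated)
score_aligned = overlap_score(q_aligned, d_aligned)
print(score_uncoordinated, score_aligned)
```

With no shared terms, a keyword-based retriever has nothing to match on; the two shared expansion terms give it a signal.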

Key Novelty

Bidirectional Reinforcement Learning for Co-Augmentation

Treats both query augmentation and document augmentation as collaborative policies learned by the same LLM via RL
Uses a 'composite sampling' strategy that groups queries with relevant/irrelevant documents into a single batch to make joint training computationally feasible
Introduces a multi-sampling reward estimation that averages retrieval scores across different rollout combinations to handle the large joint action space

Architecture

The training pipeline showing Batch Sampling, Rollout generation for both queries and documents, and the Group-wise Reward/Advantage computation.

Evaluation Highlights

Improves NDCG@10 over the baseline by 0.050 to 0.112 absolute in sparse (BM25) retrieval on in-domain datasets (NFCorpus +0.0582, SciFact +0.0500, FiQA-2018 +0.1118)
Outperforms the query-augmentation-only and document-augmentation-only baselines in the NFCorpus ablation (0.3800 NDCG@10 vs. 0.3496 and 0.3015), supporting the need for collaborative training
Demonstrates cross-benchmark generalization: models trained on one dataset improve performance on unseen domains compared to the base Qwen2.5-7B model in sparse retrieval, while transfer in dense retrieval is more mixed

Breakthrough Assessment

8/10

Proposes a novel bidirectional RL framework that effectively solves the coordination problem between query and document augmentation, yielding significant gains where single-sided augmentation fails.

⚙️ Technical Details

Problem Definition

Setting: Information Retrieval where both queries and documents can be modified/augmented before matching

Inputs: User query q and document collection D

Outputs: Augmented query q' and augmented documents D' used for retrieval ranking

Pipeline Flow

Batch Sampling (Groups queries with pos/neg docs)
Augmentation (LLM generates expansions for both)
Retrieval & Reward (Rank docs using augmented text)
Update (Policy Gradient on LLM)

System Modules

Batch Sampler

Constructs mini-datasets containing a query, relevant documents, and irrelevant documents

Model or implementation: N/A
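
A minimal sketch of how such a mini-dataset might be assembled is shown below. The names `CompositeBatch` and `sample_composite_batch` are hypothetical, and drawing negatives uniformly at random is an assumption; the summary does not say how irrelevant documents are selected.

```python
import random
from dataclasses import dataclass

@dataclass
class CompositeBatch:
    query_id: str
    relevant_doc_ids: list
    irrelevant_doc_ids: list

def sample_composite_batch(query_id, qrels, corpus_ids, num_neg, rng):
    """Group one query with its relevant docs and randomly drawn irrelevant docs."""
    relevant = sorted(qrels[query_id])
    candidates = [d for d in corpus_ids if d not in qrels[query_id]]
    # Assumption: negatives are sampled uniformly from the non-relevant docs.
    negatives = rng.sample(candidates, min(num_neg, len(candidates)))
    return CompositeBatch(query_id, relevant, negatives)

# Toy relevance judgments and corpus (illustrative IDs only).
qrels = {"q1": {"d1", "d3"}}
corpus_ids = ["d1", "d2", "d3", "d4", "d5"]
batch = sample_composite_batch("q1", qrels, corpus_ids, num_neg=2, rng=random.Random(0))
print(batch)
```

Keeping the query and its candidate documents in one unit means the reward for any rollout can be computed locally, without retrieving over the full corpus.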

Augmenter

Generates additional text/keywords for both queries and documents

Model or implementation: Qwen2.5-7B (shared policy)

Reward Calculator

Computes retrieval effectiveness of augmented pairs

Model or implementation: Underlying Retriever (BM25 or BGE)

Novel Architectural Elements

Query-Document Composite Sampling: Batches are structured as {query + relevant_docs + irrelevant_docs} rather than independent samples
Multi-sampling Reward Strategy: Randomly selects 1 document rollout per doc while evaluating all query rollouts to approximate the full combinatorial reward matrix efficiently
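
The multi-sampling idea can be sketched as follows. In each round, one rollout is drawn per document and every query rollout is scored against that sampled document set; scores are then averaged over rounds. Several pieces are assumptions made for illustration: `toy_reward` replaces NDCG@10 over BM25 with a top-1 term-overlap check, and crediting each sampled document rollout with the round's mean query score is one plausible choice, not a detail confirmed by the summary.

```python
import random
from collections import defaultdict

def toy_reward(query_text, doc_texts, relevant_ids):
    """Stand-in for NDCG@10: 1.0 if the best term-overlap match is relevant, else 0.0."""
    q_terms = set(query_text.split())
    ranked = sorted(doc_texts, key=lambda d: (-len(q_terms & set(doc_texts[d].split())), d))
    return 1.0 if ranked[0] in relevant_ids else 0.0

def multi_sample_rewards(query_rollouts, doc_rollouts, relevant_ids, num_rounds, rng):
    """Estimate one reward per query rollout and per sampled document rollout.

    query_rollouts: list of augmented query strings
    doc_rollouts:   dict doc_id -> list of augmented document strings
    """
    q_totals = [0.0] * len(query_rollouts)
    d_totals = defaultdict(float)
    d_counts = defaultdict(int)
    for _ in range(num_rounds):
        # Pick ONE rollout per document for this round.
        chosen = {d: rng.randrange(len(r)) for d, r in doc_rollouts.items()}
        doc_texts = {d: doc_rollouts[d][i] for d, i in chosen.items()}
        # Evaluate ALL query rollouts against the sampled document set.
        round_scores = [toy_reward(q, doc_texts, relevant_ids) for q in query_rollouts]
        for qi, score in enumerate(round_scores):
            q_totals[qi] += score
        # Assumption: a document rollout is credited with the round's mean query score.
        round_mean = sum(round_scores) / len(round_scores)
        for d, i in chosen.items():
            d_totals[(d, i)] += round_mean
            d_counts[(d, i)] += 1
    q_rewards = [t / num_rounds for t in q_totals]
    d_rewards = {key: d_totals[key] / d_counts[key] for key in d_totals}
    return q_rewards, d_rewards

query_rollouts = ["carcinogens radiation risk", "carcinogens dna mutation"]
doc_rollouts = {
    "pos": ["polonium radiation risk", "polonium radioactivity"],
    "neg": ["stock market dividend", "stock market risk"],
}
q_rewards, d_rewards = multi_sample_rewards(
    query_rollouts, doc_rollouts, {"pos"}, num_rounds=8, rng=random.Random(0)
)
print(q_rewards, d_rewards)
```

With Q query rollouts, N documents, and R rollouts per document, the full reward matrix has on the order of Q × R^N combinations; sampling a fixed number of rounds keeps the cost linear in the number of rounds.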

Modeling

Base Model: Qwen2.5-7B

Training Method: Reinforcement Learning (Custom Policy Gradient)

Objective Functions:

Purpose: Optimize the augmentation policy to maximize retrieval performance.

Formally: Policy gradient using advantage A computed from NDCG scores.

Purpose: Stabilize training by centering rewards within groups.

Formally: A = (r - r_mean) (No division by std dev, unlike GRPO).
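
The difference between the two advantage definitions can be seen on a small group of rewards. This is a sketch only: `std_normalized_advantages` follows the commonly used GRPO-style form with an assumed stabilizer `eps`, and the reward values are made up for illustration.

```python
import statistics

def centered_advantages(rewards):
    """Centering only: A = r - mean(r)."""
    mean = statistics.fmean(rewards)
    return [r - mean for r in rewards]

def std_normalized_advantages(rewards, eps=1e-6):
    """GRPO-style: A = (r - mean(r)) / (std(r) + eps). eps is an assumed stabilizer."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

identical = [0.40, 0.40, 0.40, 0.40]
near_identical = [0.400, 0.401, 0.400, 0.399]

centered_same = centered_advantages(identical)
centered_near = centered_advantages(near_identical)
normalized_near = std_normalized_advantages(near_identical)
print(centered_near)
print(normalized_near)
```

For the near-identical group, the centered advantages stay around 0.001 in magnitude, while dividing by the tiny standard deviation rescales the same differences to order 1, so negligible reward differences drive full-sized updates.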

Adaptation: Full fine-tuning (implied, as RL updates are applied to the model)

Trainable Parameters: Model parameters of Qwen2.5-7B

Training Data:

BEIR benchmark datasets (NFCorpus, SciFact, FiQA-2018)

Key Hyperparameters:

max_steps: 300 per dataset
reward_metric: NDCG@10

Compute: Not explicitly reported in the paper (mentions 'computationally expensive' and resource constraints limiting steps)

Comparison to Prior Work

vs. Query-Only: CoAugRetriever additionally augments the corpus documents so that their representation aligns with the augmented queries
vs. GRPO/REINFORCE++: Uses centralization-only advantage (no std-dev normalization) to handle identical rewards and query difficulty variance

Limitations

Computational cost of training is high due to RL and document processing
Generalization in dense retrieval settings is variable across domains (negative transfer on FiQA)
Requires re-indexing the entire document corpus after training (inference overhead)
Experiments limited to 300 steps and one base model size due to resource constraints

Reproducibility

Code: https://github.com/liujm2001/CoAugRetriever

Code is publicly available at the repository above. Experiments use open datasets (BEIR), and the base model has open weights (Qwen2.5-7B). Hyperparameters such as learning rate and batch size are not explicitly detailed in the text.

📊 Experiments & Results

Evaluation Setup

Information Retrieval on BEIR benchmark datasets

Benchmarks:

NFCorpus (Bio-medical Information Retrieval)
SciFact (Scientific Fact Verification/Retrieval)
FiQA-2018 (Financial Question Answering)

Metrics:

NDCG@10
Statistical methodology: Not explicitly reported in the paper

Key Results

Main comparison showing CoAugRetriever improves over baselines in sparse (BM25) settings.

Benchmark	Metric	Baseline	This Paper	Δ
NFCorpus (Sparse)	NDCG@10	0.3218	0.3800	+0.0582
SciFact (Sparse)	NDCG@10	0.6789	0.7289	+0.0500
FiQA-2018 (Sparse)	NDCG@10	0.2361	0.3479	+0.1118

Ablation study on NFCorpus demonstrating the necessity of collaborative training versus single-sided augmentation.

Benchmark	Metric	Baseline	This Paper	Δ
NFCorpus (Sparse)	NDCG@10	0.3496	0.3800	+0.0304
NFCorpus (Sparse)	NDCG@10	0.3015	0.3800	+0.0785
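
The Δ column reports absolute NDCG@10 differences. As a quick arithmetic check, the sketch below recomputes the absolute gains for the main comparison and derives the corresponding relative gains; the numbers are copied from the table, and the variable names are illustrative.

```python
# (baseline, this paper) NDCG@10 from the main sparse comparison.
results = {
    "NFCorpus": (0.3218, 0.3800),
    "SciFact": (0.6789, 0.7289),
    "FiQA-2018": (0.2361, 0.3479),
}

gains = {}
for name, (baseline, ours) in results.items():
    absolute = round(ours - baseline, 4)
    relative_pct = round(100 * (ours - baseline) / baseline, 1)
    gains[name] = (absolute, relative_pct)
    print(f"{name}: +{absolute:.4f} absolute, +{relative_pct:.1f}% relative")
```

Relative gains are much larger than absolute ones on datasets with a low baseline, such as FiQA-2018, so the two should not be read interchangeably.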

Experiment Figures

Comparison of advantage computation methods (GRPO vs REINFORCE++ vs Ours)

Case study comparing word distributions in augmented texts

Main Takeaways

Collaborative training is essential: Document augmentation alone can degrade performance, and query augmentation alone hits a ceiling. Joint training aligns the two distributions.
Custom RL formulation required: Standard GRPO fails because dividing by the within-group standard deviation amplifies noise when rewards in a group are identical or nearly so. The proposed centralization-only approach is stable.
Cross-entropy analysis confirms alignment: The word distribution divergence between augmented queries and documents is significantly lower for the collaboratively trained model.
Robust generalization in sparse retrieval: The policy learned on one dataset transfers well to others for BM25, though dense retrieval generalization is more mixed.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (Policy Gradient methods)
Information Retrieval metrics (NDCG)
Sparse (BM25) vs. Dense (Embedding-based) Retrieval

Key Terms

NDCG: Normalized Discounted Cumulative Gain—a measure of ranking quality that accounts for the position of relevant items
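
A minimal sketch of NDCG@k is given below, using the linear-gain form (some toolkits use 2^rel - 1 instead). As a simplifying assumption, the ideal ranking is computed from the relevance labels of the ranked list itself, which presumes all judged-relevant documents appear in that list.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain: each gain is divided by log2(position + 1)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, k=10):
    """DCG of the given ranking divided by the DCG of the ideal ordering."""
    ideal = sorted(ranked_relevances, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(ranked_relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

perfect = ndcg_at_k([1, 0, 0])    # relevant doc ranked first
demoted = ndcg_at_k([0, 1])       # relevant doc ranked second
no_relevant = ndcg_at_k([0, 0])   # nothing relevant retrieved
print(perfect, demoted, no_relevant)
```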

BM25: Best Matching 25—a probabilistic retrieval function used to rank documents based on query terms appearing in each document
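
For reference, a compact sketch of Okapi BM25 scoring follows. It uses the non-negative IDF variant and the common defaults `k1=1.5` and `b=0.75`; these are assumptions for illustration and not the settings used in the paper.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score every tokenized document against the tokenized query."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(d) for d in docs_tokens) / n_docs
    # Document frequency: number of docs containing each term.
    doc_freq = Counter(t for d in docs_tokens for t in set(d))
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [["polonium", "radiation", "risk"], ["stock", "market", "dividend"]]
scores = bm25_scores(["radiation", "risk"], docs)
print(scores)
```

Because the score is driven entirely by exact term matches, adding shared expansion terms to both the query and the document directly raises the score of the intended pair.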

BGE: BAAI General Embedding—a state-of-the-art dense retrieval model that maps text to vector embeddings

GRPO: Group Relative Policy Optimization—an RL algorithm that normalizes rewards within a group of outputs generated from the same input to reduce variance

REINFORCE++: A variant of the REINFORCE algorithm that uses batch-wide normalization instead of group-wise normalization

rollout: A single execution path of the model (generating an augmentation) used to estimate rewards

sparse retrieval: Retrieval based on matching specific keywords (tokens) between query and document

dense retrieval: Retrieval based on semantic similarity between vector representations of query and document

RLHF: Reinforcement Learning from Human Feedback—fine-tuning models using rewards derived from human preferences