Reinforcement Fine-Tuning for History-Aware Dense Retriever in RAG

📝 Paper Summary

Modularized RAG pipeline

HARR optimizes the retriever component of RAG systems using reinforcement learning with a history-aware state representation to solve state aliasing in multi-hop reasoning.

Core Problem

Retrievers are typically optimized with proxy objectives (such as supervised relevance) that are misaligned with final answer quality, and standard RL is hard to apply because retrieval is deterministic and states are ambiguous in multi-step reasoning.

Why it matters:

Independent optimization of retrievers and LLMs creates an objective mismatch: relevant documents might not lead to correct answers.
Scaling LLM fine-tuning is resource-intensive; optimizing the lighter retriever is more efficient but harder to align with end-to-end goals.
In multi-hop QA, the same query can arise from different reasoning contexts, confusing the retriever if it ignores history.

Concrete Example: In a multi-hop question, a query like 'Where was he born?' might appear twice. Without knowing the retrieval history (i.e., *who* 'he' refers to based on previous steps), the retriever cannot distinguish these states, leading to inconsistent rewards and failed learning.

Key Novelty

History-Aware Reinforced Retriever (HARR)

Replaces deterministic Top-k retrieval with probabilistic sampling (Plackett-Luce model) to create a stochastic policy optimizable by RL.
Augments the retriever's state with the full retrieval history (past queries and observations) to resolve ambiguity where identical queries imply different information needs.
Optimizes the retriever directly on the final answer F1 score using Group Relative Policy Optimization (GRPO), aligning retrieval behavior with downstream performance.
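
The stochastic retrieval step described above can be illustrated with a small sketch of Plackett-Luce sampling: documents are drawn one at a time without replacement, each with probability proportional to exp(score / temperature) among the remaining candidates. The function name, the temperature parameter, and the toy scores are assumptions made for illustration; they are not taken from the paper's code.

```python
import math
import random

def plackett_luce_sample(scores, k, temperature=1.0, rng=None):
    """Sample an ordered list of k distinct indices and return it with its log-probability.

    Hypothetical helper: `scores` are retriever similarity scores for a candidate pool.
    """
    if not 0 < k <= len(scores):
        raise ValueError("k must be between 1 and len(scores)")
    rng = rng or random.Random()
    remaining = list(range(len(scores)))
    chosen, log_prob = [], 0.0
    for _ in range(k):
        # Softmax over the candidates that have not been picked yet.
        logits = [scores[i] / temperature for i in remaining]
        peak = max(logits)  # subtract the max for numerical stability
        weights = [math.exp(x - peak) for x in logits]
        pos = rng.choices(range(len(remaining)), weights=weights, k=1)[0]
        log_prob += math.log(weights[pos] / sum(weights))
        chosen.append(remaining.pop(pos))
    return chosen, log_prob

# Toy usage: four candidate documents, sample an ordered pair.
docs, logp = plackett_luce_sample([2.0, 1.0, 0.5, -1.0], k=2, rng=random.Random(0))
```

The returned log-probability is the quantity a policy-gradient method would differentiate; in a real system the scores would come from the trainable state encoder rather than a fixed list.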

Architecture

The MDP formulation of HARR, showing the interaction between the Retriever (Agent) and the LLM (Environment).

Evaluation Highlights

Achieves consistent F1 improvements of +1.5 to +4.0 points over standard baselines on HotpotQA, 2WikiMultihopQA, and MuSiQue.
Outperforms LLM-centric optimization methods (like Self-RAG) while freezing the LLM, demonstrating the efficacy of lightweight retriever tuning.
Shows robust generalization across different retriever backbones (Contriever, BGE) and LLM sizes (7B, 13B).

Breakthrough Assessment

8/10

Effective application of RL to dense retrieval by addressing the specific challenges of discreteness and state aliasing. Strong empirical results with a lightweight approach.

⚙️ Technical Details

Problem Definition

Setting: Multi-hop Retrieval-Augmented Generation formulated as a Markov Decision Process (MDP)

Inputs: Initial query q0 and a large corpus of documents D

Outputs: Final answer y generated after T steps of retrieval and reasoning

Pipeline Flow

LLM generates sub-query q_t given history
Retriever samples k documents D_t based on q_t and history H_{t-1}
LLM generates observation o_t summarizing D_t
Repeat until termination, then LLM generates final answer y
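
The pipeline flow above can be sketched as a simple rollout loop. All four callables (generate_subquery, retrieve, summarize, answer) and the toy corpus are hypothetical stand-ins for the LLM and retriever; the sketch only shows how the history is threaded through each step, not how the paper implements it.

```python
def rollout(question, generate_subquery, retrieve, summarize, answer, max_steps=4):
    """Run one multi-hop episode and return (final_answer, history)."""
    history = []  # list of (sub_query, observation) pairs
    for _ in range(max_steps):
        sub_query = generate_subquery(question, history)
        if sub_query is None:  # the LLM signals termination
            break
        docs = retrieve(sub_query, history)  # the retriever sees the history too
        observation = summarize(sub_query, docs)
        history.append((sub_query, observation))
    return answer(question, history), history

# Toy stand-ins so the loop can be executed end to end (illustrative only).
CORPUS = {
    "who directed film x": "Film X was directed by Jane Doe.",
    "where was jane doe born": "Jane Doe was born in Oslo.",
}
PLAN = ["who directed film x", "where was jane doe born"]

def toy_subquery(question, history):
    return PLAN[len(history)] if len(history) < len(PLAN) else None

def toy_retrieve(sub_query, history):
    return [CORPUS[sub_query]]

def toy_summarize(sub_query, docs):
    return docs[0]

def toy_answer(question, history):
    # Take the last word of the final observation as the answer.
    return history[-1][1].rstrip(".").split()[-1]

final, trace = rollout(
    "Where was the director of Film X born?",
    toy_subquery, toy_retrieve, toy_summarize, toy_answer,
)
```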

System Modules

Retriever Policy

Select documents based on current state (history + query)

Model or implementation: Dense Retriever (e.g., Contriever-MS MARCO) with learnable state encoder

LLM Environment

Generate sub-queries, observations, and final answer

Model or implementation: Fixed LLM (e.g., Llama-2-7B-Chat)

Novel Architectural Elements

History-aware state encoder for the retriever that concatenates retrieval history with the current query to distinguish reasoning states
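
As a rough illustration of how concatenating history resolves state aliasing, the sketch below builds the text that a state encoder might embed. The build_state helper, the "query:" / "observation:" prefixes, and the separator string are assumptions for illustration; the summary does not specify the exact input template.

```python
def build_state(history, query, sep=" [SEP] "):
    """Concatenate past (sub_query, observation) pairs with the current query."""
    parts = []
    for past_query, observation in history:
        parts.append(f"query: {past_query}")
        parts.append(f"observation: {observation}")
    parts.append(f"query: {query}")
    return sep.join(parts)

# The same sub-query arising from two different reasoning contexts.
same_query = "Where was he born?"
state_a = build_state([("Who directed Film X?", "John Doe directed Film X.")], same_query)
state_b = build_state([("Who wrote Novel Y?", "Richard Roe wrote Novel Y.")], same_query)

# Without history both states would be the identical string `same_query`;
# with history the encoder receives two distinguishable inputs.
states_differ = state_a != state_b
```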

Modeling

Base Model: Retriever: Contriever-MS MARCO / BGE-Large-en-v1.5; LLM: Llama-2-7B-Chat / Llama-2-13B-Chat

Training Method: Reinforcement Learning (GRPO)

Objective Functions:

Purpose: Maximize expected terminal reward (F1 score).

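Formally (standard clipped form): J_GRPO(θ) = E[ (1/G) Σ_{i=1}^{G} min( r_i(θ) A_i, clip(r_i(θ), 1-ε, 1+ε) A_i ) ], where r_i(θ) is the probability ratio between the updated and sampling policies for trajectory i, A_i is its group-relative advantage, G is the group size, and ε is the clipping parameter.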
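
A minimal sketch of the two ingredients of this objective, assuming the commonly used GRPO formulation: advantages are obtained by standardizing rewards within a group of rollouts for the same question, and the surrogate clips the probability ratio as in PPO. Whether the paper normalizes by the standard deviation or adds a KL penalty is not stated in this summary, so both function names and those details are assumptions.

```python
import math

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize terminal rewards (e.g. answer F1) within one group of rollouts."""
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(variance)
    return [(r - mean) / (std + eps) for r in rewards]

def clipped_surrogate(new_logps, old_logps, advantages, clip_eps=0.2):
    """Average of min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A) over the group."""
    terms = []
    for new_lp, old_lp, adv in zip(new_logps, old_logps, advantages):
        ratio = math.exp(new_lp - old_lp)
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        terms.append(min(ratio * adv, clipped * adv))
    return sum(terms) / len(terms)

# Toy group of G = 4 rollouts for one question, rewarded by final-answer F1.
advantages = group_relative_advantages([1.0, 0.0, 0.5, 0.5])
objective = clipped_surrogate([-1.0, -2.0, -1.5, -1.5], [-1.0, -2.0, -1.5, -1.5], advantages)
```

Because the advantages are centered within the group, no learned value function is needed; rollouts that beat the group average are reinforced and the rest are discouraged.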

Trainable Parameters: State encoder of the retriever only (Document encoder is frozen to save compute)

Key Hyperparameters:

group_size_G: 4 (inferred from the text rather than stated explicitly; a typical value for GRPO)
clipping_epsilon: 0.2
temperature_alpha: Not reported in the provided text
learning_rate: Not reported in the paper
top_k_candidate_pool: Documents are sampled from the top-K candidates (K > k) rather than from the full corpus, as an approximation

Compute: Frozen document encoder allows precomputed embeddings; training only updates lightweight state encoder
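
Because the document encoder is frozen, document embeddings can be computed once and reused, and only the state vector changes during training. The sketch below shows one way a top-K candidate pool could be built from precomputed embeddings by dot-product similarity before sampling k documents from it. The helper names and the tiny two-dimensional embeddings are illustrative assumptions; a real system would use an approximate nearest-neighbor index rather than a linear scan.

```python
import heapq

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def top_k_pool(state_vec, doc_embeddings, pool_size):
    """Return [(doc_index, score), ...] for the pool_size highest-scoring documents."""
    scored = ((dot(state_vec, emb), idx) for idx, emb in enumerate(doc_embeddings))
    best = heapq.nlargest(pool_size, scored)
    return [(idx, score) for score, idx in best]

# Precomputed once, since the document encoder is not updated.
DOC_EMBEDDINGS = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]

# The state vector would come from the trainable state encoder.
pool = top_k_pool([1.0, 0.2], DOC_EMBEDDINGS, pool_size=2)
pool_indices = [idx for idx, _ in pool]
```

Sampling k documents from this pool instead of the full corpus keeps the softmax normalization cheap, at the cost of never exploring documents outside the top K.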

Comparison to Prior Work

vs. ITER-RETGEN: HARR optimizes the retriever, whereas ITER-RETGEN uses a fixed retriever.
vs. Self-RAG: HARR optimizes the retriever and keeps the LLM fixed, avoiding expensive LLM fine-tuning.
vs. REPLUG: HARR uses direct RL with sparse rewards rather than knowledge distillation from the LLM.
vs. Reward-RAG: HARR addresses multi-hop state aliasing with history-aware states and uses GRPO, while Reward-RAG typically uses single-step contrastive objectives.

Limitations

Relies on sparse terminal rewards (final answer F1), which makes credit assignment difficult over long horizons.
Requires ground-truth answers for training.
GRPO requires sampling multiple trajectories per question during training, which adds computational cost.
Sampling from a top-K candidate pool instead of the full corpus is an approximation that limits exploration.

Reproducibility

Code: https://github.com/zyc140345/HARR

The paper uses Contriever and BGE as retriever backbones. Hyperparameters such as the learning rate are not detailed in the provided text.

📊 Experiments & Results

Evaluation Setup

Multi-hop Question Answering

Benchmarks:

HotpotQA (Multi-hop QA)
2WikiMultihopQA (Multi-hop QA)
MuSiQue (Multi-hop QA)

Metrics:

F1 score
Exact Match (EM)
Statistical methodology: Not explicitly reported in the paper
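
For reference, the F1 and Exact Match metrics listed above are usually computed at the token level after light answer normalization. The sketch below follows the widely used SQuAD-style convention (lowercasing, stripping punctuation and English articles); the paper's exact normalization is not given in this summary, so treat these details as assumptions.

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, gold):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))

def f1_score(prediction, gold):
    """Token-level F1 between the normalized prediction and gold answer."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

# Toy usage: a partially correct prediction gets partial F1 credit but no EM credit.
em = exact_match("Paris, France", "Paris")
f1 = f1_score("Paris, France", "Paris")
```

The partial credit that F1 gives to near-miss answers is one reason it is a smoother terminal reward than Exact Match.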

Key Results

HARR consistently outperforms baselines on standard multi-hop QA datasets.

Benchmark	Metric	Baseline	This Paper	Δ
HotpotQA	F1	48.2	51.4	+3.2
2WikiMultihopQA	F1	45.1	47.3	+2.2
MuSiQue	F1	26.5	28.9	+2.4

Ablation studies confirm the importance of history-aware states.

Benchmark	Metric	Baseline	This Paper	Δ
HotpotQA	F1	49.1	51.4	+2.3

Main Takeaways

Reinforcement fine-tuning of the retriever significantly improves end-to-end RAG performance compared to fixed retrievers or LLM-only adaptation.
History-aware state representations are crucial for multi-hop reasoning; omitting history degrades performance back to near-baseline levels.
The method is robust across different retriever backbones (Contriever, BGE) and LLM sizes.
Stochastic sampling (Plackett-Luce) combined with GRPO effectively bridges the gap between deterministic retrieval and RL optimization.

📚 Prerequisite Knowledge

Prerequisites

Markov Decision Processes (MDP)
Reinforcement Learning (Policy Gradient)
Dense Retrieval
Retrieval-Augmented Generation (RAG)

Key Terms

HARR: History-Aware Reinforced Retriever—the proposed framework

GRPO: Group Relative Policy Optimization—an RL algorithm that estimates advantages by comparing multiple sampled outputs for the same input, removing the need for a value function

Plackett-Luce model: A probability distribution for ranking items, used here to sample ordered lists of documents stochastically

state aliasing: A situation in RL where different environment states appear identical to the agent (e.g., same query but different history), preventing optimal decision making

sparse terminal reward: A reward signal received only at the end of the episode (final answer accuracy), with no intermediate feedback

sub-query: An intermediate search query generated by the LLM during multi-hop reasoning

dense retriever: A retrieval model that uses vector embeddings to find relevant documents