Domain Adaptation for Conversational Query Production with the RAG Model Feedback

📝 Paper Summary

Modularized RAG pipeline

DAMF adapts a search query generator to new domains without human labels by using retrieval scores from a trained RAG model as reinforcement learning rewards.

Core Problem

Search query producers trained on one domain struggle in new domains, but obtaining human annotations for every new domain is costly.

Why it matters:

Existing weak supervision methods (using BM25 scores as rewards) fail when commercial search engines return noisy web pages (ads/irrelevant info)
Conversations vary significantly; some turns don't require external knowledge, and forcing query generation on these turns hurts model performance
Commercial search engines are often black boxes, making direct gradient propagation impossible, necessitating robust reinforcement learning approaches

Concrete Example: In a dialogue about 'Miyun Reservoir', the conversation shifts to 'Tianzhuang Reservoir'. A BM25-based system incorrectly scores queries about the old topic highly due to word overlap. The proposed RAG-based feedback correctly identifies 'Tianzhuang Reservoir' as the better query because it retrieves documents that actually help generate the correct response.

Key Novelty

Domain Adaptation with Model Feedback (DAMF)

Replaces surface-level BM25 rewards with deep semantic feedback from a trained Retrieval-Augmented Generation (RAG) model to guide the query producer
Filters training instances where generated queries are indistinguishable or low-quality, preventing the model from learning from noise or unnecessary search turns
Uses knowledge distillation from the source-domain model to regularize training, ensuring the policy doesn't drift too far from a good initialization

Architecture

The 3-stage domain adaptation workflow: (1) Generating candidate queries and retrieving documents (offline cache); (2) Training a RAG model on target data to learn document relevance; (3) Tuning the query producer via RL using RAG scores as rewards.

Evaluation Highlights

+3.17 Unigram F1 improvement over strong Self-Training baselines on the noisy DuSinc -> KdConv domain adaptation setting
Significantly outperforms BM25-based feedback methods (WSMF), improving R@1 by ~2.4% in clean settings and Unigram F1 by ~3.2% in noisy settings
Achieves higher query quality than 8-shot in-context learning with text-davinci-003 (GPT-3.5) on target domains

Breakthrough Assessment

7/10

Solid improvement over existing weak supervision for query generation. Replacing BM25 with RAG feedback is logical and effective, though the framework relies on existing RL and KD techniques.

⚙️ Technical Details

Problem Definition

Setting: Conversational Query Production: Given dialogue history X_{<t}, generate a search query q to retrieve documents K_q that aid in generating response u_t.

Inputs: Concatenated dialogue history X_{<t} = u_1, ..., u_{t-1}

Outputs: Search query q

Pipeline Flow

Offline Cache Creation: Generate candidate queries with the source model (beam search) plus extracted keywords, retrieve documents for each candidate, and cache the results
RAG Training: Train a RAG model on target domain using cached documents to learn document relevance
RL Adaptation: Fine-tune query producer using RAG retrieval scores as rewards, regularized by source model
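
The caching and reward steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `search_fn`, `score_fn`, and the choice of taking the best document score as a query's reward are assumptions made for the example, and RAG training (stage 2) is represented only by the stand-in `score_fn`.

```python
from typing import Callable, Dict, List


def build_offline_cache(
    candidate_queries: List[str],
    search_fn: Callable[[str], List[str]],  # hypothetical black-box search call
    top_k: int = 5,
) -> Dict[str, List[str]]:
    """Query the search engine once per distinct candidate and keep top_k documents."""
    cache: Dict[str, List[str]] = {}
    for query in candidate_queries:
        if query not in cache:
            cache[query] = search_fn(query)[:top_k]
    return cache


def query_rewards(
    candidates: List[str],
    cache: Dict[str, List[str]],
    score_fn: Callable[[str], float],  # hypothetical stand-in for the RAG retriever score
) -> List[float]:
    """Reward each candidate with the best score among its cached documents (assumed aggregation)."""
    return [
        max((score_fn(doc) for doc in cache.get(query, [])), default=0.0)
        for query in candidates
    ]


# Toy usage with stand-in search and scoring functions.
toy_index = {
    "miyun reservoir": ["doc about miyun"],
    "tianzhuang reservoir": ["doc about tianzhuang", "another tianzhuang doc"],
}
cache = build_offline_cache(list(toy_index), lambda q: toy_index[q])
rewards = query_rewards(
    list(toy_index), cache, lambda d: 0.9 if "tianzhuang" in d else 0.2
)
```

Caching by query string means the search engine is called only once per distinct candidate, which matters when the engine is a slow or rate-limited black box.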

System Modules

Query Producer

Generates search queries based on dialogue context

Model or implementation: T5-base (English) / Mengzi-T5-base (Chinese)

RAG Model (Retriever + Generator)

Evaluates the quality of candidate queries by estimating how useful retrieved documents are for generating the ground-truth response

Model or implementation: T5-base (English) / Mengzi-T5-base (Chinese) with shared encoder

Novel Architectural Elements

Use of a trained RAG model's retriever score as a semantic reward signal for RL-based query generation adaptation
Filtering mechanism based on reward spread (beta-filtering) and absolute threshold (alpha-filtering) to identify noisy or unnecessary query instances
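
One plausible reading of the two filters is sketched below. The exact decision rule is an assumption: here an instance is kept only if its best candidate reward reaches alpha (otherwise the turn likely needs no search or all candidates are poor) and the gap between best and worst reward reaches beta (otherwise the candidates are indistinguishable). The function name `keep_instance` is hypothetical; the default thresholds are the RAG values listed under Key Hyperparameters.

```python
from typing import Sequence


def keep_instance(rewards: Sequence[float], alpha: float = 0.4, beta: float = 0.2) -> bool:
    """Return True if a training instance passes both (assumed) filters."""
    best = max(rewards)
    spread = best - min(rewards)
    passes_alpha = best >= alpha   # at least one candidate is good in absolute terms
    passes_beta = spread >= beta   # candidates are distinguishable from each other
    return passes_alpha and passes_beta


print(keep_instance([0.7, 0.3, 0.2]))  # good best query, clear spread
print(keep_instance([0.3, 0.05]))      # best query below alpha
print(keep_instance([0.6, 0.55]))      # rewards nearly identical
```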

Modeling

Base Model: T5-base (English), Mengzi-T5-base (Chinese)

Training Method: REINFORCE with RAG-based feedback and Knowledge Distillation regularization

Objective Functions:

Purpose: Maximize expected reward (query quality).

Formally: L_rl = -Δ(r, r_bar) * log p(q | X_{<t})

Purpose: Regularize adaptation to prevent catastrophic forgetting of source domain knowledge.

Formally: L_kd = -log p(q_hat | X_{<t}), where q_hat is generated by the source model
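
A numeric sketch of how the two objectives might combine is given below. Several details are assumptions rather than facts from the summary: Δ(r, r_bar) is taken to be r - r_bar, r_bar is taken to be the mean reward over the sampled candidates, and the total loss is taken to be L_rl + lambda * L_kd with lambda = 0.1 (the listed lambda_kd_weight). In actual training the log-probabilities would be differentiable tensors; plain floats are used here to keep the example self-contained.

```python
from typing import Sequence


def reinforce_loss(rewards: Sequence[float], log_probs: Sequence[float]) -> float:
    """Mean of -(r - r_bar) * log p(q | X_<t) over candidates, with r_bar = mean reward (assumed)."""
    r_bar = sum(rewards) / len(rewards)
    terms = [-(r - r_bar) * lp for r, lp in zip(rewards, log_probs)]
    return sum(terms) / len(terms)


def kd_loss(log_prob_q_hat: float) -> float:
    """-log p(q_hat | X_<t), where q_hat is the source model's query."""
    return -log_prob_q_hat


def total_loss(
    rewards: Sequence[float],
    log_probs: Sequence[float],
    log_prob_q_hat: float,
    lambda_kd: float = 0.1,
) -> float:
    """Assumed combination: L_rl + lambda_kd * L_kd."""
    return reinforce_loss(rewards, log_probs) + lambda_kd * kd_loss(log_prob_q_hat)


# Two candidates: the higher-reward query gets a negative loss term,
# so minimizing the loss raises its log-probability.
loss = total_loss(rewards=[0.9, 0.1], log_probs=[-1.0, -2.0], log_prob_q_hat=-0.5)
```

Subtracting a baseline leaves the expected gradient unchanged while reducing its variance, which is the usual motivation for the r_bar term in REINFORCE.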

Training Data:

Source: Wizard-of-Internet (En), DuSinc (Zh)
Target: Wizard-of-Wikipedia (En), KdConv (Zh), DuConv (Zh)
Offline cache created by beam search (N=5) on source model + keyword extraction

Key Hyperparameters:

learning_rate: 5e-5
batch_size_adaptation: 256
candidate_queries_N: 5
top_k_documents: 5
lambda_kd_weight: 0.1
alpha_threshold: 0.4 (RAG) / 20 (BM25)
beta_spread_threshold: 0.2 (RAG) / 5 (BM25)
gamma_confidence_threshold: 0.6 (RAG) / 30 (BM25)

Compute: Not reported in the paper

Comparison to Prior Work

vs. WSMF: Replaces BM25 (surface overlap) with RAG model scores (semantic relevance) for RL rewards
vs. Self-Training: Uses reinforcement learning with quality-based filtering rather than just supervised learning on pseudo-labels
vs. WebGPT/Sparrow [not cited in paper]: Uses model-based feedback (RAG) instead of human feedback (RLHF) to reduce costs

Limitations

The RAG retriever uses a cross-encoder structure, which makes retrieval slower than with a bi-encoder
Requires an offline cache of documents which is time-consuming to create
Relies on the assumption that source domain model provides a decent initialization
Performance gain in clean settings (Wikipedia) is marginal compared to noisy settings

Reproducibility

Code: https://github.com/DeepLearnXMU/DAMF

The code is publicly available. The datasets (WoI, WoW, DuSinc, KdConv, DuConv) are also public, and the authors provide annotations for the KdConv and DuConv test sets. The offline cache must be generated before training.

📊 Experiments & Results

Evaluation Setup

Unsupervised Domain Adaptation: Train on Source, Adapt to Target without labels. Tested in 'Clean' (Wikipedia) and 'Noisy' (Commercial Search Engine) settings.

Benchmarks:

Wizard-of-Internet -> Wizard-of-Wikipedia (Clean setting, En)
DuSinc -> KdConv (Noisy setting, Zh)
DuSinc -> DuConv (Noisy setting, Zh)

Metrics:

Recall@K (R@1, R@3, R@5) for clean setting (document hit rate)
Unigram F1 for generated queries
BLEU-1/2 for generated queries
ROUGE-1/2/L for generated queries
Perplexity (PPL) for downstream response generation
Statistical methodology: Average results of 3 runs reported. Significance testing mentioned ('significantly better') but specific test/p-values not detailed in text.
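
For reference, the two headline metrics can be computed along the following lines. This is a generic sketch rather than the paper's evaluation script: tokenization is left to the caller (whitespace for English and character-level for Chinese would be typical choices, but that is an assumption), and Recall@K is implemented as a per-example hit indicator, matching the "document hit rate" description above.

```python
from collections import Counter
from typing import Sequence, Set


def unigram_f1(pred_tokens: Sequence[str], gold_tokens: Sequence[str]) -> float:
    """Harmonic mean of unigram precision and recall, counting repeated tokens."""
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def recall_at_k(ranked_doc_ids: Sequence[str], gold_doc_ids: Set[str], k: int) -> float:
    """1.0 if any gold document appears in the top-k results, else 0.0."""
    return float(any(doc_id in gold_doc_ids for doc_id in ranked_doc_ids[:k]))


f1 = unigram_f1("tianzhuang reservoir location".split(), "tianzhuang reservoir".split())
hit_at_1 = recall_at_k(["d3", "d1", "d7"], {"d1"}, k=1)
hit_at_3 = recall_at_k(["d3", "d1", "d7"], {"d1"}, k=3)
```

Corpus-level scores are then the average of these per-example values over the test set.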

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Results in the 'Clean' setting (WoI -> WoW) show modest gains, as BM25 is already a strong signal for Wikipedia documents.
Wizard-of-Wikipedia	R@1	67.63	68.34	+0.71
Results in the 'Noisy' setting (DuSinc -> KdConv, DuSinc -> DuConv) show substantial gains, indicating robustness against commercial search engine noise.
KdConv	Unigram F1	64.27	67.44	+3.17
KdConv	BLEU-1	63.71	65.98	+2.27
DuConv	Unigram F1	68.82	72.32	+3.50
Ablation studies demonstrate the contribution of Knowledge Distillation and RL components.
KdConv	Unigram F1	66.89	67.44	+0.55
KdConv	Unigram F1	65.49	67.44	+1.95

Experiment Figures

Comparison of R@K scores using BM25 vs RAG feedback across dialogue turns.

Main Takeaways

RAG-based feedback significantly outperforms BM25 feedback in 'Noisy' settings (commercial search engines), where semantic matching is crucial.
Filtering strategies (alpha/beta filtering) effectively remove instances that don't need external knowledge or are ambiguous, stabilizing training.
Knowledge Distillation regularization prevents the model from forgetting the source domain and stabilizes the RL process.
The method outperforms Large Language Models (text-davinci-003) in few-shot settings for this specific task.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (REINFORCE algorithm)
Retrieval-Augmented Generation (RAG)
Knowledge Distillation
Domain Adaptation

Key Terms

RAG: Retrieval-Augmented Generation—AI systems that answer questions by first searching for relevant documents, then generating responses based on what they find

BM25: Best Matching 25—a ranking function used by search engines to estimate the relevance of documents to a given search query based on keyword matching

REINFORCE: A basic policy gradient reinforcement learning algorithm that updates model parameters to maximize expected rewards

Knowledge Distillation: A technique where a smaller or target model is trained to reproduce the behavior (output probabilities) of a larger or source model

Bi-encoder: A retrieval architecture that encodes query and document separately into vectors

Cross-encoder: A retrieval architecture that processes query and document simultaneously to capture deeper interactions, used here for the RAG retriever

T5: Text-to-Text Transfer Transformer—a pre-trained language model that treats all NLP tasks as a text generation problem

PPL: Perplexity—a measurement of how well a probability model predicts a sample; lower values indicate better performance

MIPS: Maximum Inner Product Search: the task of finding the vector in a database that has the highest dot product with a query vector, typically solved with fast (often approximate) search algorithms