RAFT: Adapting language model to domain specific RAG

📝 Paper Summary

Modularized RAG pipeline, Domain adaptation

RAFT adapts language models to domain-specific RAG by fine-tuning them on a mix of relevant and distractor documents, teaching the model to ignore irrelevant context and cite evidence.

Core Problem

Existing methods either fine-tune models on domain data (ignoring test-time retrieval imperfections) or use RAG with generic models (failing to learn domain style/patterns), leading to poor performance when distractors are present.

Why it matters:

LLMs are increasingly used in specialized domains (legal, medical, enterprise) where general knowledge is less critical than maximizing accuracy on specific documents
Standard domain-specific fine-tuning (DSF) often encourages memorization rather than reasoning from context, so the model underperforms in open-book settings where documents are supplied at test time
RAG-based in-context learning fails to leverage the learning opportunity afforded by the fixed domain setting

Concrete Example: When asked 'Who is the screenwriter of [Movie]?', a standard fine-tuned (DSF) model might answer with the title of a well-known film by that screenwriter instead of the screenwriter's name, whereas RAFT returns the name and cites the provided document.

Key Novelty

Retrieval Augmented Fine Tuning (RAFT)

Prepares fine-tuning data where each sample contains a question, a set of documents (including 'golden' relevant ones and 'distractor' irrelevant ones), and a chain-of-thought answer citing the text
Intentionally removes the 'golden' document from a subset of the training data (the (1-P)% of samples that do not retain it), so the model must memorize some answers while learning to extract others from the context
Trains the model to explicitly ignore distractor documents that do not help answer the question, simulating imperfect retrieval at test time
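The sample layout described above can be sketched as follows. This is a minimal illustration, not the authors' code: the `RaftSample` class, the `build_sample` function, and every parameter name are hypothetical, and the caller decides whether a given sample keeps the golden document.

```python
import random
from dataclasses import dataclass


@dataclass
class RaftSample:
    question: str
    context: list[str]   # documents shown to the model, in shuffled order
    answer: str          # chain-of-thought target that cites the text
    has_golden: bool     # whether the golden document is in the context


def build_sample(question, golden_doc, distractor_pool, cot_answer,
                 keep_golden, num_distractors=3, rng=None):
    """Assemble one training sample (hypothetical helper, names are assumptions)."""
    if rng is None:
        rng = random.Random()
    # Draw distractors without replacement from the pool of irrelevant documents.
    context = rng.sample(distractor_pool, num_distractors)
    if keep_golden:
        context.append(golden_doc)
    # Shuffle so the golden document does not always sit in the same position.
    rng.shuffle(context)
    return RaftSample(question, context, cot_answer, keep_golden)
```

Shuffling the context is a design choice of this sketch; it keeps the model from learning a positional shortcut instead of reading the documents.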

Architecture

The high-level design principle for RAFT, illustrating the data preparation process where questions are paired with documents (some golden, some distractor) to train the model.

Evaluation Highlights

+35.25 percentage points on HotpotQA compared to Llama2-7B-chat with RAG
+76.35 percentage points on Torch Hub (API documentation) compared to Llama2-7B-chat with RAG
RAFT outperforms domain-specific fine-tuning (DSF) + RAG by wide margins (e.g., +30.87 percentage points on HotpotQA), indicating that standard fine-tuning alone is insufficient for RAG robustness

Breakthrough Assessment

8/10

Simple yet highly effective data construction recipe that bridges the gap between fine-tuning and RAG. Addresses a critical robustness issue (distractors) in practical RAG deployments.

⚙️ Technical Details

Problem Definition

Setting: Domain-specific open-book question answering

Inputs: Question Q and a set of retrieved documents D (containing relevant D* and distractors D_k)

Outputs: Answer A* with chain-of-thought reasoning and citations

Pipeline Flow

Data Preparation (Construct Q + D + A triplets with CoT and citations)
Distractor Injection (Add irrelevant documents to context)
Fine-tuning (Train model on mixed dataset)
Inference (RAG retrieves top-k, RAFT model generates answer)
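The inference step of the flow above might look like the sketch below. Everything here is an assumption made for illustration: the word-overlap scorer is a toy stand-in for a real retriever, and the prompt layout (`[Document i]`, `Question:`, `Answer:`) is not taken from the paper.

```python
def score(question: str, document: str) -> int:
    """Toy relevance score: number of distinct question words found in the document."""
    q_words = set(question.lower().split())
    d_words = set(document.lower().split())
    return len(q_words & d_words)


def retrieve_top_k(question: str, corpus: list[str], k: int) -> list[str]:
    """Return the k highest-scoring documents (stand-in for a real retriever)."""
    ranked = sorted(corpus, key=lambda doc: score(question, doc), reverse=True)
    return ranked[:k]


def build_prompt(question: str, documents: list[str]) -> str:
    """Concatenate the retrieved documents and the question into one prompt."""
    blocks = [f"[Document {i}]\n{doc}" for i, doc in enumerate(documents, start=1)]
    return "\n\n".join(blocks) + f"\n\nQuestion: {question}\nAnswer:"
```

Because top-k retrieval returns k documents regardless of relevance, some of them will usually be distractors, which is the situation the fine-tuned model is trained to handle.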

System Modules

Data Generator

Create training samples with questions, mixed documents (golden + distractors), and CoT answers

Model or implementation: GPT-4-1106 (used to generate CoT reasoning and citations)

RAFT Model

Generate answers citing specific segments from the provided context while ignoring distractors

Model or implementation: Llama-2-7B-chat (fine-tuned)
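To make use of cited answers downstream, the response has to be split into quotes and a final answer. The sketch below assumes the model wraps quotes in `##begin_quote##` / `##end_quote##` and introduces the final answer with `##Answer:`. This summary does not specify the output format, so treat those marker strings and both function names as assumptions.

```python
import re

# Assumed marker strings; adjust to whatever format the fine-tuned model emits.
QUOTE_RE = re.compile(r"##begin_quote##(.*?)##end_quote##", re.DOTALL)
ANSWER_MARKER = "##Answer:"


def parse_response(response: str) -> tuple[list[str], str]:
    """Split a response into its cited quotes and its final answer."""
    quotes = [q.strip() for q in QUOTE_RE.findall(response)]
    # Everything after the last answer marker is treated as the final answer.
    _, found, tail = response.rpartition(ANSWER_MARKER)
    answer = tail.strip() if found else ""
    return quotes, answer


def unsupported_quotes(quotes: list[str], documents: list[str]) -> list[str]:
    """Return quotes that do not appear verbatim in any provided document."""
    return [q for q in quotes if not any(q in doc for doc in documents)]
```

A check like `unsupported_quotes` is one simple way to flag responses whose citations do not actually occur in the supplied context.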

Novel Architectural Elements

Training data composition strategy: Mixing samples with and without 'golden' documents to balance memorization vs. extraction
Integration of distractors directly into the fine-tuning stage to simulate test-time RAG noise

Modeling

Base Model: Llama-2-7B-chat

Training Method: Supervised Fine-Tuning (SFT) on constructed RAFT dataset

Objective Functions:

Purpose: Minimize the difference between generated answer and ground truth CoT answer.

Formally: Standard causal language modeling loss (Cross-Entropy).
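Written out in the notation of the Problem Definition, the objective can be stated as below. Here a_1, ..., a_T denote the tokens of the target answer A*, and D is the set of documents placed in the context. Computing the loss only over the answer tokens is an assumption of this sketch (a common SFT choice), not something this summary confirms.

```latex
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta\!\left(a_t \mid a_{<t},\, Q,\, D\right)
```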

Adaptation: Full fine-tuning

Training Data:

Positive samples (P%): Q + D* (golden) + D_distractors -> A*
Negative samples ((1-P)%): Q + D_distractors -> A* (forcing memorization/answering without context if answer is known, or stating unanswerable)
Includes Chain-of-Thought reasoning and citations in targets
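One way to realize the P% / (1-P)% split over a whole dataset is to fix the exact number of positive samples and shuffle the assignment. The function name `golden_flags` and its parameters are hypothetical, and an exact-count split is a choice of this sketch rather than a detail reported here.

```python
import random


def golden_flags(num_questions: int, p_percent: float, seed: int = 0) -> list[bool]:
    """Mark which questions keep the golden document: True for about P% of them."""
    num_positive = round(num_questions * p_percent / 100)
    flags = [True] * num_positive + [False] * (num_questions - num_positive)
    # Seeded shuffle so the assignment is reproducible across runs.
    random.Random(seed).shuffle(flags)
    return flags
```

For example, `golden_flags(10, 80)` marks 8 of 10 questions as positive samples; the remaining 2 would be trained with distractors only.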

Key Hyperparameters:

learning_rate: 2e-5
batch_size: 16
epochs: 3

Compute: Not reported in the paper

Comparison to Prior Work

vs. DSF: RAFT includes retrieved documents (including distractors) in training, teaching the model to extract rather than just memorize
vs. Llama2 + RAG: RAFT aligns the model's answering style and improves robustness to irrelevant documents via fine-tuning
vs. LUMOS [not cited in paper]: LUMOS separates RAG into modular planning/grounding steps; RAFT integrates robustness into a single generation model via data augmentation

Limitations

Relies on GPT-4 to generate high-quality Chain-of-Thought training data
PubMed QA improvements were marginal compared to DSF+RAG, possibly due to binary (yes/no) nature of the task
Optimal percentage of golden documents (P) varies by dataset and requires tuning

Reproducibility

Code: https://github.com/ShishirPatil/gorilla

The paper describes the data construction logic (the P% hyperparameter) and lists the datasets (PubMed QA, HotpotQA, APIBench). GPT-4 is used to generate the CoT data.

📊 Experiments & Results

Evaluation Setup

Domain-specific open-book QA where models are provided with retrieved documents (including distractors) at test time.

Benchmarks:

HotpotQA (Wikipedia-based QA)
PubMed QA (Biomedical QA)
Gorilla APIBench (HuggingFace, Torch Hub, TensorFlow Hub) (Code/API generation from documentation)
Natural Questions (NQ) (Wikipedia-based QA)
Trivia QA (Wikipedia-based QA)

Metrics:

Accuracy
Statistical methodology: Not explicitly reported in the paper

Key Results

Main comparison results showing RAFT's superior performance against baselines across multiple datasets.
Benchmark	Baseline system	Metric	Baseline	This Paper (RAFT)	Δ (points)
HotpotQA	Llama2-7B-chat + RAG	Accuracy (%)	28.18	63.43	+35.25
HotpotQA	DSF + RAG	Accuracy (%)	32.56	63.43	+30.87
Torch Hub	Llama2-7B-chat + RAG	Accuracy (%)	20.59	96.94	+76.35
HuggingFace	Not specified	Accuracy (%)	43.18	74.59	+31.41
HotpotQA	Not specified	Accuracy (%)	57.78	63.43	+5.65
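The Δ column is an absolute difference in accuracy, not a relative improvement. The short check below recomputes it from the table values and shows how a relative gain would be computed for comparison. The helper names are illustrative only.

```python
# (benchmark, baseline accuracy, RAFT accuracy), copied from the table above
rows = [
    ("HotpotQA", 28.18, 63.43),
    ("HotpotQA", 32.56, 63.43),
    ("Torch Hub", 20.59, 96.94),
    ("HuggingFace", 43.18, 74.59),
    ("HotpotQA", 57.78, 63.43),
]


def absolute_gain(baseline: float, ours: float) -> float:
    """Difference in accuracy, in percentage points."""
    return round(ours - baseline, 2)


def relative_gain(baseline: float, ours: float) -> float:
    """Improvement as a percentage of the baseline accuracy."""
    return round(100 * (ours - baseline) / baseline, 1)


for name, baseline, ours in rows:
    print(f"{name}: +{absolute_gain(baseline, ours)} points, "
          f"+{relative_gain(baseline, ours)}% relative")
```

The two measures diverge sharply when the baseline is low, so it matters which one a headline number refers to.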

Experiment Figures

Ablation study on the percentage of training data (P) that includes the golden document.

Effect of the number of distractor documents during training on test-time performance.

Main Takeaways

RAFT consistently outperforms both standard Llama2 with RAG and Domain-Specific Fine-tuning (DSF) with RAG.
Including distractor documents during training significantly improves robustness at test time compared to training only with golden documents.
The optimal percentage of training data containing 'golden' documents (P) varies (40-100%) but is often less than 100%, suggesting some forced memorization or 'negative' training helps.
Chain-of-Thought (CoT) style responses in training data enhance reasoning capabilities and robustness.

📚 Prerequisite Knowledge

Prerequisites

Retrieval-Augmented Generation (RAG)
Supervised Fine-Tuning (SFT)
Chain-of-Thought (CoT) prompting

Key Terms

RAFT: Retrieval Augmented Fine Tuning—a training recipe that fine-tunes models to distinguish between relevant and irrelevant retrieved documents

Distractor documents: Retrieved documents that do not contain the answer to the query, used to test or train model robustness

Golden documents: Documents that contain the ground truth answer to the query

DSF: Domain-Specific Fine-tuning—standard supervised fine-tuning on domain data without the specific retrieval-augmented structure

Chain-of-Thought: A prompting/training technique where the model generates intermediate reasoning steps before the final answer

RAG: Retrieval-Augmented Generation—AI systems that answer questions by first searching for relevant documents

SFT: Supervised Fine-Tuning—training a pre-trained model on a labeled dataset