RAG Foundry: A Framework for Enhancing LLMs for Retrieval Augmented Generation

📝 Paper Summary

Modularized RAG pipeline, RAG Frameworks

RAG Foundry is an open-source framework that integrates data creation, training, inference, and evaluation into a single workflow to facilitate rapid prototyping and fine-tuning of LLMs for RAG use cases.

Core Problem

Implementing RAG systems is complex due to intricate design decisions (retrieval algorithms, prompt design) and the difficulty of ensuring reproducibility and evaluation across disjoint workflows.

Why it matters:

Design decisions like text embedding, indexing, and prompting significantly impact performance but are hard to optimize in isolation
Reproducibility is difficult due to variations in preprocessing, data, and hardware across different experiments
Existing tools (LangChain, LlamaIndex) focus on inference pipelines but lack integrated, robust support for training and comprehensive RAG-specific evaluation

Concrete Example: A researcher wanting to test if Chain-of-Thought (CoT) improves a specific QA task currently has to manually curate data, run a separate retrieval process, format prompts, fine-tune a model using a different library, and then run a separate evaluation script. RAG Foundry unifies these steps into a single configuration-driven flow.

Key Novelty

Integrated RAG Experimentation Framework

Unifies the typically disjoint stages of RAG development (dataset creation, training, inference, evaluation) into a single library controlled by configuration files
Introduces a modular 'processing' pipeline that persists RAG interactions (retrieval results, prompt templates) into a fixed dataset format, so that training and inference consume identically formatted inputs
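The idea of persisting RAG interactions can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the template text, the field names (query, docs, prompt, answer), the JSONL file layout, and the toy example are not taken from the RAG Foundry code base. The point is only that the prompt is rendered once, stored next to its inputs, and read back unchanged by any later stage.

```python
import json
import tempfile
from pathlib import Path
from string import Template

# Hypothetical prompt template; field names are illustrative only.
PROMPT = Template("Context:\n$docs\n\nQuestion: $query\nAnswer:")


def build_record(query, retrieved_docs, answer):
    """Render the prompt once and store it next to the inputs that produced it."""
    prompt = PROMPT.substitute(docs="\n".join(retrieved_docs), query=query)
    return {"query": query, "docs": retrieved_docs, "prompt": prompt, "answer": answer}


def save_jsonl(records, path):
    """Persist one JSON object per line so later stages read a fixed format."""
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")


def load_jsonl(path):
    """Read the persisted records back, exactly as they were written."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]


# Toy example (made up for illustration, not from the paper's datasets).
records = [
    build_record(
        "Who wrote Hamlet?",
        ["Hamlet is a tragedy by William Shakespeare."],
        "William Shakespeare",
    )
]
path = Path(tempfile.mkdtemp()) / "processed.jsonl"
save_jsonl(records, path)
reloaded = load_jsonl(path)
```

Because training and inference would both call load_jsonl on the same file, neither stage re-runs retrieval or re-renders the template, which is one way to keep the two from drifting apart.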

Architecture

[Figure: Overview of the RAG Foundry framework workflow and modules]

Evaluation Highlights

Fine-tuning Llama-3-8B-Instruct with RAG interactions improves Exact Match on TriviaQA from 0.722 (Baseline) to 0.916 (RAG-sft)
Using Chain-of-Thought (CoT) fine-tuning on Phi-3-mini increases STR-EM on ASQA from 0.109 (Baseline) to 0.386 (CoT-sft)
The experiments show that different datasets call for different RAG strategies: for Llama-3, CoT fine-tuning lowers TriviaQA performance (-0.083 EM vs RAG-sft) but raises ASQA performance (+0.134 STR-EM vs RAG-sft)

Breakthrough Assessment

7/10

While not introducing a new model architecture, it provides a significant engineering contribution by unifying the fractured RAG development workflow, enabling systematic comparison of techniques.

⚙️ Technical Details

Problem Definition

Setting: Enhancing Large Language Models for Retrieval-Augmented Generation tasks through systematic data augmentation, fine-tuning, and evaluation

Inputs: Raw datasets, external knowledge bases, and configuration files defining processing steps

Outputs: Processed datasets, fine-tuned RAG models, and evaluation metrics (local and global)

Pipeline Flow

Data Creation (Loading → Augmentation/Retrieval → Prompt Construction)
Training (Fine-tuning on augmented data)
Inference (Generating predictions using fine-tuned models)
Evaluation (Computing metrics on predictions)
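The four stages above can be sketched as plain functions chained by one configuration object. This is a toy under stated assumptions, not the framework's API: the function names, the config keys, the word-overlap retriever, and the "training" step (which only memorizes prompt-to-answer pairs) are all hypothetical stand-ins chosen so that the example runs with the standard library alone.

```python
import re


def tokens(text):
    """Lowercased word set, used by the toy retriever."""
    return set(re.findall(r"\w+", text.lower()))


def create_data(raw, knowledge_base, template):
    """Loading -> retrieval -> prompt construction."""
    processed = []
    for ex in raw:
        # Toy retrieval: pick the document sharing the most words with the question.
        best_doc = max(knowledge_base, key=lambda d: len(tokens(d) & tokens(ex["question"])))
        prompt = template.format(context=best_doc, question=ex["question"])
        processed.append({**ex, "context": best_doc, "prompt": prompt})
    return processed


def train(train_set):
    """Stand-in for fine-tuning: memorize prompt -> answer pairs."""
    return {ex["prompt"]: ex["answer"] for ex in train_set}


def infer(model, dataset):
    """Generate one prediction per processed prompt."""
    return [model.get(ex["prompt"], "") for ex in dataset]


def evaluate(predictions, dataset):
    """Case-insensitive exact match averaged over the dataset."""
    hits = sum(p.strip().lower() == ex["answer"].strip().lower()
               for p, ex in zip(predictions, dataset))
    return {"EM": hits / len(dataset)}


def run_pipeline(config):
    """Run all four stages from a single configuration dictionary."""
    data = create_data(config["raw"], config["knowledge_base"], config["template"])
    model = train(data)
    predictions = infer(model, data)
    return evaluate(predictions, data)


# Hypothetical configuration with made-up toy data.
config = {
    "raw": [
        {"question": "What is the capital of France?", "answer": "Paris"},
        {"question": "What is the capital of Germany?", "answer": "Berlin"},
    ],
    "knowledge_base": [
        "Paris is the capital of France.",
        "Berlin is the capital of Germany.",
    ],
    "template": "Context: {context}\nQuestion: {question}\nAnswer:",
}
metrics = run_pipeline(config)
processed = create_data(config["raw"], config["knowledge_base"], config["template"])
```

The toy evaluates on the same examples it memorized, so its score says nothing about generalization; it only shows how each stage consumes the previous stage's output.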

System Modules

Data Creation

Augment datasets with retrieved context and format them for training/inference

Model or implementation: Configurable (supports HuggingFace datasets, Haystack retrievers)

Training

Fine-tune the LLM on the processed RAG dataset

Model or implementation: Supports HuggingFace models (e.g., Llama-3, Phi-3)

Inference

Generate answers using the trained model and processed prompts

Model or implementation: Fine-tuned LLM

Evaluation

Compute performance metrics on generated outputs

Model or implementation: Various evaluator models (e.g., GPT-4 for RAGAS, BERT for BERTScore)

Novel Architectural Elements

Unified configuration-driven workflow linking data persistence, training, and evaluation
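Abstraction of 'Global' vs 'Local' processing steps: local steps act on individual examples (e.g., retrieval, prompt formatting), while global steps act on the dataset as a whole (e.g., filtering, join operations), so complex data logic fits within a single pipeline config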
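One way to picture the Global/Local split is the sketch below, which takes local steps to transform one example at a time and global steps to see the whole dataset. The class names, step functions, and toy data are hypothetical and do not mirror RAG Foundry's actual step classes.

```python
class LocalStep:
    """Applies a function to each example independently."""

    def __init__(self, fn):
        self.fn = fn

    def __call__(self, dataset):
        # Copy each example so the input dataset is left untouched.
        return [self.fn(dict(ex)) for ex in dataset]


class GlobalStep:
    """Applies a function that needs to see the whole dataset at once."""

    def __init__(self, fn):
        self.fn = fn

    def __call__(self, dataset):
        return self.fn(list(dataset))


def add_prompt(example):
    """Local logic: only needs the fields of a single example."""
    example["prompt"] = f"Question: {example['question']}\nAnswer:"
    return example


def drop_duplicate_questions(dataset):
    """Global logic: deciding what is a duplicate requires all examples."""
    seen, kept = set(), []
    for ex in dataset:
        if ex["question"] not in seen:
            seen.add(ex["question"])
            kept.append(ex)
    return kept


def run(dataset, steps):
    """Apply the steps in order, each consuming the previous step's output."""
    for step in steps:
        dataset = step(dataset)
    return dataset


# Made-up toy data with one duplicate question.
raw = [
    {"question": "Who wrote Hamlet?"},
    {"question": "Who wrote Hamlet?"},
    {"question": "Who painted Guernica?"},
]
result = run(raw, [GlobalStep(drop_duplicate_questions), LocalStep(add_prompt)])
```

Keeping both kinds of step behind the same callable interface is what lets a single ordered list, and therefore a single config file, describe the whole processing pipeline.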

Modeling

Base Model: Experiments used Llama-3-8B-Instruct and Phi-3-mini-128k-instruct

Training Method: Supervised Fine-Tuning (SFT) with LoRA

Adaptation: LoRA (rank=16, alpha=16, dropout=0.1)

LoRA Target Modules: q_proj, k_proj, v_proj (Phi-3); q_proj, v_proj (Llama-3)
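To give a sense of scale for the LoRA settings, the sketch below counts the adapter weights for the Llama-3 case (rank 16 on q_proj and v_proj). The layer shapes are assumptions taken from the public Llama-3-8B model configuration (hidden size 4096, 32 layers, 8 key/value heads of dimension 128); they are not stated in this summary, and the count ignores any bias terms.

```python
def lora_params(d_in, d_out, rank):
    """A LoRA adapter adds two matrices: A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


RANK, ALPHA = 16, 16

# Assumed Llama-3-8B shapes (from the public model config, not from the summary).
HIDDEN = 4096   # model width; q_proj maps HIDDEN -> HIDDEN
KV_DIM = 1024   # 8 key/value heads x 128 dims; v_proj maps HIDDEN -> KV_DIM
LAYERS = 32

per_layer = lora_params(HIDDEN, HIDDEN, RANK) + lora_params(HIDDEN, KV_DIM, RANK)
total = per_layer * LAYERS

# LoRA scales the low-rank update by alpha / rank; with alpha == rank this is 1.
scaling = ALPHA / RANK

print(f"adapter weights per layer: {per_layer:,}")
print(f"adapter weights in total:  {total:,}")
print(f"update scaling factor:     {scaling}")
```

Under these assumed shapes the adapters come to roughly 6.8 million weights, a small fraction of the 8B-parameter base model, which stays frozen during training.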

Training Data:

TriviaQA: 6000 train / 1000 eval
ASQA: 4353 train / 948 eval
PubmedQA: 10000 train / 500 eval

Key Hyperparameters:

learning_rate: 2e-05 (Phi-3 config example) / 1e-4 (Experiments)
batch_size: 1
num_train_epochs: 1
gradient_accumulation_steps: 4
lr_scheduler_type: cosine
optimizer: paged_adamw_8bit
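The hyperparameters above imply a training schedule that can be worked out by hand. The sketch below does the arithmetic under several assumptions that the summary does not confirm: a single device, no warmup, decay to zero, and the 1e-4 learning rate used in the experiments. The exact step count also depends on how a given trainer handles the final partial accumulation batch.

```python
import math

BATCH_SIZE = 1
GRAD_ACCUM = 4
EPOCHS = 1
PEAK_LR = 1e-4
TRAIN_SIZES = {"TriviaQA": 6000, "ASQA": 4353, "PubmedQA": 10000}


def optimizer_steps(n_examples, num_devices=1):
    """Optimizer updates per run, counting a final partial batch as one step."""
    effective_batch = BATCH_SIZE * GRAD_ACCUM * num_devices
    return math.ceil(n_examples / effective_batch) * EPOCHS


def cosine_lr(step, total_steps, peak_lr=PEAK_LR):
    """Cosine decay from peak_lr to zero, with no warmup (an assumption)."""
    return 0.5 * peak_lr * (1 + math.cos(math.pi * step / total_steps))


steps = {name: optimizer_steps(n) for name, n in TRAIN_SIZES.items()}

for name, total in steps.items():
    midpoint = cosine_lr(total // 2, total)
    print(f"{name}: {total} optimizer steps, lr near the midpoint = {midpoint:.2e}")
```

With a per-device batch of 1 and 4 accumulation steps, each optimizer update sees 4 examples, so one epoch over TriviaQA's 6000 training examples corresponds to 1500 updates under these assumptions.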

Compute: Not reported in the paper

Comparison to Prior Work

vs. LlamaIndex/LangChain: RAG Foundry focuses on the *training* and *evaluation* cycle of RAG models, whereas others focus primarily on inference chain construction
vs. RaLLe: RaLLe does not include training capabilities
vs. FlashRAG: RAG Foundry emphasizes extensibility via custom pipeline components and configuration-driven workflows rather than pre-packaged RAG implementations
vs. DSPy [not cited in paper]: DSPy optimizes prompts programmatically; RAG Foundry provides a broader infrastructure for fine-tuning weights alongside data processing

Limitations

Demonstrated on a limited subset of tasks (QA) and datasets
Although the framework is designed to be general, some complex workflows may still require code changes
Faithfulness and Relevancy metrics (RAGAS) did not correlate well with main metrics (EM/F1), suggesting evaluation challenges remain

Reproducibility

Code: https://github.com/IntelLabs/RAGFoundry

The paper provides detailed YAML configuration examples for all modules. The datasets used (TriviaQA, ASQA, PubmedQA) are public, and implementation details for the evaluation metrics are provided.

📊 Experiments & Results

Evaluation Setup

Knowledge-intensive question answering using retrieved context

Benchmarks:

TriviaQA (Reading Comprehension / QA)
ASQA (Long-form Factoid QA)
PubmedQA (Biomedical QA)

Metrics:

Exact Match (EM)
F1 Score
STR-EM (String Exact Match)
Accuracy
Faithfulness (RAGAS)
Relevancy (RAGAS)

Statistical methodology: Not explicitly reported in the paper
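The string-based metrics listed above can be sketched in a few lines. This follows the common SQuAD-style convention of normalizing text (lowercasing, removing punctuation and English articles) before comparing; whether the paper's implementation normalizes in exactly this way is an assumption, and the str_em function is one plausible reading of STR-EM (the share of reference answers found in the generation), not the paper's code.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, references):
    """1.0 if the normalized prediction equals any normalized reference."""
    return float(any(normalize(prediction) == normalize(r) for r in references))


def f1_score(prediction, reference):
    """Token-level F1 between the normalized prediction and reference."""
    pred, ref = normalize(prediction).split(), normalize(reference).split()
    common = sum((Counter(pred) & Counter(ref)).values())
    if common == 0:
        return 0.0
    precision, recall = common / len(pred), common / len(ref)
    return 2 * precision * recall / (precision + recall)


def str_em(generation, answer_sets):
    """Fraction of answer sets with at least one alias appearing in the generation."""
    gen = normalize(generation)
    hits = [any(normalize(alias) in gen for alias in aliases) for aliases in answer_sets]
    return sum(hits) / len(hits)


# Made-up examples for illustration only.
em = exact_match("The Eiffel Tower.", ["Eiffel Tower"])
f1 = f1_score("Paris France", "Paris")
strem = str_em("It opened in 1889 in Paris.", [["1889"], ["London"]])
```

Note that str_em uses plain substring matching, so a short alias can match inside a longer word; a stricter variant would match on token boundaries.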

Key Results

Llama-3-8B results showing the impact of RAG, Fine-tuning, and Chain-of-Thought across datasets.
Benchmark	Metric	Baseline	This Paper	Δ
TriviaQA	EM	0.722	0.916	+0.194
ASQA	STR-EM	0.200	0.422	+0.222
PubmedQA	Accuracy	0.560	0.770	+0.210

Phi-3-mini results demonstrating similar trends, confirming the framework's utility across model sizes.
Benchmark	Metric	Baseline	This Paper	Δ
TriviaQA	EM	0.630	0.923	+0.293
ASQA	STR-EM	0.109	0.386	+0.277
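The Δ column can be rechecked directly from the Baseline and This Paper columns. The short script below does so and also computes the relative gain over the baseline, which is not reported in the table and is added here only as a derived quantity; the variable and function names are arbitrary.

```python
# (model, benchmark, metric, baseline, this_paper), copied from the table above.
RESULTS = [
    ("Llama-3-8B", "TriviaQA", "EM", 0.722, 0.916),
    ("Llama-3-8B", "ASQA", "STR-EM", 0.200, 0.422),
    ("Llama-3-8B", "PubmedQA", "Accuracy", 0.560, 0.770),
    ("Phi-3-mini", "TriviaQA", "EM", 0.630, 0.923),
    ("Phi-3-mini", "ASQA", "STR-EM", 0.109, 0.386),
]


def delta(baseline, score):
    """Absolute improvement, rounded to the table's three decimals."""
    return round(score - baseline, 3)


def relative_gain(baseline, score):
    """Improvement as a fraction of the baseline score."""
    return (score - baseline) / baseline


deltas = [delta(b, s) for _, _, _, b, s in RESULTS]
largest = max(RESULTS, key=lambda row: relative_gain(row[3], row[4]))

for (model, bench, metric, b, s), d in zip(RESULTS, deltas):
    print(f"{model:11s} {bench:9s} {metric:8s} delta=+{d:.3f} "
          f"relative=+{relative_gain(b, s):.0%}")
```

Viewed relatively, the largest gain is on the row with the lowest baseline (Phi-3-mini on ASQA), which is worth keeping in mind when comparing absolute deltas across benchmarks with different metrics.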

Main Takeaways

Fine-tuning with RAG (RAG-sft) consistently improves performance over baselines across all datasets and models
Chain-of-Thought (CoT) fine-tuning (CoT-sft) is highly effective for ASQA but degrades performance on TriviaQA compared to standard RAG fine-tuning, highlighting the need for experiment frameworks to select the right strategy per task
RAGAS metrics (Faithfulness/Relevancy) often do not correlate with main metrics (EM/F1), suggesting they capture different trade-offs in generation quality

📚 Prerequisite Knowledge

Prerequisites

Understanding of Retrieval-Augmented Generation (RAG) components
Familiarity with LLM fine-tuning techniques (e.g., LoRA)
Knowledge of standard NLP evaluation metrics

Key Terms

RAG: Retrieval-Augmented Generation, a technique that enhances LLMs by supplying retrieved external documents at generation time

LoRA: Low-Rank Adaptation, an efficient fine-tuning method that freezes the base model weights and trains small low-rank update matrices added to selected layers

CoT: Chain-of-Thought, a prompting technique that encourages models to reason step by step before answering

STR-EM: String Exact Match, a metric checking whether the reference answer strings appear verbatim in the generated text

RAGAS: A framework for reference-free evaluation of RAG systems using metrics like Faithfulness and Relevancy

Exact Match (EM): A metric checking whether the generated answer is identical to a ground-truth answer, typically after normalizing case and punctuation

Faithfulness: A metric measuring whether the generated answer is factually consistent with the retrieved context

Relevancy: A metric measuring whether the generated answer actually addresses the query

TRL: Transformer Reinforcement Learning, a Hugging Face library commonly used for SFT and RLHF, used here for the training module