RAG vs Fine-tuning: pipelines, tradeoffs, and a case study on agriculture

📝 Paper Summary

Modularized RAG pipeline Benchmark

The paper demonstrates that combining Retrieval-Augmented Generation (RAG) with fine-tuning cumulatively improves accuracy in domain-specific tasks, using a new pipeline for generating industrial agricultural datasets.

Core Problem

General-purpose Large Language Models (LLMs) lack specific, localized knowledge required for specialized industries like agriculture, and the trade-offs between RAG and fine-tuning for addressing this are poorly understood.

Why it matters:

Farmers require location-specific advice (e.g., planting times vary by state) that general models often miss or hallucinate
The industry lacks high-quality, structured training data due to information being locked in complex PDF formats
Developers need guidance on whether to invest in RAG, fine-tuning, or both for industrial applications

Concrete Example: When asked 'What is the best time to plant trees in Arkansas?', GPT-4 gives a generic answer, while an expert names specific seasons (spring or fall). A fine-tuned model draws on knowledge from other geographies, raising answer similarity to the expert from 47% to 72%.

Key Novelty

End-to-End Industrial Data Generation & Optimization Pipeline

Proposes a comprehensive pipeline that extracts structure from PDFs (not just text), generates synthetic Q&A pairs using GPT-4, and uses these to fine-tune models
Conducts a direct quantitative comparison of RAG, Fine-Tuning, and RAG + Fine-Tuning strategies specifically for the agriculture domain across multiple geographies (USA, Brazil, India)

Architecture

The end-to-end pipeline for dataset generation, model training, and evaluation

Evaluation Highlights

Fine-tuning Llama2-13B increases accuracy by over 6 percentage points compared to the base model on agricultural queries
Combining RAG with fine-tuning yields a cumulative effect, increasing accuracy by a further 5 percentage points (total >11 p.p. gain)
Fine-tuning enables the model to leverage information across geographies, increasing answer similarity from 47% to 72% in specific experiments

Breakthrough Assessment

7/10

Provides a valuable, rigorous empirical study on the additive benefits of RAG and fine-tuning in a specific vertical (agriculture), though the underlying techniques (LoRA, RAG) are standard.

⚙️ Technical Details

Problem Definition

Setting: Domain-specific Question Answering using proprietary/external datasets

Inputs: Natural language question about agriculture (optionally with location context)

Outputs: Answer generated by LLM, grounded in specific documents

Pipeline Flow

Data Acquisition (Web scraping)
PDF Extraction (GROBID)
Q&A Generation (Guidance + GPT-4)
RAG / Fine-Tuning (Model adaptation)
Evaluation (GPT-4 based metrics)

System Modules

PDF Extractor (Data Processing)

Convert unstructured PDFs into structured TEI/JSON, preserving section hierarchy

Model or implementation: GROBID
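
To make the "preserving section hierarchy" step concrete, here is a minimal sketch of turning TEI XML into a list of titled sections. It assumes the usual TEI layout of `div` elements with a `head` and `p` children under `body`; the sample document and the `tei_to_sections` helper are hypothetical and are not taken from the paper.

```python
import xml.etree.ElementTree as ET

# TEI elements live in this XML namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical, hand-written TEI fragment used only for illustration.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <div><head>Planting</head><p>Plant in early spring.</p><p>Water weekly.</p></div>
    <div><head>Harvest</head><p>Harvest in late summer.</p></div>
  </body></text>
</TEI>"""


def tei_to_sections(tei_xml: str) -> list[dict]:
    """Return one dict per top-level body section: its title and paragraphs."""
    root = ET.fromstring(tei_xml)
    sections = []
    for div in root.iterfind(".//tei:body/tei:div", TEI_NS):
        head = div.find("tei:head", TEI_NS)
        title = "".join(head.itertext()).strip() if head is not None else ""
        paragraphs = [
            "".join(p.itertext()).strip() for p in div.iterfind("tei:p", TEI_NS)
        ]
        sections.append({"title": title, "paragraphs": paragraphs})
    return sections


sections = tei_to_sections(SAMPLE_TEI)
for section in sections:
    print(section["title"], len(section["paragraphs"]))
```

Keeping each paragraph attached to its section title is what lets a later step generate questions scoped to one section rather than to an undifferentiated text dump.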

Q&A Generator (Data Processing)

Synthesize question-answer pairs from document sections for training

Model or implementation: GPT-4 (controlled via Guidance framework)

Retriever (Inference)

Find relevant text chunks for a given query

Model or implementation: Sentence Transformers + FAISS
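
The retrieval step can be illustrated with a dependency-free sketch. It swaps both named components for stand-ins: hand-made toy vectors replace Sentence Transformers embeddings, and a brute-force cosine-similarity scan replaces the FAISS index. The chunk texts, vectors, and the `top_k` helper are assumptions for illustration only.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def top_k(query_vec, chunk_vecs, k=2):
    """Return (chunk_index, cosine_similarity) for the k most similar chunks."""
    q = normalize(query_vec)
    scored = []
    for idx, vec in enumerate(chunk_vecs):
        c = normalize(vec)
        scored.append((sum(a * b for a, b in zip(q, c)), idx))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(idx, score) for score, idx in scored[:k]]


# Toy data: in a real system these vectors would come from an embedding model.
chunks = ["Plant trees in fall.", "Wheat rust control.", "Irrigation schedules."]
chunk_vecs = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.2], [0.3, 0.0, 0.9]]
query_vec = [1.0, 0.0, 0.1]

hits = top_k(query_vec, chunk_vecs, k=2)
for idx, score in hits:
    print(f"{score:.3f}  {chunks[idx]}")
```

An approximate or exact vector index does the same ranking at scale; the brute-force loop here is only practical for a handful of chunks.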

Generator (Fine-tuned) (Inference)

Generate final answer using query + optional retrieved context

Model or implementation: Llama2-13B / GPT-4
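
As a rough illustration of "query + optional retrieved context", the sketch below assembles a generator prompt with and without retrieved chunks. The template wording, the `location` field, and the `build_prompt` helper are hypothetical; the summary does not give the actual prompt format.

```python
def build_prompt(question, retrieved_chunks=None, location=None):
    """Assemble a prompt; context and location are included only when given."""
    parts = []
    if retrieved_chunks:
        # Number the chunks so an answer can refer back to its sources.
        context = "\n".join(
            f"[{i + 1}] {chunk}" for i, chunk in enumerate(retrieved_chunks)
        )
        parts.append("Context:\n" + context)
    if location:
        parts.append(f"Location: {location}")
    parts.append(f"Question: {question}")
    parts.append("Answer:")
    return "\n\n".join(parts)


# Fine-tuned model without retrieval: question only.
print(build_prompt("What is the best time to plant trees?"))

# RAG setting: the same question with retrieved context and a location.
print(build_prompt(
    "What is the best time to plant trees?",
    retrieved_chunks=["Plant trees in fall.", "Irrigation schedules."],
    location="Arkansas",
))
```

Using one function for both settings keeps the RAG and no-RAG conditions comparable, since the only difference between them is the presence of the context block.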

Modeling

Base Models: Llama2-13B, GPT-4, Llama2-7B, Open-Llama-3b

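Training Method: Supervised Fine-Tuning (SFT), using LoRA for GPT-4 and full-parameter training sharded with FSDP for the Llama models

Adaptation: LoRA (Low-Rank Adaptation) for GPT-4; full-parameter updates for the Llama models, with FSDP used to shard training across GPUs

Trainable Parameters: Low-rank adapters on the attention modules (LoRA) or the full model (Llama models, trained with FSDP)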
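
To show why LoRA trains far fewer parameters than a full update, the sketch below counts parameters for a single weight matrix. LoRA freezes the original matrix of shape d_out x d_in and learns two small matrices of shapes d_out x r and r x d_in. The layer size (4096) and rank (8) are hypothetical values chosen for illustration; the summary does not report the rank or layer dimensions used.

```python
def full_params(d_out, d_in):
    """Parameters updated when the whole weight matrix is trained."""
    return d_out * d_in


def lora_params(d_out, d_in, rank):
    """Parameters in the two low-rank factors: (d_out x rank) and (rank x d_in)."""
    return rank * (d_out + d_in)


# Hypothetical sizes, not taken from the paper.
d_out, d_in, rank = 4096, 4096, 8

full = full_params(d_out, d_in)
lora = lora_params(d_out, d_in, rank)
print(f"full update:  {full:,} parameters")
print(f"LoRA update:  {lora:,} parameters")
print(f"fraction:     {lora / full:.4%}")
```

Under these assumed sizes the adapter holds 1/256 of the matrix's parameters, which is the source of LoRA's memory savings.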

Training Data:

Synthetic Q&A pairs generated from USDA (USA), Embrapa (Brazil), and KVK (India) documents
Dataset includes >23k PDF files for USA, specialized Q&A books for Brazil, and farmer queries for India

Key Hyperparameters:

batch_size: 128 (effective, Llama), 256 (GPT-4)
learning_rate: 2e-5 (Llama), 1e-4 (GPT-4)
epochs: 4
optimizer: Adam
scheduler: Cosine with linear warmup (4% steps)
precision: BFloat16 (AMP)
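
The scheduler entry can be read as the following minimal sketch: a linear ramp to the peak learning rate over the first 4% of steps, then a cosine decay. The peak of 2e-5 is the Llama value listed above; the total step count, the decay floor of zero, and the `lr_at` helper are assumptions, since the summary does not state them.

```python
import math


def lr_at(step, total_steps, peak_lr=2e-5, warmup_frac=0.04, min_lr=0.0):
    """Learning rate at a given step: linear warmup, then cosine decay."""
    warmup_steps = max(1, round(total_steps * warmup_frac))
    if step < warmup_steps:
        # Linear ramp that reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine


# Hypothetical run length of 1000 optimizer steps (40 warmup steps).
for step in (0, 39, 40, 520, 1000):
    print(step, f"{lr_at(step, 1000):.2e}")
```

The warmup avoids large early updates while optimizer statistics are still noisy, and the cosine tail lowers the rate smoothly toward the end of training.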

Compute: Llama training: 8 H100 GPUs. GPT-4 Fine-tuning: 7 nodes with 8 A100 GPUs each (1.5 days).
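
For a sense of scale, the GPT-4 fine-tuning figures above can be converted to GPU-hours. This is plain arithmetic on the reported numbers and assumes all GPUs were in use for the full 1.5 days.

```python
# Reported GPT-4 fine-tuning setup.
nodes = 7
gpus_per_node = 8
days = 1.5

total_gpus = nodes * gpus_per_node
gpu_hours = total_gpus * days * 24

print(f"{total_gpus} A100 GPUs for {days} days = {gpu_hours:.0f} GPU-hours")
```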

Comparison to Prior Work

vs. Standard RAG: Incorporates fine-tuning to bake in domain vocabulary and style
vs. Standard Fine-Tuning: Adds retrieval to ground answers in specific, up-to-date documents
vs. Generic Crawling [not cited in paper]: Uses structured PDF extraction (GROBID) to preserve document hierarchy rather than simple text extraction

Limitations

High cost of fine-tuning GPT-4 (requires significant GPU resources)
Evaluation relies heavily on GPT-4 as a judge, which may have biases
Study focuses primarily on agriculture; generalization to other industries is implied but not tested
Dependence on the quality of the initial PDF extraction and synthetic data generation

Reproducibility

Code is not provided. Datasets are sourced from public agencies (USDA, Embrapa, data.gov.in), but the processed Q&A pairs and trained weights are not linked.

📊 Experiments & Results

Evaluation Setup

Open-ended Question Answering evaluated by GPT-4 based metrics

Benchmarks:

Washington State Agriculture (Domain-specific QA) [New]
Embrapa (Brazil) (Domain-specific QA) [New]
KVK (India) (Farmer Query QA) [New]

Metrics:

Accuracy (GPT-4 assessed)
Relevance (GPT-4 assessed)
Answer Similarity (to expert)
Correctness
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Results demonstrating the cumulative benefits of Fine-Tuning and RAG on accuracy.
Washington State Agriculture (fine-tuning vs. base model)	Accuracy	Not explicitly reported in the paper	Not explicitly reported in the paper	>6 p.p.
Washington State Agriculture (adding RAG to the fine-tuned model)	Accuracy	Not explicitly reported in the paper	Not explicitly reported in the paper	+5 p.p.
Geographic generalization experiment showing fine-tuning helps leverage cross-location knowledge.
Location-specific Queries	Answer Similarity	47%	72%	+25 p.p.
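
Because the change from 47% to 72% is easy to misreport, the short sketch below separates the gain in percentage points from the relative gain. The helper names are hypothetical; the inputs are the answer-similarity figures reported above.

```python
def pp_gain(before_pct, after_pct):
    """Absolute difference between two percentages, in percentage points."""
    return after_pct - before_pct


def relative_gain(before_pct, after_pct):
    """Change expressed as a percentage of the starting value."""
    return (after_pct - before_pct) / before_pct * 100


before, after = 47, 72
print(f"gain: {pp_gain(before, after)} p.p.")
print(f"relative gain: {relative_gain(before, after):.1f}%")
```

The same move from 47% to 72% is a 25 p.p. gain but a relative improvement of roughly 53%, so the unit needs to be stated whenever such deltas are quoted.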

Main Takeaways

RAG and Fine-Tuning are additive: Fine-tuning improves the model's baseline understanding of the domain, while RAG provides specific, context-relevant facts
Structured data extraction from PDFs is critical for generating high-quality synthetic training data
GPT-4 consistently outperforms smaller models, but fine-tuned smaller models (Llama2-13B) show significant improvements
Spatially-scoped fine-tuning allows models to better answer location-specific questions by learning patterns across geographies

📚 Prerequisite Knowledge

Prerequisites

Understanding of RAG architectures
Familiarity with Fine-Tuning techniques (LoRA, QLoRA)
Knowledge of PDF structure extraction challenges

Key Terms


RAG: Retrieval-Augmented Generation, an approach in which the system first retrieves relevant documents and then passes them to the language model as context for generating the answer

Fine-Tuning: The process of training a pre-trained model on a smaller, specific dataset to adapt it to a particular task or domain

LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique that freezes pre-trained weights and injects trainable rank decomposition matrices

FSDP: Fully Sharded Data Parallelism—a training method that shards model parameters across GPUs to reduce memory usage

GROBID: GeneRation Of BIbliographic Data—a machine learning library for extracting structured data (metadata, sections) from scientific PDF documents

TEI: Text Encoding Initiative—a standard format for representing texts in digital form, used here for structured PDF output

FAISS: Facebook AI Similarity Search—a library for efficient similarity search and clustering of dense vectors

BM25: Best Matching 25—a ranking function used in information retrieval to estimate the relevance of documents to a search query based on keyword matching

F1 score: A metric balancing precision and recall

ROUGE: Recall-Oriented Understudy for Gisting Evaluation—a set of metrics used to evaluate automatic summarization and translation

BLEU: Bilingual Evaluation Understudy—a metric for evaluating the quality of text which has been machine-translated from one natural language to another

Guidance framework: A programming paradigm that controls LLM generation by enforcing specific structures on inputs and outputs

Cosine learning rate scheduler: A method to adjust the learning rate during training following a cosine curve

Flash-attention: An algorithm that speeds up attention computation and reduces memory usage in Transformers

p.p.: Percentage points—the arithmetic difference between two percentages