
LaRA: Benchmarking Retrieval-Augmented Generation and Long-Context LLMs -- No Silver Bullet for LC or RAG Routing

K Li, L Zhang, Y Jiang, P Xie, F Huang, S Wang…
Alibaba Group, The Pennsylvania State University
arXiv, February 2025
RAG Benchmark Factuality QA

📝 Paper Summary

Benchmark datasets Modularized RAG pipeline
LaRA is a benchmark comparing Retrieval-Augmented Generation (RAG) against Long-Context (LC) LLMs using naturally occurring texts and practical tasks to determine when each approach is optimal.
Core Problem
Existing benchmarks comparing RAG and LC suffer from insufficient context lengths, data leakage, unreasonable metrics (like F1/EM), and unrealistic truncation, leading to contradictory conclusions about which method is superior.
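To see why n-gram metrics such as F1 and Exact Match can misjudge free-form answers, consider the minimal sketch below. It uses a common token-overlap definition of F1 (an assumption; the summary does not say which variant the criticized benchmarks use), and the question, reference, and prediction strings are made up for illustration.

```python
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, replace punctuation with spaces, and split on whitespace."""
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in text.lower())
    return cleaned.split()


def exact_match(prediction: str, reference: str) -> bool:
    """True only if the normalized token sequences are identical."""
    return normalize(prediction) == normalize(reference)


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: the prediction is correct but phrased as a full sentence.
reference = "1998"
prediction = "The company was founded in 1998."

em_score = exact_match(prediction, reference)
f1_score = token_f1(prediction, reference)
print(em_score, round(f1_score, 3))
```

The prediction contains the right answer, yet Exact Match fails and F1 is low because five of its six tokens do not appear in the reference. An LLM judge that compares meaning rather than surface tokens is not affected by this kind of verbosity.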
Why it matters:
  • Practitioners lack clear guidelines on whether to use costly RAG pipelines or newer 128k+ context windows for specific applications
  • Current evaluations often use truncated texts or artificial datasets, obscuring the true 'lost-in-the-middle' or hallucination tendencies of modern models
  • Conflicting studies (e.g., Xu et al. vs Li et al.) create confusion about the necessity of RAG in the era of long-context models
Concrete Example: In ∞-bench, contexts longer than 128k tokens are truncated in the middle, which often removes the answer entirely. A model that then fails to answer is penalized for the limited context capacity rather than for its reasoning. LaRA avoids this by selecting texts that fit within standard 32k/128k windows, so no truncation is needed.
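The effect of middle truncation can be sketched in a few lines. This is an illustration only: the function name `truncate_middle`, the even head/tail split, and the toy token list are assumptions, not the actual ∞-bench preprocessing.

```python
def truncate_middle(tokens: list[str], max_len: int) -> list[str]:
    """Keep the first and last parts of a sequence and drop the middle.

    Assumes an (approximately) even split between head and tail.
    """
    if len(tokens) <= max_len:
        return list(tokens)
    head = max_len // 2
    tail = max_len - head
    return tokens[:head] + tokens[len(tokens) - tail:]


# Toy document of 200 tokens with the answer placed in the middle.
tokens = [f"w{i}" for i in range(200)]
tokens[100] = "ANSWER"

# Pretend the model's window holds only 128 tokens.
truncated = truncate_middle(tokens, 128)
answer_survives = "ANSWER" in truncated
print(len(truncated), answer_survives)
```

With these toy numbers, positions 64 through 135 are dropped, so the answer at position 100 never reaches the model and no amount of reasoning could recover it.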
Key Novelty
LaRA (Long-context vs. RAG Analysis) Benchmark
  • Uses naturally occurring long texts (novels, papers, financial reports) fitting standard windows (32k/128k) to avoid artificial truncation or concatenation
  • Mitigates data leakage by using recent documents from 2024 and by replacing entities in older novels, with GPT-4o rewriting them consistently throughout the text
  • Employs 'LLM-as-a-judge' evaluation, which shows high agreement with human annotators (measured by Cohen's Kappa), instead of unreliable n-gram metrics such as F1 or Exact Match
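Cohen's Kappa measures how much two annotators agree beyond what chance alone would produce: kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected from each annotator's label frequencies. The sketch below computes it for an LLM judge and a human annotator; the label lists are invented for illustration and are not data from the paper.

```python
from collections import Counter


def cohens_kappa(labels_a: list, labels_b: list) -> float:
    """Cohen's Kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: from each annotator's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    categories = count_a.keys() | count_b.keys()
    expected = sum(count_a[c] * count_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1 - expected)


# Hypothetical verdicts (1 = answer judged correct, 0 = judged incorrect).
judge_labels = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
human_labels = [1, 1, 0, 1, 0, 1, 1, 1, 0, 0]

kappa = cohens_kappa(judge_labels, human_labels)
print(round(kappa, 3))
```

In this toy case the two raters agree on 8 of 10 items (p_o = 0.8) while chance agreement is 0.52, giving a kappa of about 0.583. Raw agreement alone would overstate reliability, which is why a chance-corrected statistic is the more informative check on an LLM judge.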
Evaluation Highlights
  • RAG outperforms Long-Context (LC) by 38.12% accuracy on weaker models (Mistral-Nemo-12B) at 128k length, but LC wins on strong models (GPT-4o)
  • At 128k context length, RAG generally outperforms LC by 3.68% on average across models, reversing the trend seen at 32k where LC led by 2.4%
  • LC excels in reasoning and comparison tasks, while RAG shows significant advantages in detecting hallucinations (refusing to answer)
Breakthrough Assessment
8/10
Provides a much-needed, rigorously designed benchmark that resolves conflicting narratives in the field. The focus on 'natural' lengths and leakage prevention makes it highly practical.