Evaluation Setup
Comparison of RAG (Retrieval-Augmented Generation) and LC (Long Context) across multiple models and context lengths
Benchmarks:
- LaRA: long-context QA covering four task types (location, reasoning, comparison, hallucination detection) [New]
Metrics:
- Accuracy (judged by GPT-4o)
- Statistical methodology: Cohen's kappa, computed to verify agreement between the LLM judge and human annotators
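
As an illustration of the agreement check, Cohen's kappa compares observed agreement p_o with the agreement p_e expected by chance: kappa = (p_o - p_e) / (1 - p_e). The sketch below is a minimal version under assumed conditions: the function name `cohens_kappa` and the binary verdict lists are hypothetical, and the sample verdicts are made up for illustration, not taken from the paper.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items on which both raters agree.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        # Both raters used one identical label throughout.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Hypothetical verdicts (1 = correct, 0 = incorrect); not data from the paper.
judge = [1, 1, 0, 1, 0, 1, 1, 0]
human = [1, 1, 0, 1, 1, 1, 1, 0]
print(round(cohens_kappa(judge, human), 3))
```

A kappa near 1 would indicate that the LLM judge and the human annotators agree far more often than chance alone would predict.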
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Model Strength Analysis: RAG helps weaker models significantly more than strong models. | | | | |
| LaRA (128k context) | Accuracy | Not reported in the paper | Not reported in the paper | +38.12 |
| LaRA (128k context) | Accuracy | Not reported in the paper | Not reported in the paper | +6.48 |
| Context Length Analysis: The advantage shifts from LC to RAG as context length increases. | | | | |
| LaRA (32k context) | Average Accuracy | Not reported in the paper | Not reported in the paper | -2.4 |
| LaRA (128k context) | Average Accuracy | Not reported in the paper | Not reported in the paper | +3.68 |
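
The sign pattern in the Δ column (negative at 32k, positive at 128k) suggests that Δ is RAG accuracy minus LC accuracy in percentage points. This is an assumption inferred from the row headings, not something the summary states. Under that assumption, a minimal sketch of the computation could look like this; the helper names and the verdict lists are hypothetical.

```python
def accuracy(verdicts):
    """Percentage of answers the judge marked correct."""
    return 100.0 * sum(verdicts) / len(verdicts)


def delta(rag_verdicts, lc_verdicts):
    """Assumed convention: positive means RAG is ahead, negative means LC is ahead."""
    return accuracy(rag_verdicts) - accuracy(lc_verdicts)


# Hypothetical judge verdicts for the same four questions; not data from the paper.
rag = [True, True, False, True]
lc = [True, False, False, True]

d = delta(rag, lc)
winner = "RAG ahead" if d > 0 else "LC ahead" if d < 0 else "tie"
print(f"{d:+.2f} points ({winner})")
```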
Main Takeaways
- Optimal choice depends on model strength: weaker models benefit heavily from RAG, while strong models (GPT-4o, Claude-3.5) often perform better with full LC.
- Context length matters: LC is superior at shorter lengths (32k), but RAG takes the lead at very long lengths (128k), which is attributed to the 'lost-in-the-middle' phenomenon in LC.
- Task type is critical: LC excels at reasoning and comparison (integrating information), while RAG is superior at hallucination detection (identifying when info is missing).
- RAG performs comparably to LC on simple 'single-location' retrieval tasks.
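
The takeaways above can be read as a rough routing rule. The sketch below is one possible encoding, not a procedure from the paper: the function `choose_strategy`, the 128k threshold as a hard cutoff, the `"either"` return value, and especially the priority order of the checks are all assumptions made for illustration.

```python
def choose_strategy(task, context_tokens, strong_model, long_threshold=128_000):
    """Pick "RAG", "LC", or "either" from the summary's takeaways.

    task: "location", "reasoning", "comparison", or "hallucination".
    The order of the checks below is an assumed priority, not a reported result.
    """
    if not strong_model:
        return "RAG"     # weaker models benefit heavily from RAG
    if task == "hallucination":
        return "RAG"     # RAG is better at noticing missing information
    if context_tokens >= long_threshold:
        return "RAG"     # RAG has the edge at very long contexts
    if task in ("reasoning", "comparison"):
        return "LC"      # LC is better at integrating information
    return "either"      # single-location retrieval: comparable results


print(choose_strategy("reasoning", 32_000, strong_model=True))
print(choose_strategy("comparison", 32_000, strong_model=False))
```

The takeaways do not say which factor dominates when they conflict (for example, a strong model on a reasoning task at 128k), so any real deployment would need to validate that ordering on its own data.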