Evaluation Setup
Retrieval of text spans from legal documents
Benchmarks:
- LegalBench-RAG (Legal Retrieval) [New]
- LegalBench-RAG-mini (Legal Retrieval) [New]
Metrics:
- Recall (importance implied: retrieval must surface the relevant spans)
- Precision (importance implied: retrieved snippets should be minimal)
- Statistical methodology: Not explicitly reported in the paper
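The summary names recall and precision but does not say how they are computed for text spans. One plausible reading is a character-overlap calculation, sketched below. This is an assumption for illustration, not the paper's confirmed scoring procedure: the `Span` representation (half-open character ranges within a single document) and the helper names `merge_spans`, `overlap_chars`, and `span_precision_recall` are all hypothetical.

```python
from typing import List, Tuple

# Hypothetical representation: half-open character range [start, end) in one document.
Span = Tuple[int, int]


def merge_spans(spans: List[Span]) -> List[Span]:
    """Sort spans and merge overlapping or adjacent ones so no character is counted twice."""
    merged: List[Span] = []
    for start, end in sorted(spans):
        if end <= start:
            continue  # skip empty or inverted spans
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def overlap_chars(a: List[Span], b: List[Span]) -> int:
    """Count characters covered by both span lists (inputs must already be merged)."""
    total = 0
    i = j = 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if hi > lo:
            total += hi - lo
        # Advance whichever span ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return total


def span_precision_recall(retrieved: List[Span], relevant: List[Span]) -> Tuple[float, float]:
    """Character-level precision and recall of retrieved spans against ground-truth spans."""
    r = merge_spans(retrieved)
    g = merge_spans(relevant)
    hit = overlap_chars(r, g)
    retrieved_len = sum(end - start for start, end in r)
    relevant_len = sum(end - start for start, end in g)
    precision = hit / retrieved_len if retrieved_len else 0.0
    recall = hit / relevant_len if relevant_len else 0.0
    return precision, recall


# Retrieving a chunk four times larger than the answer keeps recall at 1.0
# but drops precision, which is why minimal snippets matter.
precision, recall = span_precision_recall([(0, 200)], [(50, 100)])
```

Under this reading, a corpus with many documents would need spans grouped by file before scoring, and per-query scores would still have to be averaged in some way; neither detail is specified in the summary.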
Key Results
The paper presents dataset statistics rather than model performance results.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| LegalBench-RAG | Number of QA Pairs | Not applicable | 6,858 | Not applicable |
| LegalBench-RAG | Corpus Size (Characters) | Not applicable | 79,969,726 | Not applicable |
| LegalBench-RAG-mini | Number of QA Pairs | Not applicable | 776 | Not applicable |
Main Takeaways
- Provides the first specialized benchmark for evaluating retrieval in legal RAG systems
- Enables assessment of precise snippet retrieval, crucial for minimizing costs and hallucinations in legal AI
- Repurposes high-quality, expert-annotated data from an existing reasoning benchmark (LegalBench) for retrieval tasks
- Estimates the cost of replicating the underlying annotations (e.g., CUAD) at roughly $2,000,000, highlighting the value of reusing existing expert data