
LegalBench-RAG: A Benchmark for Retrieval-Augmented Generation in the Legal Domain

N Pipitone, GH Alami
arXiv, 8/2024
RAG Benchmark QA

📝 Paper Summary

Benchmark datasets, Modularized RAG pipeline
LegalBench-RAG converts the LegalBench reasoning dataset into a dedicated retrieval benchmark by tracing context clauses back to their original locations in a 79M-character legal corpus.
Core Problem
Existing legal AI benchmarks such as LegalBench evaluate generation and reasoning over a context that is already supplied, so they never test the retrieval step; general-purpose RAG benchmarks do test retrieval but lack legal-domain nuances.
Why it matters:
  • Legal documents have unique structures and terminologies that general benchmarks cannot adequately assess
  • Retrieving large, imprecise chunks increases processing costs and hallucination risks, whereas legal applications require precise snippets
  • There is a critical gap in evaluating the retrieval component of RAG systems specifically for the legal sector
Concrete Example: In LegalBench, a task asks if a clause grants a license. The input is the clause itself. In a real RAG scenario (and LegalBench-RAG), the system must first find that specific clause within a full agreement (e.g., 'Cardlytics Maintenance Agreement') based on a query like 'Are the licenses granted under this contract non-transferable?'
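The point about large, imprecise chunks can be made concrete with a character-level overlap score. The sketch below is only an illustration: this summary does not state which metric the benchmark uses, so the function `span_precision_recall`, the half-open `(start, end)` span convention, and all offsets are assumptions chosen for the example.

```python
from typing import List, Set, Tuple

# Assumption: a span is a half-open character range [start, end) in one document.
Span = Tuple[int, int]


def covered_chars(spans: List[Span]) -> Set[int]:
    """Return the set of character offsets covered by the given spans."""
    chars: Set[int] = set()
    for start, end in spans:
        chars.update(range(start, end))
    return chars


def span_precision_recall(retrieved: List[Span], gold: List[Span]) -> Tuple[float, float]:
    """Character-level precision and recall of retrieved spans against gold spans."""
    retrieved_chars = covered_chars(retrieved)
    gold_chars = covered_chars(gold)
    overlap = len(retrieved_chars & gold_chars)
    precision = overlap / len(retrieved_chars) if retrieved_chars else 0.0
    recall = overlap / len(gold_chars) if gold_chars else 0.0
    return precision, recall


# Hypothetical offsets: the relevant clause occupies characters 100 to 149.
gold = [(100, 150)]

# A precise snippet and a large chunk both contain the clause,
# but the large chunk is mostly irrelevant text.
precise = span_precision_recall([(100, 150)], gold)
large_chunk = span_precision_recall([(0, 500)], gold)

print("precise snippet:", precise)
print("large chunk:", large_chunk)
```

Under this scoring, both retrievals reach full recall, but the 500-character chunk is penalized on precision because only 50 of its characters belong to the relevant clause.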
Key Novelty
Reverse-engineering a retrieval benchmark from a reasoning benchmark
  • Constructs a retrieval dataset by tracing the context snippets used in LegalBench queries back to their exact character spans in the original source documents
  • Emphasizes precise retrieval of minimal, highly relevant text segments rather than broad document retrieval or large chunks
  • Introduces a lightweight 'mini' version for rapid iteration alongside the full 6,858-pair dataset
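The span-tracing idea in the first bullet can be sketched as an exact substring search. This is a minimal illustration, not the authors' pipeline: the helper `locate_span`, the toy contract text, and the use of exact matching are all assumptions, and real source documents may need whitespace or encoding normalization before a snippet can be matched.

```python
from typing import Optional, Tuple


def locate_span(document: str, snippet: str) -> Optional[Tuple[int, int]]:
    """Return the half-open (start, end) character span of the first exact
    occurrence of snippet in document, or None if it does not occur."""
    start = document.find(snippet)
    if start == -1:
        return None
    return start, start + len(snippet)


# Hypothetical contract text standing in for a full source agreement.
document = (
    "Section 1. Term. This Agreement lasts two years. "
    "Section 2. License. The licenses granted hereunder are non-transferable. "
    "Section 3. Fees. Fees are due within thirty days."
)

# Hypothetical context clause, as it might appear in a reasoning-benchmark item.
snippet = "The licenses granted hereunder are non-transferable."

span = locate_span(document, snippet)
if span is not None:
    start, end = span
    # Slicing the document with the span reproduces the snippet exactly.
    print(span, repr(document[start:end]))
```

Storing the span rather than the text lets a retrieval system be scored on where in the document it looked, independent of how it chunks the text.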
Evaluation Highlights
  • Dataset comprises 6,858 query-answer pairs over a corpus of 79M characters (714 documents)
  • Includes a 'mini' version with 776 queries for rapid experimentation
  • Annotations derived from expert-labeled source datasets (PrivacyQA, CUAD, MAUD, ContractNLI) and manually verified
Breakthrough Assessment
7/10
Significant contribution as the first dedicated legal RAG retrieval benchmark. It fills a clear gap, though it repurposes existing data rather than generating novel legal queries from scratch.