
The Factuality of Large Language Models in the Legal Domain

Rajaa El Hamdani, Thomas Bonald, Fragkiskos Malliaros, Nils Holzenberger, Fabian Suchanek
Télécom Paris, Institut Polytechnique de Paris, CentraleSupélec, University of Paris-Saclay
arXiv (2024)
Factuality · Benchmark · Pretraining · QA

📝 Paper Summary

LLM as Knowledge Base · Legal Domain Adaptation
Evaluates LLMs as legal knowledge bases using flexible matching criteria and abstention strategies, finding that domain-specific pre-training and few-shot examples significantly improve precision over generic models.
Core Problem
General-purpose LLMs hallucinate frequently in the legal domain, and standard exact-match evaluations underestimate their knowledge by penalizing valid verbose or aliased answers.
Why it matters:
  • Legal hallucinations can lead to harmful decisions and sanctions for lawyers (e.g., citing fictitious cases)
  • Strict evaluation metrics (Exact Match) reject legal facts that are correct but phrased differently, which misrepresents model utility
  • Generic LLMs lack knowledge of specific entities (local judges) and relations (case majority opinions) that are crucial for professional legal research
Concrete Example: When asked 'What is the legislation of the case Rummel v. Estelle?', the correct answer is 'United States'. An LLM might answer 'The case... applies to the state of Louisiana, United States'. Exact match marks this wrong, but fuzzy matching correctly identifies it as valid.
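The difference between the two criteria in this example can be sketched in a few lines. This is an illustration only: the summary does not specify the paper's actual fuzzy-matching rule, so `lenient_match` below assumes a simple whole-word containment test over the gold answer and its aliases, and the names `normalize`, `exact_match`, and `lenient_match` are hypothetical.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse runs of whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def exact_match(response: str, gold: str) -> bool:
    """Strict criterion: the whole response must equal the gold answer."""
    return normalize(response) == normalize(gold)


def lenient_match(response: str, gold: str, aliases=()) -> bool:
    """Assumed lenient criterion (not necessarily the paper's): the gold
    answer or one of its aliases appears as a whole-word phrase in the
    response."""
    padded = f" {normalize(response)} "
    candidates = [normalize(c) for c in (gold, *aliases)]
    return any(f" {c} " in padded for c in candidates if c)


response = "The case... applies to the state of Louisiana, United States"
print(exact_match(response, "United States"))    # strict comparison
print(lenient_match(response, "United States"))  # containment comparison
```

Padding both strings with spaces keeps a short alias such as "US" from matching inside an unrelated word like "business".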
Key Novelty
Realistic Legal Factuality Evaluation (LexFact)
  • Introduces 'LexFact', a dataset of 8,920 atomic legal facts (case law and legislation) derived from Wikidata for evaluating LLMs as knowledge bases
  • Implements a realistic evaluation protocol that lets models abstain ('I don't know') to boost precision and uses alias and fuzzy matching to handle verbose legal language
  • Demonstrates that combining domain-specific pre-training (SaulLM) with few-shot abstention prompts drastically reduces hallucinations compared to generic models
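To make the effect of abstention on the metrics concrete, here is a minimal scoring sketch. It assumes precision is computed over answered questions only and recall over all questions, which is a common convention but is not confirmed by this summary. The abstention markers, the function names, and the toy data are all hypothetical.

```python
# Assumed abstention phrases; a real evaluation would need a more robust check.
ABSTAIN_MARKERS = ("i don't know", "i do not know")


def is_abstention(response: str) -> bool:
    """Return True if the response contains one of the assumed markers."""
    lowered = response.lower()
    return any(marker in lowered for marker in ABSTAIN_MARKERS)


def score(responses, golds, match):
    """Return (precision, recall) under the assumed definitions:
    precision = correct / answered, recall = correct / total."""
    answered = 0
    correct = 0
    for response, gold in zip(responses, golds):
        if is_abstention(response):
            continue  # abstentions count against recall, not precision
        answered += 1
        if match(response, gold):
            correct += 1
    total = len(golds)
    precision = correct / answered if answered else 0.0
    recall = correct / total if total else 0.0
    return precision, recall


def contains(response: str, gold: str) -> bool:
    """Toy matcher used only for this illustration."""
    return gold.lower() in response.lower()


# Toy data, not taken from the dataset.
responses = ["United States", "I don't know", "France"]
golds = ["United States", "Texas", "Germany"]
precision, recall = score(responses, golds, contains)
print(precision, recall)
```

Under these definitions, replacing a wrong answer with an abstention raises precision and leaves recall unchanged, while abstaining on a question the model would have answered correctly lowers recall.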
Evaluation Highlights
  • SaulLM achieves 81% precision with few-shot prompting and abstention, compared to just 63% for the base Mistral-7B model
  • Switching from Exact Match to Fuzzy Matching improves reported precision significantly (e.g., SaulLM few-shot jumps from 36% EM to 81% FM)
  • Few-shot prompting corrects pattern errors; e.g., zero-shot models often reversed 'plaintiff v. defendant' roles, which in-context examples fixed
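A few-shot prompt with an abstention example might be assembled as in the sketch below. The instruction wording, the `Q:`/`A:` layout, and the demonstration cases (which use placeholder party names, not real cases) are assumptions for illustration. The paper's actual prompts are not reproduced in this summary.

```python
# Hypothetical demonstrations with placeholder party names. The first two show
# the 'plaintiff v. defendant' ordering; the last one shows abstention.
FEW_SHOT_EXAMPLES = [
    ("Who is the plaintiff in the case Alpha v. Beta?", "Alpha"),
    ("Who is the defendant in the case Alpha v. Beta?", "Beta"),
    ("What is the court of the case Gamma v. Delta?", "I don't know"),
]

# Assumed instruction text, not the paper's wording.
INSTRUCTION = (
    "Answer each question with a short entity name. "
    "If you are not sure of the answer, reply \"I don't know\"."
)


def build_prompt(question: str, examples=FEW_SHOT_EXAMPLES,
                 instruction: str = INSTRUCTION) -> str:
    """Concatenate the instruction, the demonstrations, and the new question."""
    lines = [instruction, ""]
    for example_question, example_answer in examples:
        lines += [f"Q: {example_question}", f"A: {example_answer}", ""]
    lines += [f"Q: {question}", "A:"]
    return "\n".join(lines)


print(build_prompt("Who is the defendant in the case Rummel v. Estelle?"))
```

Passing `examples=[]` yields the zero-shot variant of the same prompt, which makes the two settings easy to compare.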
Breakthrough Assessment
7/10
Strong methodological contribution to realistic evaluation (fuzzy matching plus abstention) and a new domain dataset. Shows significant precision gains from domain-specific pre-training, though abstention trades recall for precision.