The Factuality of Large Language Models in the Legal Domain

📝 Paper Summary

LLM as Knowledge Base, Legal Domain Adaptation

Evaluates LLMs as legal knowledge bases using flexible matching criteria and abstention strategies, finding that domain-specific pre-training and few-shot examples significantly improve precision over generic models.

Core Problem

General-purpose LLMs hallucinate frequently in the legal domain, and standard exact-match evaluations underestimate their knowledge by penalizing valid verbose or aliased answers.

Why it matters:

Legal hallucinations can lead to harmful decisions and sanctions for lawyers (e.g., citing fictitious cases)
Strict evaluation metrics (Exact Match) fail to credit legal facts that are correct but phrased differently, misrepresenting model utility
Generic LLMs lack knowledge of specific entities (e.g., local judges) and relations (e.g., case majority opinions) that are crucial for professional legal research

Concrete Example: When asked 'What is the legislation of the case Rummel v. Estelle?', the correct answer is 'United States'. An LLM might answer 'The case... applies to the state of Louisiana, United States'. Exact match marks this wrong, but fuzzy matching correctly identifies it as valid.

Key Novelty

Realistic Legal Factuality Evaluation (LexFact)

Introduces 'LexFact', a dataset of 8,920 atomic legal facts (case law and legislation) derived from Wikidata for evaluating LLMs as knowledge bases
Implements a realistic evaluation protocol that lets models abstain ('I don't know') to boost precision and uses alias/fuzzy matching to handle verbose legal language
Demonstrates that combining domain-specific pre-training (SaulLM) with few-shot abstention prompts drastically reduces hallucinations compared to generic models

Evaluation Highlights

SaulLM achieves 81% precision with few-shot prompting and abstention, compared to just 63% for the base Mistral-7B model
Switching from Exact Match (EM) to Fuzzy Matching (FM) raises reported precision substantially (e.g., SaulLM with few-shot prompting and abstention rises from 36% under EM to 81% under FM)
Few-shot prompting corrects pattern errors; e.g., zero-shot models often reversed 'plaintiff v. defendant' roles, which in-context examples fixed

Breakthrough Assessment

7/10

Strong methodological contribution in realistic evaluation (fuzzy matching + abstention) and a new domain dataset. Shows significant precision gains from domain-specific pre-training, though these gains come at the cost of recall.

⚙️ Technical Details

Problem Definition

Setting: Querying an LLM as a Knowledge Base (LM-as-KB) for atomic legal facts

Inputs: Natural language question q about a subject-relation pair (s, r)

Outputs: Predicted answer text or an abstention ('I don't know')

Pipeline Flow

Question Generation (Templates applied to Wikidata facts)
Prompt Construction (Zero-shot or Few-shot with Abstention Instruction)
Model Inference (Generate answer or 'I don't know')
Answer Evaluation (Exact, Alias, or Fuzzy Matching)
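
The first two pipeline stages can be sketched as follows. This is a minimal illustration, not the paper's implementation: the template wording, the instruction text, the `Fact` class, and the function names are all assumptions made for the example, and the few-shot example fact is illustrative only.

```python
from dataclasses import dataclass

ABSTAIN = "I don't know"

# Hypothetical question templates keyed by relation name (assumed wording).
TEMPLATES = {
    "jurisdiction": "What is the jurisdiction of the case {subject}?",
}


@dataclass
class Fact:
    """An atomic (subject, relation, answer) fact, e.g. taken from Wikidata."""
    subject: str
    relation: str
    answer: str


def make_question(subject: str, relation: str) -> str:
    """Stage 1: turn a subject-relation pair into a natural language question."""
    return TEMPLATES[relation].format(subject=subject)


def build_prompt(question: str, examples=(), allow_abstain: bool = True) -> str:
    """Stage 2: zero-shot if `examples` is empty, few-shot otherwise."""
    parts = ["Answer the question with a short factual answer."]
    if allow_abstain:
        # Abstention instruction (assumed phrasing).
        parts.append(f'If you are not sure, answer "{ABSTAIN}".')
    for ex in examples:
        parts.append(f"Q: {make_question(ex.subject, ex.relation)}\nA: {ex.answer}")
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)


question = make_question("Rummel v. Estelle", "jurisdiction")
few_shot = [Fact("Miranda v. Arizona", "jurisdiction", "United States")]
prompt = build_prompt(question, examples=few_shot, allow_abstain=True)
```

Passing an empty `examples` sequence gives the zero-shot setting, and `allow_abstain=False` gives the forced-answer setting, so the four experimental conditions are the four combinations of these two arguments.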

System Modules

Prompt Constructor

Formats the input question with optional in-context examples and abstention instructions

Model or implementation: N/A

LLM Inference

Generates the answer based on the prompt

Model or implementation: Various (e.g., SaulLM-7B, Mistral-7B, Llama-3-8B)

Evaluator

Determines correctness of the answer against ground truth

Model or implementation: Rule-based matching

Novel Architectural Elements

Evaluation pipeline integrating Alias and Fuzzy matching with specific post-processing rules for legal jargon (e.g., handling verbose case descriptions)

Modeling

Base Model: SaulLM-7B-Instruct (derived from Mistral-7B)

Training Method: Continued Pre-training on legal corpus (SaulLM specific)

Compute: Experiments are limited to open-source models of up to 8B parameters that can run on a local machine. Training compute for SaulLM is not reported in this paper, which refers to external work.

Comparison to Prior Work

vs. LAMA/KAMEL: Focuses specifically on Legal domain atomic facts (jurisdiction, legislation, case opinions) rather than general knowledge
vs. Standard Evaluation: Incorporates Abstention ('I don't know') and Fuzzy Matching as core metrics, unlike strict Exact Match used in LAMA
vs. RAG approaches [not cited in paper]: Evaluates internal knowledge storage (parametric knowledge) rather than retrieval-augmented capabilities

Limitations

Dataset relies on Wikidata, which may have incomplete coverage of domain-specific legal nuances
Fuzzy matching can be prone to false positives (e.g., answer contains the label but is factually wrong contextually), requiring manual post-processing rules
High precision in SaulLM (81%) comes at the cost of recall due to high abstention rates
Evaluation limited to atomic facts, not complex legal reasoning or case synthesis

Reproducibility

Code: https://github.com/Rajjaa/LexFact

📊 Experiments & Results

Evaluation Setup

Question Answering on 8,920 atomic legal facts derived from Wikidata

Benchmarks:

LexFact (new, proposed in this paper): legal knowledge probing via question answering

Metrics:

Precision (P = Correct / Answered)
Recall (R = Correct / Total Questions)
Abstain Rate
Statistical methodology: Not explicitly reported in the paper
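
The metrics above can be computed from per-question outcomes as in the sketch below. The outcome labels (`"correct"`, `"wrong"`, `"abstain"`) and the function name are assumptions for illustration; the formulas are the ones listed above.

```python
def score(outcomes):
    """Compute precision, recall, and abstain rate.

    `outcomes` holds one entry per question: 'correct', 'wrong', or 'abstain'.
    """
    total = len(outcomes)
    correct = outcomes.count("correct")
    answered = total - outcomes.count("abstain")
    return {
        # P = Correct / Answered (abstentions are excluded from the denominator)
        "precision": correct / answered if answered else 0.0,
        # R = Correct / Total Questions
        "recall": correct / total if total else 0.0,
        "abstain_rate": (total - answered) / total if total else 0.0,
    }


# Toy numbers (not from the paper): the same model with and without abstention.
forced = score(["correct"] * 5 + ["wrong"] * 5)
abstaining = score(["correct"] * 4 + ["wrong"] * 1 + ["abstain"] * 5)
```

In this toy case, abstaining on uncertain questions raises precision from 0.5 to 0.8 while recall drops from 0.5 to 0.4, which is the trade-off the summary describes.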

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Impact of evaluation metrics (Baseline = Exact Match, This Paper = Fuzzy Matching): FM reveals substantially more knowledge than EM.
LexFact	Precision, % (SaulLM, Few-shot + Abstain)	36	81	+45
LexFact	Precision, % (Mistral-7B, Few-shot + Abstain)	8	63	+55
Impact of domain-specific training (Baseline = Mistral-7B, This Paper = SaulLM, legal-trained).
LexFact	Precision, % (Fuzzy Matching)	63	81	+18
Impact of prompting strategy: few-shot examples improve precision.
LexFact	Precision improvement	Not reported as a single aggregate	Not reported as a single aggregate	Positive

Experiment Figures

Precision scores of 8 models under 4 settings: Zero-shot/Few-shot crossed with Forced-Answer/Abstain-Allowed.

Main Takeaways

Exact matching drastically underestimates LLM knowledge; Alias and Fuzzy matching are essential for realistic evaluation of verbose models.
Abstention instructions ('answer I don't know') successfully increase precision across most models, though at the cost of recall.
Domain-specific pre-training (SaulLM) yields the highest factual precision (81%), significantly outperforming general-purpose models like Llama-3 and Mistral.
Few-shot prompting corrects systematic formatting errors (e.g., case title structure) and aligns output types (e.g., answering 'USA' instead of state names for jurisdiction).

📚 Prerequisite Knowledge

Prerequisites

Knowledge of LLM evaluation metrics (Precision/Recall)
Understanding of In-Context Learning (Few-shot prompting)
Basic legal terminology (Case law, Legislation)

Key Terms

LM-as-KB: Language Models as Knowledge Bases, i.e., using an LLM to retrieve facts directly via natural language queries rather than querying a structured database

Exact Match (EM): Evaluation metric where the generated answer must be identical to the ground truth label

Alias Matching (AM): Evaluation metric where the answer is correct if it matches the label or any known alternative names (aliases) from Wikidata

Fuzzy Matching (FM): Evaluation metric where the answer is correct if it *contains* the label or any of its aliases, allowing for verbose responses

SaulLM: A 7-billion parameter language model based on Mistral-7B, further pre-trained on a large corpus of legal documents

Abstention: The ability of a model to refuse to answer ('I don't know') when uncertain, used to increase precision by reducing hallucinations

Zero-shot: Prompting the model with only the question, without providing example question-answer pairs

Few-shot: Prompting the model with the question plus a few (e.g., 5) example question-answer pairs to guide format and context
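
The three matching criteria defined above (EM, AM, FM) can be sketched as below. This is an illustration under stated assumptions rather than the paper's evaluator: the normalization step (lowercasing, stripping punctuation) and the whole-word containment check are choices made for this example, and the alias list is illustrative.

```python
import re


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace (assumed)."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def exact_match(answer: str, label: str) -> bool:
    """EM: the answer equals the ground-truth label (after normalization)."""
    return normalize(answer) == normalize(label)


def alias_match(answer: str, label: str, aliases=()) -> bool:
    """AM: the answer equals the label or any known alias."""
    return any(exact_match(answer, name) for name in (label, *aliases))


def fuzzy_match(answer: str, label: str, aliases=()) -> bool:
    """FM: the answer contains the label or any alias as whole words."""
    padded = f" {normalize(answer)} "
    return any(f" {normalize(name)} " in padded for name in (label, *aliases))


label = "United States"
aliases = ("USA", "US")  # illustrative aliases
verbose = "The case... applies to the state of Louisiana, United States"

em = exact_match(verbose, label)           # False: the answer is verbose
am = alias_match("USA", label, aliases)    # True: matches an alias
fm = fuzzy_match(verbose, label, aliases)  # True: the label is contained
```

Padding with spaces keeps a short alias such as "US" from matching inside an unrelated word like "because". Containment still accepts answers that mention the label in a wrong context, which is the false-positive risk noted under Limitations.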