Fine-tuning Large Language Models for Improving Factuality in Legal Question Answering

📝 Paper Summary

Legal Question Answering · Hallucination Mitigation · Benchmark Construction

This paper introduces a legal hallucination benchmark and a two-stage fine-tuning method combining behavior cloning with hard sample-aware iterative Direct Preference Optimization to improve statute citation accuracy.

Core Problem

General and specialized LLMs frequently hallucinate in legal contexts, fabricating statutes or providing irrelevant citations, and no dedicated metrics exist to measure these specific errors.

Why it matters:

High-stakes legal domains require precise interpretation; fabricated laws or incorrect advice can have severe real-world consequences.
Existing legal LLMs (e.g., LawGPT, LexiLaw) still produce misleading responses, and general benchmarks fail to capture specific legal hallucination types like 'Non-existent Statute' or 'Irrelevant Statute'.
Manual annotation of legal hallucinations is prohibitively expensive and requires domain expertise.

Concrete Example: When asked about a specific legal scenario, GPT-4o-mini might cite a 'Section 302' of a law when it should be 'Section 303' (Wrong Number), or invent an 'Article 15 on Data Privacy' that does not exist (Non-existent Statute), leading to invalid legal advice.

Key Novelty

Hard Sample-aware Iterative DPO (HIPO) & LegalHalBench

HIPO iteratively filters training data by removing 'easy' samples (where the model already cites statutes correctly), forcing the model to learn from increasingly difficult negative examples.
LegalHalBench is the first benchmark specifically categorizing five distinct types of legal hallucinations (e.g., Wrong Statute Name, Irrelevant Statute) with automated metrics to detect them.
Uses a two-step automated data curation pipeline involving GPT-4-turbo to generate high-quality questions and answers grounded in actual legal provisions, reducing manual annotation costs.

Architecture

The overall framework including the automated dataset curation pipeline and the two-stage training process (SFT + HIPO).

Evaluation Highlights

Achieves 38.35% Non-Hallucinated Statute Rate (NHSR), significantly outperforming GPT-4o (29.28%) and Llama-3-8B-Instruct (13.63%).
Improves Statute Relevance Rate by 37.13% compared to the vanilla base model.
Achieves a dominant win rate in helpfulness evaluation against existing legal LLMs (LawGPT, LexiLaw) and general models.

Breakthrough Assessment

7/10

Strong contribution in domain-specific hallucination definition and benchmarking. The HIPO method is a solid advancement in preference learning, though the core innovation is the application logic rather than a new fundamental architecture.

⚙️ Technical Details

Problem Definition

Setting: Generative Legal Question Answering where the model must cite relevant statutes and provide truthful claims.

Inputs: Natural language legal question q

Outputs: Answer containing citations of legal statutes and legal analysis/claims

Pipeline Flow

Data Curation: Real questions + GPT-4-turbo generated answers → Statute Verification → Provision-based Expansion
Stage 1: Supervised Fine-Tuning (Behavior Cloning) on curated data
Stage 2: HIPO (Iterative DPO) loop
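The Stage 2 loop can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `run_hipo`, `generate`, `is_easy`, and `train_step` are hypothetical names, and using the reference answer as the "chosen" response with the model's own output as "rejected" is an assumption made for the sketch.

```python
from typing import Callable, List, Tuple

# (question, reference_answer) pairs; all names here are illustrative.
Sample = Tuple[str, str]
# (question, chosen, rejected) preference triples.
Pair = Tuple[str, str, str]


def run_hipo(
    data: List[Sample],
    generate: Callable[[str], str],
    is_easy: Callable[[str, str], bool],
    train_step: Callable[[List[Pair]], None],
    iterations: int = 3,
) -> List[int]:
    """Run the iterative loop and return the number of hard pairs per round."""
    hard_counts = []
    for _ in range(iterations):
        pairs = []
        for question, reference in data:
            response = generate(question)
            # Skip samples the current model already handles well.
            if is_easy(response, reference):
                continue
            # Assumption: reference is "chosen", model output is "rejected".
            pairs.append((question, reference, response))
        # One round of preference training on the hard pairs only.
        train_step(pairs)
        hard_counts.append(len(pairs))
    return hard_counts
```

Passing the generator, the easy-sample test, and the trainer as callables keeps the sketch independent of any particular training library.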

System Modules

Data Curator

Constructs training data by refining CAIL2018 questions and generating new QAs from raw provisions

Model or implementation: GPT-4-turbo (used as generator)

SFT Model

Initializes the model to follow instructions and cite statutes

Model or implementation: Llama-3-8B-Instruct (base)

HIPO Trainer

Iteratively aligns the model to avoid hallucinations using hard negative sampling

Model or implementation: SFT-adapted model

Novel Architectural Elements

Hard Sample-aware filtering mechanism within the DPO loop: specifically excludes training pairs where the rejected answer has no hallucinations and high semantic similarity to the ground truth.
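One way to read this filtering rule is as a simple predicate over per-response scores. The sketch below is illustrative only: the `Candidate` type, the function names, and the `sim_threshold` value of 0.9 are assumptions, since the summary does not state the threshold the paper uses.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    nhsr: float        # share of cited statutes judged non-hallucinated, in [0, 1]
    similarity: float  # semantic similarity to the ground truth, e.g. BERTScore F1


def is_easy(c: Candidate, sim_threshold: float = 0.9) -> bool:
    """True when the response has no hallucinated statute and is close to the reference.

    The 0.9 default is a placeholder, not a value from the paper.
    """
    return c.nhsr >= 1.0 and c.similarity >= sim_threshold


def keep_hard(cands: List[Candidate], sim_threshold: float = 0.9) -> List[Candidate]:
    """Drop easy candidates so that only hard samples remain for DPO training."""
    return [c for c in cands if not is_easy(c, sim_threshold)]
```

A response is dropped only when both conditions hold, so a citation-perfect answer that drifts semantically from the reference still counts as hard.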

Modeling

Base Model: Llama-3-8B-Instruct

Training Method: Two-stage: (1) SFT, (2) Hard Sample-aware Iterative DPO (HIPO)

Objective Functions:

Purpose: SFT Stage - Minimize negative log-likelihood of target tokens.

Formally: $\mathcal{L}_{\mathrm{SFT}} = -\sum_{t} \log P(y_t \mid X, y_{<t})$

Purpose: HIPO Stage - Combine DPO loss with NLL loss on chosen responses to maintain generation quality.

Formally: $\mathcal{L}_{\mathrm{HIPO}} = \mathcal{L}_{\mathrm{DPO}} + \alpha \, \mathcal{L}_{\mathrm{NLL}}$
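To make the combined objective concrete, the sketch below computes it for a single preference pair from sequence log-probabilities. Several things are assumptions: the DPO term is the standard formulation from the DPO literature (the summary does not reproduce the paper's exact form), the NLL term is length-normalized, and the default `alpha=1.0` is a placeholder because the paper does not report alpha. The function names are hypothetical.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one pair, given sequence log-probabilities
    under the policy (pol_*) and the frozen reference model (ref_*)."""
    margin = beta * ((pol_chosen - ref_chosen) - (pol_rejected - ref_rejected))
    # -log(sigmoid(margin)) equals softplus(-margin).
    return softplus(-margin)


def hipo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected,
              chosen_len, alpha=1.0, beta=0.1):
    """DPO loss plus alpha times the NLL of the chosen response.

    Assumptions: NLL is averaged over chosen_len tokens, and alpha=1.0
    is a placeholder (the paper does not report its value).
    """
    nll = -pol_chosen / chosen_len
    return dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta) + alpha * nll
```

When the policy equals the reference model, the margin is zero and the DPO term reduces to log 2, which is a convenient sanity check at the start of training.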

Training Data:

12,149 samples from CAIL2018 refined by GPT-4-turbo
3,883 samples generated from raw legal provisions via GPT-4-turbo

Key Hyperparameters:

learning_rate: 5e-6 (SFT), 1e-6 (HIPO)
batch_size: 128 (SFT), 64 (HIPO)
epochs: 3 (SFT), 3 iterations (HIPO)
beta: 0.1 (DPO parameter)
max_length: 2048
alpha: Not reported in the paper (weight for NLL loss in HIPO)
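For reference, the reported values can be collected into a small configuration object. The class and field names below are hypothetical, and applying `max_length` to both stages is an assumption, since the summary lists it without naming a stage. `alpha` is left unset because the paper does not report it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SFTConfig:
    learning_rate: float = 5e-6
    batch_size: int = 128
    epochs: int = 3
    max_length: int = 2048  # assumption: applies to both stages


@dataclass(frozen=True)
class HIPOConfig:
    learning_rate: float = 1e-6
    batch_size: int = 64
    iterations: int = 3
    beta: float = 0.1              # DPO parameter
    max_length: int = 2048         # assumption: applies to both stages
    alpha: Optional[float] = None  # NLL weight, not reported in the paper


sft_cfg = SFTConfig()
hipo_cfg = HIPOConfig()
```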

Compute: 8 NVIDIA A800 GPUs

Comparison to Prior Work

vs. LawGPT/LexiLaw: These rely on continual pre-training/SFT. HIPO adds an iterative preference learning stage specifically targeting hallucination.
vs. Standard DPO: HIPO filters the preference data to remove 'easy' samples (where the model is already correct), focusing optimization on 'hard' samples where hallucinations persist.
vs. RAG-based methods: This paper focuses on 'internalizing' knowledge via training rather than retrieval, though RAG is acknowledged as a parallel approach.

Limitations

The method relies on proprietary models (GPT-4-turbo) for data curation and evaluation, which incurs cost and dependency.
The hard sample filtering threshold and similarity metrics rely on heuristics (NHSR, BERTScore) which may not capture all nuances of legal reasoning.
Evaluation is limited to Chinese legal contexts (implied by CAIL2018 source and examples, though not explicitly restricted in methodology).

Reproducibility

Code: https://github.com/YinghaoHu/LegalHalBench

The code and the LegalHalBench dataset are publicly available at https://github.com/YinghaoHu/LegalHalBench. The Appendix of the paper details the prompts used for data generation and metric calculation. The base model (Llama-3-8B-Instruct) is open-source.

📊 Experiments & Results

Evaluation Setup

Legal Question Answering evaluated on the custom LegalHalBench.

Benchmarks:

LegalHalBench (Legal QA with statute citation) [New]

Metrics:

Non-Hallucinated Statute Rate (NHSR)
Statute Relevance Rate (Rel)
Legal Claim Truthfulness (T_LC)
Structure-aware Hallucination Rate (SHR)
General NLP metrics: ROUGE-L, BLEU-1, METEOR, BERTScore
Statistical methodology: Not explicitly reported in the paper
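As an illustration of the first metric, NHSR can be read as the fraction of cited statutes that match a reference corpus in name, number, and content. The sketch below is an assumption-laden simplification: the `Citation` type, the lookup table, the whitespace-insensitive exact match on content, and the choice to return 0.0 for answers with no citations are all hypothetical, and the paper's automated metric may judge content differently. The statute entries are placeholders, not real legal text.

```python
from typing import Dict, List, NamedTuple, Tuple


class Citation(NamedTuple):
    law: str      # statute name
    article: str  # article or section number
    content: str  # quoted provision text


def _norm(text: str) -> str:
    """Collapse whitespace so formatting differences are ignored."""
    return " ".join(text.split())


def is_non_hallucinated(c: Citation, corpus: Dict[Tuple[str, str], str]) -> bool:
    """A citation counts only if name, number, and content all match the corpus."""
    reference = corpus.get((c.law, c.article))
    return reference is not None and _norm(c.content) == _norm(reference)


def nhsr(citations: List[Citation], corpus: Dict[Tuple[str, str], str]) -> float:
    """Fraction of citations that are fully accurate (0.0 if none are cited)."""
    if not citations:
        return 0.0
    return sum(is_non_hallucinated(c, corpus) for c in citations) / len(citations)


# Placeholder corpus and citations for illustration only.
corpus = {("Example Act", "Article 1"): "Placeholder provision text."}
cited = [
    Citation("Example Act", "Article 1", "Placeholder  provision text."),  # correct
    Citation("Example Act", "Article 2", "Placeholder provision text."),   # wrong number
    Citation("Invented Act", "Article 1", "Something else."),              # non-existent
]
score = nhsr(cited, corpus)
```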

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
LegalHalBench	Non-Hallucinated Statute Rate (NHSR)	29.28 (GPT-4o)	38.35	+9.07
LegalHalBench	Non-Hallucinated Statute Rate (NHSR)	13.63 (Llama-3-8B-Instruct)	38.35	+24.72
LegalHalBench	Statute Relevance Rate (Rel)	5.36	8.55	+3.19
LegalHalBench	Legal Claim Truthfulness (T_LC)	7.92	8.44	+0.52
LegalHalBench	Non-Hallucinated Statute Rate (NHSR)	33.91	38.35	+4.44
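The Δ column is the absolute difference between the "This Paper" and "Baseline" columns. The short check below recomputes it from the values reported in the table; the variable names are illustrative.

```python
# (metric, baseline score, this paper's score), copied from the table above.
rows = [
    ("NHSR", 29.28, 38.35),
    ("NHSR", 13.63, 38.35),
    ("Rel", 5.36, 8.55),
    ("T_LC", 7.92, 8.44),
    ("NHSR", 33.91, 38.35),
]

# Absolute improvement per row, rounded to two decimals as in the table.
deltas = [round(ours - baseline, 2) for _, baseline, ours in rows]

for (metric, baseline, ours), delta in zip(rows, deltas):
    print(f"{metric}: {baseline} -> {ours} (delta {delta:+.2f})")
```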

Main Takeaways

SFT alone significantly improves legal citation capabilities compared to base models, but HIPO further refines this by specifically targeting hallucinated responses.
The proposed method outperforms specialized legal LLMs (such as ChatLaw and LawGPT), which themselves often score below general SOTA models (GPT-4) on the strict NHSR metric.
Iterative training is crucial; the model's ability to discriminate between correct and hallucinated statutes improves as 'easy' correct samples are filtered out.

📚 Prerequisite Knowledge

Prerequisites

Understanding of LLM fine-tuning (SFT)
Reinforcement Learning from Human Feedback (RLHF) concepts
Direct Preference Optimization (DPO)

Key Terms

DPO: Direct Preference Optimization—a method to align language models with preferences by optimizing a classification loss on preference pairs rather than training a reward model.

SFT: Supervised Fine-Tuning—training a model on a labeled dataset of high-quality instruction-response pairs.

HIPO: Hard Sample-aware Iterative Direct Preference Optimization—the authors' proposed method that iteratively selects difficult negative samples for DPO training.

NHSR: Non-Hallucinated Statute Rate—a metric measuring the proportion of cited statutes that are entirely accurate in name, number, and content.

BERTScore: A metric for evaluating text generation by computing token similarity using contextual embeddings.

Behavior Cloning: In this context, refers to the initial Supervised Fine-Tuning (SFT) stage where the model learns to mimic the provided high-quality legal answers.

NLL loss: Negative Log-Likelihood loss—the standard loss function used to train language models to predict the next token.