Measuring and enhancing trustworthiness of LLMs in RAG through grounded attributions and learning to refuse

📝 Paper Summary

RAG Evaluation, Hallucination Suppression

The paper introduces Trust-Score, a metric decoupling LLM performance from retrieval quality to measure true groundedness, and Trust-Align, a preference optimization method that significantly reduces hallucinations and improves refusal behavior.

Core Problem

Current RAG evaluations conflate retrieval quality with LLM generation quality and often reward models for answering correctly using parametric knowledge rather than retrieved documents.

Why it matters:

Reliable RAG systems must base answers solely on retrieved documents to avoid hallucinations, yet current metrics fail to distinguish between document-grounded answers and lucky guesses based on pre-training data.
Existing prompting methods (like in-context learning) make models overly sensitive, leading to either exaggerated refusals or excessive responsiveness, failing to balance the two.

Concrete Example: A model might correctly answer a question about a specific event using its internal training data even if the retrieved documents are irrelevant. Standard metrics (like RAGAS) would score this highly, masking the fact that the model hallucinated the connection to the provided documents and failed the core RAG task of grounding.

Key Novelty

Trust-Score Metric and Trust-Align Framework

Trust-Score: A composite metric that specifically isolates the LLM's ability to ground answers in documents by measuring correct refusals (when documents are irrelevant), answer correctness limited by document content, and citation accuracy.
Trust-Align: An alignment framework using Direct Preference Optimization (DPO) on a custom dataset of 19K samples that explicitly pairs positive (grounded) responses against negative ones containing specific error types like over-responsiveness or bad citations.

Architecture

The Trust-Align pipeline involving dataset creation and DPO alignment.

Evaluation Highlights

+12.56 points in Trust-Score on ASQA for LLaMA-3-8b using Trust-Align compared to the FRONT baseline (45.74 to 58.30).
+47.95 points in Grounded Refusals (F1_GR) on QAMPARI for LLaMA-3-8b using Trust-Align compared to FRONT (49.50 to 97.45).
+38.35 points in Groundedness of Citations (F1_GC) on QAMPARI for LLaMA-3-8b using Trust-Align compared to FRONT (42.94 to 81.29).

Breakthrough Assessment

8/10

Significant contribution in decoupling LLM evaluation from retriever performance. The proposed alignment method shows massive gains in refusal and citation quality across multiple benchmarks.

⚙️ Technical Details

Problem Definition

Setting: Given a question q and retrieved documents D, generate a response S with inline citations C, or a refusal if D is insufficient.

Inputs: Question q, Set of retrieved documents D

Outputs: Response S containing citation-grounded statements {s1, ..., sn} or a refusal statement.
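
The input/output contract above can be pictured as a small data structure. This is only an illustrative sketch: the class names, the field names, and the rule that every statement must carry at least one in-range citation are assumptions made for the example, not definitions taken from the paper.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Statement:
    """One statement s_i together with the documents it cites."""
    text: str
    citations: List[int] = field(default_factory=list)  # indices into D


@dataclass
class Response:
    """Response S: either a list of cited statements or a refusal."""
    statements: List[Statement] = field(default_factory=list)
    refusal: bool = False

    def is_valid(self, num_docs: int) -> bool:
        # Assumption: a refusal carries no statements.
        if self.refusal:
            return not self.statements
        # Assumption: an answer needs at least one statement, and every
        # statement needs at least one citation pointing into D.
        return bool(self.statements) and all(
            s.citations and all(0 <= c < num_docs for c in s.citations)
            for s in self.statements
        )


documents = ["first retrieved passage", "second retrieved passage"]
answer = Response(statements=[Statement("A grounded claim.", [0, 1])])
refusal = Response(refusal=True)
print(answer.is_valid(len(documents)), refusal.is_valid(len(documents)))
```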

Pipeline Flow

Input: Question + Documents
LLM Generation (Answer or Refuse)
Evaluation via Trust-Score (Metric Calculation)

System Modules

LLM Generator

Generates the response with citations or a refusal message.

Model or implementation: Various (LLaMA-3-8b, Qwen-2.5-7b, Phi-3.5, etc.)

Metric Calculator

Computes Trust-Score components (Refusal, Correctness, Citation).

Model or implementation: NLI Model (for citation verification) + Scripted Logic
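
To make the role of the NLI model concrete, the sketch below checks whether each statement is entailed by the documents it cites. It is a simplified support rate, not the paper's F1_GC formula. The NLI model is abstracted as a caller-supplied `entails(premise, hypothesis)` function, and `toy_entails` is a hypothetical substring stand-in used only so the example runs without a model.

```python
from typing import Callable, List, Sequence, Tuple


def citation_support_rate(
    statements: Sequence[Tuple[str, List[int]]],
    documents: Sequence[str],
    entails: Callable[[str, str], bool],
) -> float:
    """Fraction of statements entailed by the concatenation of their cited docs."""
    if not statements:
        return 0.0
    supported = 0
    for text, cited in statements:
        if not cited:
            continue  # an uncited statement counts as unsupported
        premise = " ".join(documents[i] for i in cited)
        if entails(premise, text):
            supported += 1
    return supported / len(statements)


def toy_entails(premise: str, hypothesis: str) -> bool:
    # Hypothetical stand-in for an NLI model: plain substring match.
    return hypothesis.lower() in premise.lower()


docs = ["Paris is the capital of France.", "Berlin is in Germany."]
claims = [
    ("Paris is the capital of France", [0]),  # supported by document 0
    ("Berlin is in Spain", [1]),              # not supported by document 1
]
rate = citation_support_rate(claims, docs, toy_entails)
print(rate)
```

Passing the entailment check in as a function keeps the scoring logic independent of whichever NLI model is used, which also makes the dependence noted under Limitations explicit.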

Novel Architectural Elements

Trust-Align pipeline: Construction of a specific 19K DPO dataset covering 5 hallucination types (Inaccurate Answer, Over-Responsiveness, Excessive Refusal, Overcitation, Improper Citation) to align the Generator.
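
One way to picture a sample in such a dataset is a preference pair tagged with the error type its negative response exhibits. The five type names come from the description above; the record layout and the `prompt`, `chosen`, and `rejected` field names are assumptions for illustration, and the example strings are invented.

```python
from dataclasses import dataclass
from enum import Enum


class HallucinationType(Enum):
    """The five error types covered by the negative responses."""
    INACCURATE_ANSWER = "Inaccurate Answer"
    OVER_RESPONSIVENESS = "Over-Responsiveness"
    EXCESSIVE_REFUSAL = "Excessive Refusal"
    OVERCITATION = "Overcitation"
    IMPROPER_CITATION = "Improper Citation"


@dataclass(frozen=True)
class PreferencePair:
    """Hypothetical record layout for one DPO training sample."""
    prompt: str      # question plus retrieved documents
    chosen: str      # grounded (positive) response
    rejected: str    # response exhibiting one error type
    error_type: HallucinationType


example = PreferencePair(
    prompt="Question: ...\nDocuments: [1] ... [2] ...",
    chosen="The documents do not contain enough information to answer.",
    rejected="An answer produced from memory with no support in the documents.",
    error_type=HallucinationType.OVER_RESPONSIVENESS,
)
print(example.error_type.value)
```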

Modeling

Base Model: LLaMA-3-8b-Instruct (primary), also tested on LLaMA-3-1b/3b, Qwen-2.5-0.5b/1.5b/3b/7b, Phi-3.5-mini

Training Method: Direct Preference Optimization (DPO)

Objective Functions:

Purpose: Align model to prefer grounded/correct responses over hallucinated/unsupported ones.

Formally: Standard DPO loss maximizing likelihood of preferred response relative to reference model.
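
For reference, the standard DPO loss for a single preference pair can be written in a few lines. This is a minimal sketch of the textbook formula, not the authors' training code. The function and argument names are made up for the example, and the inputs are assumed to be summed log-probabilities of each full response under the policy and the frozen reference model.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """-log(sigmoid(beta * (chosen log-ratio - rejected log-ratio)))."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# When the policy equals the reference, the loss is log(2).
print(dpo_loss(-10.0, -10.0, -10.0, -10.0))
# The loss shrinks once the policy favors the chosen response.
print(dpo_loss(-5.0, -15.0, -10.0, -10.0))
```

The default `beta` of 0.1 matches the value listed under Key Hyperparameters. A larger beta penalizes drift from the reference model more strongly.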

Adaptation: Not stated explicitly. Full fine-tuning is assumed, since DPO is applied to open-weight models, but a parameter-efficient method such as LoRA is not ruled out.

Training Data:

19K samples derived from ASQA, QAMPARI, ELI5 training sets.
Includes positive (grounded) and negative (hallucinated) pairs.

Key Hyperparameters:

beta: 0.1
learning_rate: 5e-7
batch_size: 64
epochs: 2

Compute: Not reported in the paper

Comparison to Prior Work

vs. RAGAS/ARES: Trust-Score explicitly accounts for refusal capabilities and decouples retrieval quality from LLM quality.
vs. FRONT: Trust-Align uses DPO with specific negative mining for 5 hallucination types, whereas FRONT focuses on rejection sampling.
vs. Self-RAG: Trust-Align does not require special tokens or separate critique heads; it aligns the base generation directly.

Limitations

The Answer Correctness metric can yield mixed results because correcting 'Over-Responsiveness' (making the model refuse more) naturally reduces the raw count of questions answered, even if they were answered correctly via parametric knowledge.
Reliance on an NLI model for citation verification means the metric is bounded by the NLI model's performance.
The approach focuses on 'complete groundedness' (IR task), which might differ from applications where mixing parametric knowledge is desired.

Reproducibility

Code: https://github.com/declare-lab/trust-align

The code is publicly available. The summary implies, but does not confirm, that the alignment dataset (19K samples) and the evaluation scripts are part of the release.

📊 Experiments & Results

Evaluation Setup

RAG evaluation on open-domain QA datasets using provided (gold or retrieved) documents to isolate LLM performance.

Benchmarks:

ASQA (ambiguous factoid QA, long-form)
QAMPARI (List-based QA)
ELI5 (Long-form Explanatory QA)

Metrics:

Trust-Score (Primary)
F1_GR (Grounded Refusals)
F1_AC (Answer Correctness)
F1_GC (Groundedness of Citations)
Statistical methodology: Not explicitly reported in the paper
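
The sketch below shows how the three components could be combined. This summary describes Trust-Score as an average of the three F1 components. `refusal_f1` is a plain binary F1 over the refusal decision and is an assumption for illustration, since the paper's exact F1_GR definition may combine refusal and answer classes differently. The input numbers are invented.

```python
from typing import Sequence


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def refusal_f1(should_refuse: Sequence[bool], did_refuse: Sequence[bool]) -> float:
    """Binary F1 on the refusal class (simplified stand-in for F1_GR)."""
    pairs = list(zip(should_refuse, did_refuse))
    tp = sum(1 for gold, pred in pairs if gold and pred)
    fp = sum(1 for gold, pred in pairs if not gold and pred)
    fn = sum(1 for gold, pred in pairs if gold and not pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return f1(precision, recall)


def trust_score(f1_gr: float, f1_ac: float, f1_gc: float) -> float:
    """Unweighted mean of the three component scores."""
    return (f1_gr + f1_ac + f1_gc) / 3


# Invented example: two unanswerable and two answerable questions.
gold = [True, True, False, False]   # documents insufficient?
pred = [True, False, False, True]   # model refused?
print(refusal_f1(gold, pred))
print(trust_score(90.0, 60.0, 75.0))
```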

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Trust-Align consistently outperforms the strong baseline (FRONT) on the aggregate Trust-Score metric across all three datasets for LLaMA-3-8b.
ASQA	Trust-Score	45.74	58.30	+12.56
QAMPARI	Trust-Score	31.95	67.99	+36.04
ELI5	Trust-Score	45.03	62.72	+17.69
Drilling down into sub-metrics reveals that Trust-Align dramatically improves the model's ability to refuse unanswerable questions (Grounded Refusals).
ASQA	F1_GR (Grounded Refusals)	57.75	81.62	+23.87
QAMPARI	F1_GR (Grounded Refusals)	49.50	97.45	+47.95
ELI5	F1_GR (Grounded Refusals)	51.27	97.04	+45.77
Citation quality (Groundedness of Citations) also sees consistent improvement, indicating the model is not just answering correctly but attributing correctly.
ASQA	F1_GC (Citation Groundedness)	69.17	91.29	+22.12
QAMPARI	F1_GC (Citation Groundedness)	42.94	81.29	+38.35
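
As a quick consistency check, the Δ column can be recomputed from the Baseline and This Paper columns. The rows below are copied from the table, and the variable names are illustrative.

```python
# (benchmark, metric, baseline, this_paper, reported_delta)
rows = [
    ("ASQA", "Trust-Score", 45.74, 58.30, 12.56),
    ("QAMPARI", "Trust-Score", 31.95, 67.99, 36.04),
    ("ELI5", "Trust-Score", 45.03, 62.72, 17.69),
    ("ASQA", "F1_GR", 57.75, 81.62, 23.87),
    ("QAMPARI", "F1_GR", 49.50, 97.45, 47.95),
    ("ELI5", "F1_GR", 51.27, 97.04, 45.77),
    ("ASQA", "F1_GC", 69.17, 91.29, 22.12),
    ("QAMPARI", "F1_GC", 42.94, 81.29, 38.35),
]

# Collect any row whose reported delta differs from this_paper - baseline.
mismatches = [
    (bench, metric)
    for bench, metric, base, ours, delta in rows
    if round(ours - base, 2) != delta
]
print(mismatches)  # prints [] because every reported delta matches
```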

Experiment Figures

A diagram of the Trust-Score metric calculation.

Main Takeaways

Prompting methods (like in-context learning) are ineffective for RAG groundedness: they push models toward either exaggerated refusal or excessive responsiveness and fail to balance the two.
Trust-Align enables 26 out of 27 model configurations to substantially outperform competitive baselines, showing robustness across model families (LLaMA, Qwen, Phi) and sizes.
While Trust-Align improves refusal and citation metrics dramatically, Answer Correctness shows mixed results. This is partly due to 'gamification' in baselines where models use parametric knowledge to answer unanswerable questions (inflating scores), which Trust-Align correctly suppresses.

📚 Prerequisite Knowledge

Prerequisites

Retrieval-Augmented Generation (RAG)
Direct Preference Optimization (DPO)
NLI (Natural Language Inference) for entailment checking
F1 Score calculations

Key Terms

Parametric Knowledge: Information stored in the model's pre-trained weights/parameters rather than provided in the context.

Groundedness: The extent to which a model's response is derived solely from the provided retrieved documents.

Trust-Score: The proposed holistic metric averaging Grounded Refusals, Answer Correctness, and Groundedness of Citations.

DPO: Direct Preference Optimization—an algorithm for fine-tuning LLMs to align with human preferences using pairs of preferred and dispreferred outputs.

Hallucination (in RAG): Errors where the model invents information, fails to use documents, refuses when it shouldn't, or cites incorrectly.

Answerability: Whether the provided documents D contain sufficient information to answer question q.

NLI: Natural Language Inference—a task determining if a premise entails a hypothesis; used here to verify if a cited document actually supports a claim.

ASQA: Answer Summaries for Questions which are Ambiguous, a QA dataset of ambiguous factoid questions that require long-form answers.

QAMPARI: A QA benchmark requiring answers that consist of lists of entities.

ELI5: Explain Like I'm 5—a long-form QA dataset requiring detailed explanations.