Chain of Natural Language Inference for Reducing Large Language Model Ungrounded Hallucinations

📝 Paper Summary

Hallucination Detection, Hallucination Mitigation, Ungrounded Hallucination

CoNLI is a hierarchical framework that detects ungrounded hallucinations using sentence and entity-level Natural Language Inference chains, then uses these judgments to rewrite responses for better factuality.

Core Problem

LLMs frequently generate ungrounded hallucinations—text that conflicts with or cannot be verified against source documents—in text-to-text generation tasks.

Why it matters:

Business applications like search engines and coding assistants rely on factual consistency, but LLMs are prone to fabricating information.
Existing detection models (classifiers/rankers) identify errors but do not provide actionable guidance for rewriting or correcting the text.
Users of third-party LLM APIs often lack control over the model internals (decoding strategies) or cannot access external retrieval tools.

Concrete Example: If a source text says 'Paris is in France' but an LLM generates 'Paris is in Germany', standard generation models might not catch this conflict. CoNLI would detect 'Germany' as a hallucinated entity and rewrite the sentence to align with the source.

Key Novelty

Hierarchical Chain of Natural Language Inference (CoNLI)

Decomposes hallucination detection into a hierarchy: first checking full sentences against the source, then zooming in to check specific entities within non-hallucinated sentences to catch subtle errors.
Uses Chain-of-Thought prompting to guide an LLM to reason through these NLI checks (Entailment vs. Contradiction/Neutral) without needing domain-specific fine-tuning.
Integrates detection directly with mitigation by using the specific NLI reasoning as instructions for a rewriting agent to correct the text.

Architecture

The CoNLI framework workflow comprising the Detection Agent and Mitigation Agent.

Evaluation Highlights

Reports state-of-the-art performance on hallucination detection benchmarks compared with recent detection methods.
Refined responses show improvements over initial raw responses on various NLG evaluation metrics and groundedness metrics.
Demonstrates effectiveness across abstractive summarization and grounded question-answering scenarios without fine-tuning.

Breakthrough Assessment

7/10

Offers a practical, plug-and-play solution for black-box LLM users to reduce hallucinations via post-editing. While methodologically straightforward (NLI + prompting), the hierarchical approach (sentence + entity) addresses a key granularity issue in current detection methods.

⚙️ Technical Details

Problem Definition

Setting: Text-to-text generation where a raw response must be grounded in a source text

Inputs: Source text X and raw LLM response Y_raw

Outputs: Refined response Y_refined with reduced ungrounded hallucination

Pipeline Flow

Sentence Splitting & Filtering
Sentence-level Detection (NLI)
Entity-level Detection (NER + NLI)
Mitigation / Rewriting
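
The four stages above can be sketched as a small Python skeleton. This is an illustrative sketch under stated assumptions, not the authors' implementation: the `Finding` record, the judge signature `(source, hypothesis) -> (label, reason)`, the regex splitter (a stand-in for the NLTK splitter), the two toy judges (stand-ins for LLM calls), and the toy rewriter that simply drops flagged sentences are all hypothetical and chosen so the example runs without external services.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Assumed judge signature: (source, hypothesis) -> (NLI label, reason).
Judge = Callable[[str, str], Tuple[str, str]]


@dataclass
class Finding:
    sentence: str
    level: str  # "sentence" or "entity"
    reason: str


def split_sentences(text: str) -> List[str]:
    # Regex stand-in for the NLTK sentence splitter named in the summary.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def detect(source: str, response: str,
           sentence_judge: Judge, entity_judge: Judge) -> List[Finding]:
    findings = []
    for sent in split_sentences(response):
        label, reason = sentence_judge(source, sent)
        if label != "entailment":
            findings.append(Finding(sent, "sentence", reason))
            continue  # already flagged; skip the entity-level check
        # Only sentences that passed the sentence-level check reach this stage.
        label, reason = entity_judge(source, sent)
        if label != "entailment":
            findings.append(Finding(sent, "entity", reason))
    return findings


def mitigate(response: str, findings: List[Finding], rewrite) -> str:
    # Skip the rewrite call entirely when nothing was flagged.
    return rewrite(response, findings) if findings else response


# --- Toy stand-ins (hypothetical; a real system would call an LLM) ---
def toy_sentence_judge(source: str, hypothesis: str) -> Tuple[str, str]:
    src = set(re.findall(r"\w+", source.lower()))
    words = re.findall(r"\w+", hypothesis.lower())
    overlap = sum(w in src for w in words) / max(len(words), 1)
    if overlap >= 0.5:
        return "entailment", "most words appear in the source"
    return "neutral", "little overlap with the source"


def toy_entity_judge(source: str, hypothesis: str) -> Tuple[str, str]:
    entities = re.findall(r"\b[A-Z][a-z]+\b", hypothesis)
    missing = [e for e in entities if e not in source]
    if missing:
        return "contradiction", f"entities not found in source: {missing}"
    return "entailment", "all entities found in source"


def toy_rewrite(response: str, findings: List[Finding]) -> str:
    flagged = {f.sentence for f in findings}
    return " ".join(s for s in split_sentences(response) if s not in flagged)


source = "Paris is in France."
response = "Paris is in France. Paris is in Germany. The moon is cheese."
findings = detect(source, response, toy_sentence_judge, toy_entity_judge)
refined = mitigate(response, findings, toy_rewrite)
```

With these toy judges, the deliberately lenient sentence-level check lets "Paris is in Germany." through, and the entity-level check then flags it, which mirrors the motivation for the second stage. The entity stage runs only on sentences that the first stage accepted, so each flagged sentence carries exactly one reason.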

System Modules

Sentence Splitter

Segment raw response into individual sentences/hypotheses

Model or implementation: NLTK sentence splitter

Sentence-level Detector (Detection)

Judge whether each sentence is an Entailment, Contradiction, or Neutral with respect to the source, using CoT prompting

Model or implementation: LLM (via API)
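
A minimal sketch of how a sentence-level CoT NLI check might be phrased and parsed is shown here. The prompt wording, the `Label:` output convention, and the function names `build_nli_prompt` and `parse_nli_reply` are hypothetical; the paper's actual prompts are in its appendices and are not reproduced. The fallback to "neutral" for unparseable replies is a conservative design assumption, so that an unclear judgment is treated as not entailed.

```python
import re
from typing import Tuple

_LABEL_RE = re.compile(
    r"label:\s*(entailment|contradiction|neutral)", re.IGNORECASE
)


def build_nli_prompt(source: str, hypothesis: str) -> str:
    # Hypothetical template: reasoning first, then a machine-readable label.
    return (
        f"Premise:\n{source}\n\n"
        f"Hypothesis:\n{hypothesis}\n\n"
        "Think step by step about whether the premise supports the "
        "hypothesis. Then finish with one line of the form "
        "'Label: entailment', 'Label: contradiction', or 'Label: neutral'."
    )


def parse_nli_reply(reply: str) -> Tuple[str, str]:
    """Return (label, reasoning) from a model reply."""
    matches = list(_LABEL_RE.finditer(reply))
    if not matches:
        # Assumption: an unparseable reply counts as "neutral" (not entailed).
        return "neutral", reply.strip()
    last = matches[-1]  # use the final label if the model wrote several
    return last.group(1).lower(), reply[:last.start()].strip()


prompt = build_nli_prompt("Paris is in France.", "Paris is in Germany.")
example_reply = (
    "The premise places Paris in France, not Germany.\n"
    "Label: Contradiction"
)
label, reasoning = parse_nli_reply(example_reply)
```

Keeping the reasoning text alongside the label matters for the later rewriting step, since the summary describes the detection reasons being passed on as instructions.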

Entity-level Detector (Detection)

Re-check non-hallucinated sentences by focusing on specific entities to catch missed details

Model or implementation: NER model + LLM
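
The entity-level re-check can be illustrated with a short sketch. The regex-based `extract_entities` is only a crude stand-in for a real NER model, the judge signature `(source, sentence, entity) -> label` is an assumption, and `toy_entity_judge` replaces the LLM call with a substring test so the example runs offline.

```python
import re
from typing import Callable, List, Tuple

# Assumed judge signature: (source, sentence, entity) -> NLI label.
EntityJudge = Callable[[str, str, str], str]

# Crude NER stand-in: runs of capitalised words, or numbers.
_ENTITY_RE = re.compile(r"\b[A-Z][a-z]+(?:\s[A-Z][a-z]+)*|\b\d+(?:\.\d+)?")


def extract_entities(sentence: str) -> List[str]:
    # Note: this also picks up ordinary sentence-initial words such as "The";
    # a real NER model would not.
    return _ENTITY_RE.findall(sentence)


def check_entities(source: str, sentence: str,
                   judge: EntityJudge) -> List[Tuple[str, str]]:
    """Return (entity, label) pairs for entities the judge does not entail."""
    flagged = []
    for entity in extract_entities(sentence):
        label = judge(source, sentence, entity)
        if label != "entailment":
            flagged.append((entity, label))
    return flagged


def toy_entity_judge(source: str, sentence: str, entity: str) -> str:
    # Hypothetical stand-in for an LLM judgment focused on one entity.
    return "entailment" if entity in source else "neutral"


source = "Marie Curie won the Nobel Prize in Physics in 1903."
sentence = "Marie Curie won the Nobel Prize in 1911."
flagged = check_entities(source, sentence, toy_entity_judge)
```

Here the sentence as a whole looks plausible against the source, and only the per-entity pass isolates the unsupported year.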

Mitigation Agent

Rewrite the response to remove hallucinations while preserving essence

Model or implementation: LLM (via API)
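
One way to turn detection output into rewriting instructions is sketched below. The prompt wording and the `build_rewrite_prompt` helper are hypothetical and are not the paper's prompt; the sketch only illustrates passing each flagged sentence together with its detection reason to the rewriting model.

```python
from typing import List, Optional, Tuple


def build_rewrite_prompt(source: str, raw_response: str,
                         findings: List[Tuple[str, str]]) -> Optional[str]:
    """Build a rewrite instruction from (sentence, reason) pairs.

    Returns None when nothing was flagged, so the caller can keep the raw
    response unchanged and skip the rewrite call.
    """
    if not findings:
        return None
    issues = "\n".join(
        f"{i}. Sentence: {sentence}\n   Reason: {reason}"
        for i, (sentence, reason) in enumerate(findings, start=1)
    )
    return (
        f"Source:\n{source}\n\n"
        f"Response:\n{raw_response}\n\n"
        "The following sentences are not supported by the source:\n"
        f"{issues}\n\n"
        "Rewrite the response so that every statement is supported by the "
        "source. Fix or remove only the flagged content and keep the rest "
        "of the response as close to the original as possible."
    )


rewrite_prompt = build_rewrite_prompt(
    source="Paris is in France.",
    raw_response="Paris is in Germany. It is a large city.",
    findings=[("Paris is in Germany.", "The source places Paris in France.")],
)
```

Returning `None` for an empty findings list is a design choice in this sketch: responses with no detected hallucination are left untouched, which avoids introducing new errors through an unnecessary rewrite.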

Novel Architectural Elements

Hierarchical detection pipeline: Sentence-level NLI followed by Entity-level NLI on the *surviving* sentences
Integration of detection reasons directly as instructions for the mitigation/rewriting agent

Modeling

Base Model: LLM accessed through an API (the paper body does not name specific model versions; general-purpose API usage, such as GPT-4 or a similar model, is implied)

Training Method: Zero-shot / Few-shot prompting (In-context learning)

Adaptation: None (Plug-and-play framework)

Compute: Not reported in the paper

Comparison to Prior Work

vs. Classification models (FactCC): CoNLI provides actionable reasoning for rewriting, not just a score.
vs. RARR: CoNLI does not require external retrieval; it relies solely on the provided source context.
vs. Self-Refine [not cited in paper]: CoNLI splits detection and mitigation into separate agents with a hierarchical NLI check, rather than a single model refining itself iteratively.

Limitations

Relies on the capability of the underlying LLM and NER models; failures in entity recognition propagate to detection.
Currently focuses only on ungrounded hallucination, ignoring self-conflicting or context-related hallucinations.
Performance depends on the quality of the few-shot examples used in prompting.

Reproducibility

Code: https://github.com/microsoft/CoNLI_hallucination

The paper provides prompts in Appendices D and E. The main text does not name the specific LLM versions used in the experiments; it refers to them generically (e.g., 'LLM API endpoint').

📊 Experiments & Results

Evaluation Setup

Hallucination detection and mitigation on text abstractive summarization and grounded QA

Benchmarks:

Text Abstractive Summarization benchmarks (Summarization)
Grounded Question-Answering benchmarks (QA)

Metrics:

Hallucination detection accuracy/F1
NLG evaluation metrics (the exact metrics are not listed in the excerpt)
Groundedness metrics
Statistical methodology: Not explicitly reported in the paper
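
For reference, detection accuracy and F1 can be computed from binary labels as in the sketch below. It uses the standard definitions, with 1 meaning "hallucinated" and 0 meaning "grounded". The labelling granularity (per sentence or per response) and the example data are assumptions, since the excerpt does not describe the paper's evaluation protocol.

```python
from typing import Dict, Sequence


def detection_metrics(gold: Sequence[int], pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision, recall and F1 for the positive (hallucinated) class."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    pairs = list(zip(gold, pred))
    tp = sum(g == 1 and p == 1 for g, p in pairs)
    fp = sum(g == 0 and p == 1 for g, p in pairs)
    fn = sum(g == 1 and p == 0 for g, p in pairs)
    tn = sum(g == 0 and p == 0 for g, p in pairs)
    # Guard against empty denominators by reporting 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up labels for illustration only; these are not results from the paper.
gold = [1, 0, 1, 1, 0]
pred = [1, 1, 0, 1, 0]
metrics = detection_metrics(gold, pred)
```

In this made-up example there are two true positives, one false positive, one false negative, and one true negative, so precision, recall, and F1 are each 2/3 and accuracy is 3/5.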

Key Results

No numeric results can be reported here. The paper claims state-of-the-art detection performance and improved refined responses, but the provided excerpt contains no result tables or numeric values, so benchmark, metric, baseline, and improvement figures cannot be extracted.

Main Takeaways

CoNLI reports state-of-the-art performance on hallucination detection compared with recent detection methods (qualitative claim).
Refined responses show improvements over initial responses in both text quality and groundedness (qualitative claim).
The hierarchical approach (sentence + entity) improves detection by catching subtle errors that sentence-level checks miss.
The framework is effective as a plug-and-play solution without domain-specific fine-tuning.

📚 Prerequisite Knowledge

Prerequisites

Natural Language Inference (NLI)
Chain-of-Thought (CoT) prompting
Named Entity Recognition (NER)
LLM Hallucination types (ungrounded vs. context-related)

Key Terms

NLI: Natural Language Inference—determining if a hypothesis is true (entailment), false (contradiction), or undetermined (neutral) given a premise

ungrounded hallucination: Generated text that conflicts with or cannot be verified against the provided source text

CoT: Chain-of-Thought—prompting an LLM to generate intermediate reasoning steps before the final answer

NER: Named Entity Recognition—identifying specific entities like names, locations, and dates in text

post-editing: Correcting a generated text after it has been produced, rather than changing how it is generated initially