Graph-based Uncertainty Metrics for Long-form Language Model Outputs

📝 Paper Summary

Hallucination suppression, Granular uncertainty estimation, Uncertainty-aware decoding

Graph Uncertainty constructs a bipartite graph mapping generated responses to individual claims and uses graph centrality metrics like closeness centrality to estimate the factual reliability of each claim.

Core Problem

Existing uncertainty estimation methods for LLMs often operate at the response level rather than the claim level, or rely on simple frequency counting (self-consistency) that ignores complex semantic relationships.

Why it matters:

LLMs frequently generate long-form text containing a mix of true and false claims, requiring granular detection rather than a binary correct/incorrect label for the whole text
Current methods like self-consistency do not fully leverage the structural entailment relationships between different responses and claims, missing signals that could improve reliability
Users cannot trust long-form generations without knowing which specific parts are hallucinations, hindering deployment in high-stakes applications

Concrete Example: If an LLM generates a paragraph about a politician, it might correctly state their birth year but hallucinate their election date. A simple frequency count scores the election-date claim only by how many sampled responses support it, so it can miss that those supporters are a few outlier responses that disagree with the rest of the semantic evidence; a graph-based view would isolate the claim.

Key Novelty

Graph-based Uncertainty Estimation with Centrality Metrics

Constructs a bipartite graph where one set of nodes represents generated responses and the other represents extracted claims, with edges representing entailment
Generalizes the popular 'self-consistency' method (which uses degree centrality) by applying more sophisticated graph metrics like closeness centrality to determine claim reliability
Uses these granular uncertainty scores to filter out unreliable claims during decoding, synthesizing a final response that preserves high-confidence information

Architecture

Illustration of the Graph Uncertainty framework. It shows the process from Response Sampling -> Claim Decomposition -> Bipartite Graph Construction -> Centrality Calculation.

Evaluation Highlights

Outperforms baselines by an average of 6.8% on AUPRC (Area Under Precision-Recall Curve) for claim-level uncertainty estimation on FactScore and PopQA datasets
Achieves consistent 2-4% gains in factuality (FactScore) over existing decoding techniques while maintaining informativeness
Generates 70% more true claims at the 95% precision level compared to baseline methods like Self-Consistency and Verbalized Confidence

Breakthrough Assessment

8/10

Significantly generalizes the dominant self-consistency paradigm into a graph framework. The empirical gains on granular claim-level detection are substantial, addressing a critical bottleneck in long-form generation.

⚙️ Technical Details

Problem Definition

Setting: Granular uncertainty estimation for black-box LLMs in long-form generation

Inputs: Prompt x and a generated response y containing multiple claims

Outputs: Uncertainty score U(x, c) for each individual claim c within the response

Pipeline Flow

Response Sampling (generate N variations)
Claim Extraction & Merging (decompose responses into atomic claims)
Graph Construction (build Response-Claim bipartite graph)
Uncertainty Estimation (calculate Graph Centrality)
Uncertainty-Aware Decoding (filter and synthesize final response)
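
The first three steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `build_response_claim_edges` and the two toy helpers are hypothetical, and the helpers stand in for the prompted LLM calls (sentence splitting instead of claim decomposition, exact-match deduplication instead of semantic merging, substring matching instead of entailment checking).

```python
from typing import Callable, List, Set, Tuple


def build_response_claim_edges(
    responses: List[str],
    extract_claims: Callable[[str], List[str]],
    entails: Callable[[str, str], bool],
) -> Tuple[List[str], Set[Tuple[int, int]]]:
    """Return the merged claim list and the (response_idx, claim_idx) edges."""
    # Claim extraction and merging. "Merging" here is exact-match
    # deduplication, an assumed stand-in for an LLM-based semantic merge.
    claims: List[str] = []
    for response in responses:
        for claim in extract_claims(response):
            if claim not in claims:
                claims.append(claim)
    # Graph construction: one edge per (response, claim) pair that the
    # entailment check accepts.
    edges = {
        (r_idx, c_idx)
        for r_idx, response in enumerate(responses)
        for c_idx, claim in enumerate(claims)
        if entails(response, claim)
    }
    return claims, edges


# Toy stand-ins (hypothetical): each sentence is a claim, and a response
# "entails" a claim if the claim text appears in it verbatim.
def toy_extract_claims(response: str) -> List[str]:
    return [s.strip() for s in response.split(".") if s.strip()]


def toy_entails(response: str, claim: str) -> bool:
    return claim in response


sampled = [
    "Born in 1950. Elected in 1992.",
    "Born in 1950. Elected in 1996.",
    "Born in 1950.",
]
claims, edges = build_response_claim_edges(sampled, toy_extract_claims, toy_entails)
print(claims)
print(sorted(edges))
```

In a real pipeline each helper would be replaced by a prompted call to the target model, which is where most of the cost comes from: the entailment step alone needs one check per response-claim pair.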

System Modules

Response Sampler (Graph Construction)

Generate a diverse set of responses from the model

Model or implementation: Target LLM (e.g., GPT-4, Llama-3)

Claim Processor (Graph Construction)

Decompose responses into atomic claims and merge semantically identical ones

Model or implementation: Target LLM (prompted)

Edge Constructor (Graph Construction)

Determine entailment edges between responses and claims

Model or implementation: Target LLM (prompted)

Centrality Calculator

Compute uncertainty scores based on graph topology

Model or implementation: Algorithm (e.g., NetworkX implementation)

Uncertainty-Aware Decoder

Filter low-confidence claims and synthesize final output

Model or implementation: Target LLM (prompted)
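
A rough sketch of the filtering step, assuming each claim already has a confidence score (for example a centrality value) and that a fixed threshold is used. The function names, the threshold, and the prompt wording are all hypothetical; the paper's actual prompts are in its appendix and are not reproduced here.

```python
from typing import Dict, List


def select_confident_claims(scores: Dict[str, float], threshold: float) -> List[str]:
    """Keep claims scoring at or above the threshold, highest score first."""
    kept = [claim for claim, score in scores.items() if score >= threshold]
    return sorted(kept, key=lambda claim: scores[claim], reverse=True)


def build_synthesis_prompt(question: str, claims: List[str]) -> str:
    """Assemble an illustrative prompt asking the model to use only kept claims."""
    bullet_list = "\n".join(f"- {claim}" for claim in claims)
    return (
        f"Question: {question}\n"
        "Write a fluent answer using only the following claims:\n"
        f"{bullet_list}"
    )


# Toy confidence scores (made up for illustration).
claim_scores = {
    "Born in 1950": 0.9,
    "Elected in 1992": 0.35,
    "Served two terms": 0.7,
}
kept_claims = select_confident_claims(claim_scores, threshold=0.5)
prompt = build_synthesis_prompt("Tell me about this politician.", kept_claims)
print(prompt)
```

Raising the threshold trades informativeness for factuality: fewer claims survive, but the ones that do are better supported.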

Novel Architectural Elements

Bipartite Entailment Graph representing the many-to-many relationship between sampled responses and decomposed claims
Application of Closeness Centrality (and other graph metrics) as a proxy for claim truthfulness, generalizing beyond simple frequency counts
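
The difference between the two metrics can be seen on a small toy graph. The sketch below is a plain-Python illustration, not the paper's code: it computes degree centrality normalized by n - 1 and closeness centrality with Wasserman-Faust scaling, which is intended to mirror the defaults of NetworkX's `degree_centrality` and `closeness_centrality`. The graph itself is invented for the example.

```python
from collections import deque
from typing import Dict, Iterable, List, Tuple


def build_adjacency(edges: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Undirected adjacency lists from (response, claim) edges."""
    adj: Dict[str, List[str]] = {}
    for response, claim in edges:
        adj.setdefault(response, []).append(claim)
        adj.setdefault(claim, []).append(response)
    return adj


def degree_centrality(adj: Dict[str, List[str]], node: str) -> float:
    """Number of neighbors divided by the number of other nodes."""
    return len(adj[node]) / (len(adj) - 1)


def closeness_centrality(adj: Dict[str, List[str]], node: str) -> float:
    """Inverse mean shortest-path distance, scaled by the reachable fraction."""
    # Breadth-first search gives shortest-path lengths in an unweighted graph.
    dist = {node: 0}
    queue = deque([node])
    while queue:
        current = queue.popleft()
        for neighbor in adj[current]:
            if neighbor not in dist:
                dist[neighbor] = dist[current] + 1
                queue.append(neighbor)
    total = sum(dist.values())
    reachable = len(dist) - 1
    if total == 0:
        return 0.0
    return (reachable / total) * (reachable / (len(adj) - 1))


# Toy graph: r1 and r2 agree on four claims, r3 shares one of them and adds
# cB, and r4 supports only cB. Both cA and cB have exactly two supporters.
edges = [
    ("r1", "cA"), ("r1", "cC"), ("r1", "cE"), ("r1", "cF"),
    ("r2", "cA"), ("r2", "cC"), ("r2", "cE"), ("r2", "cF"),
    ("r3", "cC"), ("r3", "cB"),
    ("r4", "cB"),
]
adj = build_adjacency(edges)
degree_a, degree_b = degree_centrality(adj, "cA"), degree_centrality(adj, "cB")
closeness_a, closeness_b = closeness_centrality(adj, "cA"), closeness_centrality(adj, "cB")
print(degree_a, degree_b)
print(closeness_a, closeness_b)
```

Degree centrality (the self-consistency view) ties the two claims at 0.25, while closeness ranks cA above cB because cA's supporters sit nearer the bulk of the graph. Whether that ordering tracks truthfulness is the empirical question the paper evaluates.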

Modeling

Base Model: GPT-4 and Llama-3-70B-Instruct (used as black-box models for experiments)

Training Method: Inference-time intervention / Decoding strategy

Adaptation: None (Prompt-based interaction)

Trainable Parameters: None

Compute: Not reported in the paper

Comparison to Prior Work

vs. Self-Consistency: SC is a special case (degree centrality) of Graph Uncertainty; Graph Uncertainty uses global graph structure (closeness) rather than just local connections
vs. Verbalized Confidence: Graph Uncertainty uses consistency across samples rather than relying on the model's potentially uncalibrated introspection
vs. Manakul et al. (2023): Manakul et al. focus on sentence-level self-consistency; this work operates at the finer claim level and introduces graph topology metrics
vs. White-box/logit-based methods [not cited in paper]: Does not require access to model weights or logits, making it applicable to closed APIs

Limitations

Computational cost is high due to multiple LLM calls for sampling, claim decomposition, and entailment checking
Relies on the accuracy of the LLM for claim extraction and entailment verification (if the model fails at these, the graph is flawed)
Graph construction latency may be prohibitive for real-time interactive applications

Reproducibility

Code: https://github.com/Mingjianjiang-1/Graph-based-Uncertainty

Prompts for claim decomposition, merging, and entailment checking are provided in Appendix F. The method relies on API access to models like GPT-4 or local inference of Llama-3.

📊 Experiments & Results

Evaluation Setup

Long-form generation tasks evaluated for claim-level factuality

Benchmarks:

FactScore (Biography generation)
PopQA (question answering, modified for long-form generation)

Metrics:

AUROC (Area Under Receiver Operating Characteristic)
AUPRC (Area Under Precision-Recall Curve)
FactScore (percentage of atomic claims that are true)
Number of True Claims (informativeness)
Statistical methodology: Not explicitly reported in the paper
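
For readers unfamiliar with the two ranking metrics, here is a minimal plain-Python sketch. It treats each claim as a binary example (1 = true, 0 = false) scored by its confidence. AUROC is computed as the fraction of correctly ordered (true, false) pairs, and AUPRC is approximated by average precision. The sketch assumes both classes are present and that scores are not tied; the labels and scores are invented, and the paper's exact evaluation code may differ.

```python
from typing import List, Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random true claim outscores a random false one."""
    pos = [s for label, s in zip(labels, scores) if label]
    neg = [s for label, s in zip(labels, scores) if not label]
    # Count pairwise wins, giving half credit to ties.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Mean of the precision values at the rank of each true claim."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits


toy_labels: List[int] = [1, 1, 0, 1, 0]
toy_scores: List[float] = [0.9, 0.8, 0.7, 0.6, 0.2]
toy_auroc = auroc(toy_labels, toy_scores)
toy_auprc = average_precision(toy_labels, toy_scores)
print(round(toy_auroc, 4), round(toy_auprc, 4))
```

AUPRC depends on the share of true claims, so when most claims are true (as with the FactScore numbers in the results table) its baseline is already high and AUROC can show a larger gap.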

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Uncertainty Estimation Performance: Comparison of graph centrality metrics against baselines (Self-Consistency/Degree, Verbalized Confidence) on GPT-4.
FactScore (GPT-4)	AUPRC	89.0	95.5	+6.5
PopQA (GPT-4)	AUPRC	69.0	76.1	+7.1
FactScore (GPT-4)	AUROC	62.8	80.4	+17.6
Factuality Improvement: Evaluating the end-to-end system using uncertainty-aware decoding.
FactScore (Bios)	FactScore %	73.4	77.1	+3.7
FactScore	Number of True Claims @ 95% Precision	Not reported as exact number	Not reported as exact number	+70% (relative)
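
One plausible way to compute "number of true claims at a target precision" is sketched below: rank claims by confidence, consider every prefix of the ranking, and report the largest count of true claims among prefixes whose precision meets the target. This reading of the metric, the function name, and the toy data are assumptions; the summary does not spell out the paper's exact procedure.

```python
from typing import Sequence


def true_claims_at_precision(
    labels: Sequence[int],
    scores: Sequence[float],
    target_precision: float = 0.95,
) -> int:
    """Most true claims retainable while keeping precision >= target."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    best = 0
    hits = 0
    for kept, (_, label) in enumerate(ranked, start=1):
        hits += int(label)
        # hits never decreases, so the last qualifying prefix has the most.
        if hits / kept >= target_precision:
            best = hits
    return best


# Toy data: the fourth-ranked claim is false.
example_labels = [1, 1, 1, 0, 1]
example_scores = [0.9, 0.8, 0.7, 0.6, 0.5]
strict = true_claims_at_precision(example_labels, example_scores, 0.95)
relaxed = true_claims_at_precision(example_labels, example_scores, 0.75)
print(strict, relaxed)
```

A better uncertainty estimate pushes false claims lower in the ranking, so more true claims fit before precision drops below the target.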

Main Takeaways

Closeness Centrality consistently outperforms Degree Centrality (Self-Consistency), suggesting that the global structure of the entailment graph holds valuable information about truthfulness.
Graph-based methods significantly outperform Verbalized Confidence, indicating that consistency across samples is a better proxy for truth than model introspection in black-box settings.
The method allows for a better tradeoff between precision (factuality) and recall (informativeness), retaining more true information while filtering hallucinations compared to aggressive filtering based on simple counts.

📚 Prerequisite Knowledge

Prerequisites

Basic graph theory (nodes, edges, bipartite graphs)
Graph centrality metrics (degree, closeness, PageRank)
Large Language Model generation (temperature sampling)
Self-consistency / Ensemble methods

Key Terms

bipartite graph: A graph where nodes are divided into two disjoint sets (here, responses and claims), and edges only connect nodes from different sets

entailment: A relationship where the truth of one statement (the response) guarantees the truth of another (the claim)

centrality metrics: Measures used to determine the importance of a node in a graph; examples include Degree (number of connections) and Closeness (inverse of the average shortest-path distance to other nodes)

self-consistency (SC): A method that samples multiple reasoning paths or answers from an LLM and selects the most consistent one; shown here to be equivalent to Degree Centrality

AUPRC: Area Under the Precision-Recall Curve, a performance metric for classification tasks that is particularly useful when classes are imbalanced

FactScore: A metric/benchmark for evaluating the factuality of long-form text generation by breaking text into atomic claims and verifying them

degree centrality: A simple measure of importance based on counting the number of direct connections a node has

closeness centrality: A measure of importance based on the reciprocal of the average shortest-path length between a node and all other nodes in the graph, so nodes that are close to everything else score higher
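
For reference, the standard textbook forms of degree and closeness centrality for a node v in a connected graph with n nodes are given below, where d(v, u) is the shortest-path length between v and u. Whether the paper applies any additional normalization is not stated in this summary, so the exact scaling should be treated as an assumption.

```latex
C_{\mathrm{deg}}(v) = \frac{\deg(v)}{n - 1},
\qquad
C_{\mathrm{clo}}(v) = \frac{n - 1}{\sum_{u \neq v} d(v, u)}
```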

betweenness centrality: A measure of importance based on the number of times a node acts as a bridge along the shortest path between two other nodes

PageRank: An algorithm that measures node importance by counting the number and quality of links to the node (originally used by Google for web pages)
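
As a rough illustration of how PageRank could be applied to an undirected response-claim graph, the sketch below runs a basic power iteration. The damping factor of 0.85 and the fixed iteration count are conventional choices rather than values taken from the paper, the toy graph is invented, and the code assumes every node has at least one neighbor.

```python
from typing import Dict, List


def pagerank(
    adj: Dict[str, List[str]],
    damping: float = 0.85,
    iterations: int = 100,
) -> Dict[str, float]:
    """Power-iteration PageRank on an undirected graph given as adjacency lists."""
    nodes = list(adj)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node receives the same teleportation mass ...
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        # ... plus an equal share of each neighbor's current rank.
        for node in nodes:
            share = damping * rank[node] / len(adj[node])
            for neighbor in adj[node]:
                new_rank[neighbor] += share
        rank = new_rank
    return rank


# Toy graph: claim c1 is supported by three responses, claim c2 by one.
toy_adj = {
    "c1": ["r1", "r2", "r3"],
    "c2": ["r3"],
    "r1": ["c1"],
    "r2": ["c1"],
    "r3": ["c1", "c2"],
}
ranks = pagerank(toy_adj)
print(ranks["c1"], ranks["c2"])
```

Unlike degree centrality, a claim's PageRank depends on how its supporting responses spread their weight, so a response that supports many claims contributes less to each one.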

verbalized confidence (VC): Prompting the LLM to explicitly state its confidence (e.g., 'I am 90% sure') in its own output

greedy decoding: A generation strategy where the model always selects the highest-probability token at each step