FAITH: A Framework for Assessing Intrinsic Tabular Hallucinations in Finance

📝 Paper Summary

Financial LLM Evaluation, Hallucination Detection

FAITH is an automated framework that evaluates intrinsic hallucinations in financial LLMs by masking numerical spans in annual reports and checking if models can recover them through context-aware reasoning.

Core Problem

Existing hallucination benchmarks rely on general-domain text (e.g., Wikipedia) and simple look-up tasks, failing to capture the complex, context-dependent numerical reasoning required for financial tabular data.

Why it matters:

Financial decisions rely on precise extraction and calculation from proprietary tables; even minor numerical errors can undermine decision-making and regulatory compliance
Manual annotation for finance is resource-intensive and unscalable, while current automated methods lack the domain specificity to handle complex financial reasoning patterns
Undetected hallucinations in automated reporting or investment algorithms can propagate through pipelines, leading to compliance violations or financial losses

Concrete Example: If a table reports operating income as $500 million, a model might hallucinate 'Operating income was $500 thousand' due to scale confusion. It might also miscalculate a year-over-year percentage change that the text states explicitly but that must be derived from table values.
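The two failure modes in the example can be made concrete with a little arithmetic. The sketch below is illustrative only: the helper names and the prior-year figure of $400 million are assumptions, not values from the paper.

```python
# Verbal scales mapped to multipliers (an assumed, minimal set).
SCALE_FACTORS = {"thousand": 1e3, "million": 1e6, "billion": 1e9}


def to_base_units(value: float, scale: str) -> float:
    """Convert a value with a verbal scale into plain units."""
    return value * SCALE_FACTORS[scale]


def yoy_change_pct(current: float, previous: float) -> float:
    """Year-over-year change as a percentage of the previous value."""
    return (current - previous) / previous * 100.0


table_value = to_base_units(500, "million")      # what the table reports
hallucinated = to_base_units(500, "thousand")    # scale-confused output
scale_error_factor = table_value / hallucinated  # size of the scale error

# Hypothetical prior-year operating income, purely for illustration.
prior_year = to_base_units(400, "million")
growth = yoy_change_pct(table_value, prior_year)
```

Here the scale confusion is a thousandfold error even though the digits "500" match, which is why a checker has to compare normalized values rather than surface strings.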

Key Novelty

Context-Aware Masked Span Prediction for Tables

Treats hallucination evaluation as a fill-in-the-blank task where models must recover masked numerical values in financial text using evidential tables
Introduces a taxonomy of four financial reasoning types (Direct Lookup, Comparative, Bivariate, Multivariate) to stratify evaluation by complexity
Implements a 'precision-relaxed' evaluation protocol that normalizes numerical formats (e.g., '$1.2B' vs '1,200 million') to penalize only factual errors, not formatting differences

Architecture

The FAITH framework workflow: from parsing financial documents to masking spans, verifying answerability, and evaluating model predictions

Evaluation Highlights

Proprietary models (Claude-Sonnet-4, Gemini-2.5-Pro) achieve high overall accuracy but still exhibit 10–20% error rates on multi-step numerical reasoning tasks
Open-source models fail catastrophically on complex calculations, scoring near zero in multivariate scenarios, highlighting a major gap in reasoning capabilities
Validation study shows 96.2% accuracy for unanimous LLM consensus in annotating answerability, indicating that the automated dataset construction is reliable

Breakthrough Assessment

7/10

Strong contribution to domain-specific evaluation, offering a scalable automated method for finance. While methodologically sound, it focuses on evaluation rather than a new model architecture.

⚙️ Technical Details

Problem Definition

Setting: Context-aware masked span prediction over tabular financial documents

Inputs: A corrupted sentence with a masked span [MASK] and context C_i containing tables T, pre-texts P, and surrounding sentences

Outputs: The predicted content m_hat for the masked span
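One way to picture a single task instance is as a small record holding the original sentence, the masked span, and the context. This is a minimal sketch: the class name, field names, prompt layout, and the serialized table row are all hypothetical and not taken from the paper.

```python
from dataclasses import dataclass, field

MASK = "[MASK]"


@dataclass
class MaskedInstance:
    sentence: str                  # original, uncorrupted sentence
    span: tuple[int, int]          # character offsets of the masked span
    tables: list[str]              # evidential tables T, serialized as text
    pre_texts: list[str] = field(default_factory=list)  # pre-texts P
    neighbors: list[str] = field(default_factory=list)  # surrounding sentences

    @property
    def answer(self) -> str:
        """Ground-truth content of the masked span."""
        start, end = self.span
        return self.sentence[start:end]

    @property
    def corrupted(self) -> str:
        """Sentence with the span replaced by the mask token."""
        start, end = self.span
        return self.sentence[:start] + MASK + self.sentence[end:]


def build_prompt(inst: MaskedInstance) -> str:
    """Assemble context and corrupted sentence (assumed layout)."""
    parts = ["Tables:", *inst.tables, "Pre-text:", *inst.pre_texts,
             "Context:", *inst.neighbors,
             "Fill in the masked span:", inst.corrupted]
    return "\n".join(parts)


sentence = "Operating income was $500 million in 2023."
start = sentence.index("$500 million")
example = MaskedInstance(
    sentence=sentence,
    span=(start, start + len("$500 million")),
    tables=["Operating income | 500 | (in millions)"],  # hypothetical table row
)
```

The model's prediction m_hat would then be compared against `example.answer`.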

Pipeline Flow

Document Pre-processing (Text/Table Partitioning)
Span Selection & Masking
Answerability Verification (LLM Consensus)
Model Prediction & Reasoning
Precision-Relaxed Evaluation

System Modules

Span Selector (Data Construction)

Identify non-overlapping numeric spans in sentences that include units or verbal scales

Model or implementation: Rule-based
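The summary does not give the actual selection rules, so the following is only a sketch of what a rule-based selector might look like. It assumes a regular expression that keeps numbers carrying a currency sign, a percent sign, or a verbal scale; the pattern and the scale list are assumptions.

```python
import re

# Digits with optional thousands separators and decimals.
NUMBER = r"\d{1,3}(?:,\d{3})+(?:\.\d+)?|\d+(?:\.\d+)?"
SCALE = r"(?:thousand|million|billion|trillion)"
PATTERN = re.compile(
    rf"(?P<currency>\$)?(?P<number>{NUMBER})"
    rf"(?:\s*(?P<percent>%)|\s+(?P<scale>{SCALE})\b)?",
    re.IGNORECASE,
)


def select_spans(sentence: str) -> list[tuple[int, int, str]]:
    """Return (start, end, text) for numeric spans that carry a unit or scale.

    re.finditer yields non-overlapping matches, so the spans cannot overlap.
    """
    spans = []
    for m in PATTERN.finditer(sentence):
        if m.group("currency") or m.group("percent") or m.group("scale"):
            spans.append((m.start(), m.end(), m.group(0)))
    return spans


spans = select_spans(
    "Revenue rose 12.5% to $1,200 million in 2023, up from $1.07 billion."
)
texts = [text for _, _, text in spans]
```

In this sketch the bare year "2023" is skipped because it has no unit or scale attached, while the percentage and both dollar amounts are kept as candidate spans.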

Answerability Filter (Data Construction)

Verify if the masked span can be uniquely inferred from the provided tables and context

Model or implementation: Ensemble of GPT-4.1, Claude-Sonnet-4, Gemini-2.5-Pro
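The unanimity rule itself is simple to express. In the sketch below each annotator is a hypothetical callable that returns True when it judges the masked span answerable; in practice each would wrap a call to one of the three LLMs, which is not shown here. The stub annotators are placeholders.

```python
from typing import Callable, Sequence

# Hypothetical interface: an annotator maps a prompt to an answerability vote.
Annotator = Callable[[str], bool]


def is_answerable(prompt: str, annotators: Sequence[Annotator]) -> bool:
    """Keep an instance only if every annotator votes 'answerable'."""
    if not annotators:
        raise ValueError("at least one annotator is required")
    return all(annotator(prompt) for annotator in annotators)


# Stub annotators standing in for real LLM calls.
def always_yes(prompt: str) -> bool:
    return True


def needs_table(prompt: str) -> bool:
    return "Tables:" in prompt


kept = is_answerable("Tables: ... [MASK] ...", [always_yes, needs_table])
dropped = is_answerable("No evidence here, [MASK]", [always_yes, needs_table])
```

Requiring unanimity trades dataset size for label quality: a single dissenting vote removes the instance.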

Evaluator

Compare predicted values against ground truth using precision-relaxed logic

Model or implementation: Algorithm 1 (Deterministic)
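Algorithm 1 is not reproduced in this summary, so the following is an assumed interpretation of precision-relaxed comparison rather than the paper's procedure. It normalizes scale suffixes and then compares both values at the precision of the coarser one. The scale table, the regular expression, and the rounding rule are all assumptions, and units are not checked here.

```python
import re
from decimal import Decimal

# Assumed scale aliases mapped to powers of ten.
SCALE_EXPONENTS = {"k": 3, "thousand": 3, "m": 6, "million": 6,
                   "b": 9, "billion": 9}
_NUMBER = re.compile(r"(-?\d[\d,]*(?:\.\d+)?)\s*([A-Za-z]+)?")


def parse_number(text: str) -> tuple[Decimal, int]:
    """Return (value in base units, count of written significant digits)."""
    match = _NUMBER.search(text)
    if match is None:
        raise ValueError(f"no number found in {text!r}")
    digits = match.group(1).replace(",", "")
    scale = (match.group(2) or "").lower()
    value = Decimal(digits).scaleb(SCALE_EXPONENTS.get(scale, 0))
    # Trailing zeros count as written digits; the coarser side governs below.
    significant = len(digits.lstrip("-").replace(".", "").lstrip("0")) or 1
    return value, significant


def round_significant(value: Decimal, n: int) -> Decimal:
    """Round to n significant digits (Decimal's default half-even rounding)."""
    if value == 0:
        return value
    exponent = value.adjusted() - n + 1
    return value.quantize(Decimal(1).scaleb(exponent))


def relaxed_match(prediction: str, truth: str) -> bool:
    """Compare two numbers at the precision of the less precise one."""
    pred_value, pred_sig = parse_number(prediction)
    true_value, true_sig = parse_number(truth)
    n = min(pred_sig, true_sig)
    return round_significant(pred_value, n) == round_significant(true_value, n)
```

Under these assumptions `relaxed_match("$1.2B", "1,200 million")` is treated as a match, while `relaxed_match("$500 thousand", "$500 million")` is not, so formatting differences pass and scale confusion is still penalized.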

Novel Architectural Elements

Taxonomy-based evaluation flow: Classifies reasoning into 4 distinct types (Direct Lookup, Comparative, Bivariate, Multivariate) dynamically based on successful model reasoning paths
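Stratifying results by the four reasoning types amounts to grouping per-instance outcomes before averaging. The sketch below assumes each evaluated instance already carries a reasoning-type label; the records are toy data for illustration, not results from the paper.

```python
from collections import defaultdict

REASONING_TYPES = ["Direct Lookup", "Comparative", "Bivariate", "Multivariate"]


def accuracy_by_type(records: list[tuple[str, bool]]) -> dict[str, float]:
    """Compute accuracy per reasoning type from (type, is_correct) pairs."""
    totals: dict[str, int] = defaultdict(int)
    hits: dict[str, int] = defaultdict(int)
    for reasoning_type, correct in records:
        totals[reasoning_type] += 1
        hits[reasoning_type] += int(correct)
    # Report types in taxonomy order, skipping types with no instances.
    return {t: hits[t] / totals[t] for t in REASONING_TYPES if totals[t]}


# Toy outcomes, invented for illustration only.
toy_records = [
    ("Direct Lookup", True), ("Direct Lookup", True),
    ("Comparative", True), ("Comparative", False),
    ("Multivariate", False),
]
stratified = accuracy_by_type(toy_records)
```

Reporting one number per type, rather than a single pooled accuracy, is what exposes the drop from lookups to multi-step calculations.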

Comparison to Prior Work

vs. HaluEval: FAITH masks real ground-truth values in actual financial reports rather than generating synthetic hallucinations
vs. FinQA: FAITH is an intrinsic hallucination benchmark using a fill-in-the-mask objective, ensuring faithfulness to context rather than just QA accuracy
vs. General Tabular Benchmarks: FAITH introduces a specific taxonomy for financial reasoning complexity (e.g., Bivariate vs. Multivariate calculation)

Limitations

Reliance on LLM consensus for answerability filtering might inherit biases from the annotator models
Focuses strictly on numerical spans, potentially missing semantic or qualitative hallucinations in financial text
The precision-relaxed evaluation might occasionally be too lenient if precision is critical for a specific financial use case

Reproducibility

The paper describes the dataset creation methodology (FAITH) in detail, including the filtering and masking logic. The dataset itself is derived from publicly available S&P 500 annual reports (10-Ks). Code and specific dataset files are not explicitly linked in the text.

📊 Experiments & Results

Evaluation Setup

Intrinsic hallucination detection on financial 10-K reports via masked span prediction

Benchmarks:

FAITH Dataset (Context-aware masked span prediction (financial)) [New]

Metrics:

Accuracy (Precision-Relaxed)
Statistical methodology: Answerability annotation is validated with Fleiss' Kappa (0.905) for agreement among human annotators and with accuracy statistics for LLM consensus against the human labels
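For readers unfamiliar with the statistic, Fleiss' Kappa can be computed from a table of per-item category counts using the standard textbook formula. The sketch below uses a toy rating table; it does not reproduce the paper's annotation data or its 0.905 figure.

```python
def fleiss_kappa(counts: list[list[int]]) -> float:
    """Fleiss' kappa, where counts[i][j] is the number of raters who
    assigned item i to category j (same number of raters for every item)."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_categories = len(counts[0])

    # Share of all assignments that fall into each category.
    p_j = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    # Observed agreement for each item.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_i) / n_items        # mean observed agreement
    p_e = sum(p * p for p in p_j)     # agreement expected by chance
    return (p_bar - p_e) / (1 - p_e)


# Toy table: 4 items, 3 raters, categories (answerable, unanswerable).
toy_counts = [[3, 0], [0, 3], [3, 0], [2, 1]]
kappa = fleiss_kappa(toy_counts)
```

Values close to 1 indicate agreement well beyond chance, which is the sense in which the reported 0.905 supports the human labels.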

Key Results

Evaluation of answerability annotation reliability compares LLM consensus against human ground truth.

Benchmark	Metric	Baseline	This Paper	Δ
FAITH Pilot	Accuracy (Unanimous LLM Consensus)	100.0	96.2	-3.8

Performance analysis across reasoning complexities shows degradation in multi-step tasks.

Main Takeaways

Proprietary frontier models (Claude-Sonnet-4, Gemini-2.5-Pro) exhibit 10–20% error rates on multi-step financial reasoning, posing risks for high-stakes deployment
Model performance follows a stratified hierarchy: frontier models are reliable for lookups, but open-source models struggle significantly with complex calculations (Bivariate/Multivariate)
Unanimous LLM consensus is a highly reliable proxy for human annotation in determining the 'answerability' of financial text spans
Reasoning complexity is a key driver of hallucination; models degrade from Direct Lookup -> Comparative -> Bivariate -> Multivariate tasks

📚 Prerequisite Knowledge

Prerequisites

Understanding of Large Language Model hallucination types (intrinsic vs. extrinsic)
Basic financial literacy (reading balance sheets, income statements)
Familiarity with masked language modeling concepts

Key Terms

intrinsic hallucination: A type of error where the model's output contradicts the provided source context (e.g., a table in the prompt), as opposed to contradicting external world knowledge

masked span prediction: A task where specific parts of a text are hidden (masked) and the model must predict the original content based on context

S&P 500: A stock market index tracking the stock performance of 500 of the largest companies listed on stock exchanges in the United States

10-K report: A comprehensive summary report of a company's financial performance submitted annually to the U.S. Securities and Exchange Commission

precision-relaxed evaluation: An evaluation method that normalizes numbers and compares them based on their significant digits to avoid penalizing valid formatting differences (e.g., 1M vs 1,000,000)

unit groups: Sets of aliased units (e.g., {$, USD, dollars}) used to match predicted units with ground truth regardless of specific phrasing
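A unit group can be sketched as a set of aliases, with two units matching when they fall in the same set. Only the dollar group comes from the definition above; the percent group and the function name are assumptions added for illustration.

```python
UNIT_GROUPS = [
    {"$", "usd", "dollars"},   # example given in the definition
    {"%", "percent", "pct"},   # assumed additional group
]


def same_unit(predicted: str, truth: str) -> bool:
    """True if both units are identical or belong to the same alias group."""
    a, b = predicted.strip().lower(), truth.strip().lower()
    if a == b:
        return True
    return any(a in group and b in group for group in UNIT_GROUPS)


dollar_match = same_unit("USD", "dollars")
cross_match = same_unit("$", "percent")
```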

Fleiss' Kappa: A statistical measure for assessing the reliability of agreement between a fixed number of raters