
Factual Inconsistency in Data-to-Text Generation Scales Exponentially with LLM Size: A Statistical Validation

Joy Mahapatra, Soumyajit Roy, Utpal Garain
Indian Statistical Institute Kolkata
arXiv (2025)
Factuality Benchmark

📝 Paper Summary

Factual Inconsistency Analysis Scaling Laws
Contrary to the widely assumed power law for general performance, factual inconsistency in data-to-text tasks decreases exponentially as large language model size increases.
Core Problem
LLMs generally follow power laws for perplexity and generalization error, but it is unknown whether factual inconsistency (hallucination) in data-to-text (D2T) generation follows the same trend.
Why it matters:
  • Monitoring factual inconsistency is essential for building trustworthy D2T systems (e.g., automated journalism, conversation systems) where hallucinations undermine user trust
  • Existing scaling laws focus on loss or perplexity, overlooking specific failure modes like factual errors, leaving a gap in understanding how model size mitigates hallucination
Concrete Example: When generating text from a structured table about a restaurant, a smaller model might hallucinate a dish that is not present in the input. The paper investigates whether increasing model size alone reduces such errors linearly, exponentially, or according to a power law.
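To make the candidate trends concrete, they can be written as functions of the parameter count. This is an illustrative parameterization, not taken from the paper: the symbols f (factual inconsistency score), N (number of parameters), and the constants a, b > 0 are assumptions, and the paper's exact forms (for example, any offset term) may differ.

```latex
% Candidate scaling forms (illustrative; a, b > 0 are assumed constants)
\begin{align*}
  \text{linear:}      \quad & f(N) = a - bN \\
  \text{power law:}   \quad & f(N) = a N^{-b}
      \;\Longleftrightarrow\; \log f(N) = \log a - b \log N \\
  \text{exponential:} \quad & f(N) = a e^{-bN}
      \;\Longleftrightarrow\; \log f(N) = \log a - bN
\end{align*}
```

Under these forms, a power law is a straight line on a log-log plot, while an exponential is a straight line on a semi-log plot (log f against N).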
Key Novelty
Exponential Scaling of Factual Inconsistency
  • Investigates the relationship between LLM parameter count and factual inconsistency scores across diverse D2T datasets and model families
  • Establishes a rigorous three-stage statistical framework (predictive performance, goodness-of-fit, and hypothesis testing) to formally validate that an exponential model fits the data significantly better than a power law
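The first two stages of such a comparison (predictive performance and goodness-of-fit) can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' procedure: it fits both forms by least squares on log-transformed scores, uses synthetic noise-free data generated from an exponential curve, and every function name and number in it is hypothetical.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = c0 + c1 * x; returns (c0, c1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    c1 = sxy / sxx
    return mean_y - c1 * mean_x, c1

def fit_power_law(sizes, scores):
    """Fit f(N) = a * N**(-b) via a straight line in log-log space."""
    c0, c1 = fit_line([math.log(n) for n in sizes],
                      [math.log(y) for y in scores])
    a, b = math.exp(c0), -c1
    return lambda n: a * n ** (-b)

def fit_exponential(sizes, scores):
    """Fit f(N) = a * exp(-b * N) via a straight line in semi-log space."""
    c0, c1 = fit_line(list(sizes), [math.log(y) for y in scores])
    a, b = math.exp(c0), -c1
    return lambda n: a * math.exp(-b * n)

def sse(model, sizes, scores):
    """Sum of squared errors of a fitted model on the given points."""
    return sum((model(n) - y) ** 2 for n, y in zip(sizes, scores))

# Synthetic, hypothetical data: model sizes in billions of parameters and
# inconsistency scores generated from an exponential curve (no noise).
sizes = [0.07, 0.16, 0.41, 1.0, 1.4, 2.8, 6.9, 12.0]
scores = [0.5 * math.exp(-0.3 * n) for n in sizes]

# Goodness-of-fit: fit on all points and compare in-sample error.
sse_power = sse(fit_power_law(sizes, scores), sizes, scores)
sse_expo = sse(fit_exponential(sizes, scores), sizes, scores)

# Predictive performance: fit on the smaller models, predict the two largest.
train_sizes, train_scores = sizes[:-2], scores[:-2]
test_sizes, test_scores = sizes[-2:], scores[-2:]
holdout_power = sse(fit_power_law(train_sizes, train_scores),
                    test_sizes, test_scores)
holdout_expo = sse(fit_exponential(train_sizes, train_scores),
                   test_sizes, test_scores)

print(f"in-sample SSE  power={sse_power:.3e}  exponential={sse_expo:.3e}")
print(f"hold-out SSE   power={holdout_power:.3e}  exponential={holdout_expo:.3e}")
```

Fitting in log space keeps the sketch dependency-free, but it minimizes error on log scores rather than raw scores; a nonlinear least-squares fit on the raw scores would be a reasonable alternative. Because the data here are generated from an exponential, the exponential fit wins by construction; the point is only to show the mechanics of the comparison.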
Evaluation Highlights
  • Exponential scaling consistently provides a better statistical fit than power law scaling for factual inconsistency across 3 LLM families (Pythia, OPT, BLOOM) and 5 D2T datasets
  • Vuong's likelihood-ratio test confirms the superiority of the exponential model over the power law model with high significance (p < 0.005) in nearly all experimental configurations
  • Inconsistency metrics (AlignScore, QAFactEval, SummaC-conv, UniEval-fact) all broadly support the exponential trend, with only minor deviations in specific model-dataset pairs (e.g., BLOOM on E2E)
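For readers unfamiliar with Vuong's likelihood-ratio test, the sketch below shows the basic statistic for two non-nested models with the same number of parameters. It is an illustration under stated assumptions, not the paper's implementation: it assumes Gaussian errors with a per-model maximum-likelihood variance, and the residual values are made up.

```python
import math

def gaussian_pointwise_loglik(residuals):
    """Per-observation Gaussian log-likelihoods, using the MLE variance."""
    n = len(residuals)
    sigma2 = sum(r * r for r in residuals) / n
    const = -0.5 * math.log(2 * math.pi * sigma2)
    return [const - r * r / (2 * sigma2) for r in residuals]

def vuong_test(loglik_a, loglik_b):
    """Vuong's test for non-nested models A and B.

    Returns (z, p), where z > 0 favors model A and p is the two-sided
    p-value under a standard normal reference distribution. No
    parameter-count correction is applied, which is appropriate only
    when both models have the same number of parameters.
    """
    n = len(loglik_a)
    diffs = [a - b for a, b in zip(loglik_a, loglik_b)]
    mean_d = sum(diffs) / n
    var_d = sum((d - mean_d) ** 2 for d in diffs) / n
    z = math.sqrt(n) * mean_d / math.sqrt(var_d)
    p = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p

# Hypothetical residuals (observed minus fitted inconsistency score)
# for an exponential fit and a power-law fit on the same observations.
residuals_expo = [0.01, -0.02, 0.015, -0.01, 0.005, -0.015, 0.02, -0.005]
residuals_power = [0.08, -0.05, -0.09, 0.04, 0.10, -0.07, 0.06, -0.03]

z, p = vuong_test(gaussian_pointwise_loglik(residuals_expo),
                  gaussian_pointwise_loglik(residuals_power))
print(f"Vuong z = {z:.2f}, two-sided p = {p:.2e}")
```

A positive z with a small p-value indicates that the first model (here, the exponential fit) is closer to the data-generating process than the second. The test is asymptotic, so with only a handful of observations the p-value should be read with caution.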
Breakthrough Assessment
7/10
Provides a significant empirical correction, specific to factual consistency, to the assumption that all LLM behaviors follow power laws. The rigorous statistical validation strengthens the finding.