Factual Inconsistency in Data-to-Text Generation Scales Exponentially with LLM Size: A Statistical Validation

📝 Paper Summary

Factual Inconsistency Analysis
Scaling Laws

Contrary to the widely assumed power law for general performance, factual inconsistency in data-to-text tasks decreases exponentially as large language model size increases.

Core Problem

While LLMs generally follow power laws for perplexity and generalization error, it has been unclear whether factual inconsistency (hallucination) in data-to-text generation follows the same trend.

Why it matters:

Monitoring factual inconsistency is essential for building trustworthy D2T systems (e.g., automated journalism, conversation systems) where hallucinations undermine user trust
Existing scaling laws focus on loss or perplexity, overlooking specific failure modes like factual errors, leaving a gap in understanding how model size mitigates hallucination

Concrete Example: When generating text from a structured table about a restaurant, a smaller model might hallucinate a dish not present in the input. The paper investigates if simply scaling the model size reduces these errors linearly, exponentially, or by a power law.

Key Novelty

Exponential Scaling of Factual Inconsistency

Investigates the relationship between LLM parameter count and factual inconsistency scores across diverse D2T datasets and model families
Establishes a rigorous three-stage statistical framework (predictive performance, goodness-of-fit, and hypothesis testing) to formally validate that an exponential model fits the data significantly better than a power law

Evaluation Highlights

Exponential scaling consistently provides a better statistical fit than power law scaling for factual inconsistency across 3 LLM families (Pythia, OPT, BLOOM) and 5 D2T datasets
Vuong's likelihood-ratio test confirms the superiority of the exponential model over the power law model with high significance (p < 0.005) in nearly all experimental configurations
Inconsistency metrics (AlignScore, QAFactEval, SummaC-conv, UniEval-fact) all broadly support the exponential trend, with only minor deviations in specific model-dataset pairs (e.g., BLOOM on E2E)

Breakthrough Assessment

7/10

Provides a significant empirical correction to the assumption that all LLM behaviors follow power laws, specifically for factual consistency. The rigorous statistical validation strengthens the finding.

⚙️ Technical Details

Problem Definition

Setting: Modeling the relationship between model size x and factual inconsistency f(x) using statistical regression models

Inputs: LLM size (parameter count x) and corresponding factual inconsistency score derived from automatic metrics

Outputs: Parameters for Power Law (A, B, α) and Exponential (C, D, β) scaling models
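
The summary names the parameters of the two candidate models but not their functional forms. One common parameterization that matches those parameter names is sketched below; it is an assumption for illustration, not necessarily the exact equations used in the paper.

```latex
% Assumed forms: x is the parameter count, f(x) the factual inconsistency score.
f_{\mathrm{pow}}(x) = A\,x^{-\alpha} + B
\qquad\text{vs.}\qquad
f_{\mathrm{exp}}(x) = C\,e^{-\beta x} + D
```

Under this reading, B and D act as floors that the inconsistency score approaches as x grows, and α and β control how quickly that floor is reached. Both models have three free parameters, so neither is favored merely by being more flexible.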

Pipeline Flow

Fine-tuning (QLoRA on specific D2T dataset)
Generation (Nucleus Sampling)
Evaluation (Consistency Metrics)
Scaling Law Fitting (Huber Loss Minimization)

System Modules

Fine-tuning

Adapt the base LLM to the specific D2T task

Model or implementation: Pythia / OPT / BLOOM (various sizes)

Metric Calculation

Compute factual inconsistency scores

Model or implementation: AlignScore / QAFactEval / SummaC / UniEval

Modeling

Base Model: Pythia (70M–12B), OPT (125M–13B), BLOOM (560M–7B)

Training Method: QLoRA (Quantized Low-Rank Adapter)

Objective Functions:

Purpose: Minimize regression error for scaling law fitting.

Formally: Huber Loss (delta=1)
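
As a minimal sketch of the fitting objective, the Huber loss with delta = 1 can be written in a few lines. The curve `exp_curve`, its parameter values, and the data points are hypothetical and only illustrate how a candidate scaling curve would be scored; the paper's actual optimizer and data are not reproduced here.

```python
import math

def huber(residual, delta=1.0):
    """Huber loss: quadratic for small residuals, linear for large ones."""
    a = abs(residual)
    if a <= delta:
        return 0.5 * residual * residual
    return delta * (a - 0.5 * delta)

def mean_huber(predict, xs, ys, delta=1.0):
    """Average Huber loss of a candidate curve over observed (x, y) pairs."""
    return sum(huber(y - predict(x), delta) for x, y in zip(xs, ys)) / len(xs)

def exp_curve(x, C=0.5, beta=0.3, D=0.1):
    """Assumed exponential form C * exp(-beta * x) + D (hypothetical values)."""
    return C * math.exp(-beta * x) + D

# Synthetic example: model sizes in billions and made-up inconsistency scores.
sizes_b = [0.07, 0.41, 1.4, 6.9]
scores = [0.60, 0.53, 0.44, 0.17]
loss = mean_huber(exp_curve, sizes_b, scores)
```

A fitting routine would search over (C, β, D), and likewise (A, α, B) for the power law, to minimize this quantity. Note that when scores lie in [0, 1], residuals stay below delta = 1, so the loss reduces to half the squared error unless fitting is done on a transformed scale.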

Adaptation: QLoRA (rank r=16 for attention module)

Trainable Parameters: LoRA adapters only
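
To make "LoRA adapters only" concrete, the arithmetic below counts the parameters that a rank-16 adapter adds to one weight matrix. A LoRA update factors a d_out × d_in weight change into two matrices of shapes d_out × r and r × d_in. The hidden size of 4096 is an illustrative assumption, not a figure taken from the paper.

```python
def lora_params(d_in, d_out, rank):
    """Trainable parameters added by one LoRA adapter (matrices B and A)."""
    return rank * (d_in + d_out)

d = 4096                      # hypothetical hidden size of one attention projection
full = d * d                  # parameters in the frozen base weight matrix
added = lora_params(d, d, 16) # parameters trained by a rank-16 adapter
fraction = added / full       # share of the matrix size that is actually trained
```

For this hypothetical matrix the adapter trains under 1% as many parameters as the frozen weight it modifies, which is why the memory cost of fine-tuning stays small even for the largest models in each family.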

Training Data:

E2E (37K pairs)
ViGGO (7K pairs)
WikiTableText (13K pairs)
DART (70K triplets)
WebNLG (38K samples)

Key Hyperparameters:

learning_rate: 1.00e-04
lora_rank: 16
huber_delta: 1

Compute: Not reported in the paper

Comparison to Prior Work

vs. Kaplan/Hoffmann: Focuses on factual inconsistency rather than perplexity/loss; finds exponential scaling instead of power law

Limitations

Study limited to three model families (Pythia, OPT, BLOOM) and five datasets; other architectures (e.g., Llama) not tested.
Relies on automatic metrics (AlignScore, etc.) as proxies for human judgment of factuality.
Only explores decoder-only autoregressive models.
Results are presented primarily for nucleus sampling; other decoding strategies appear in the appendix.

Reproducibility

Datasets are public (E2E, ViGGO, WikiTableText, DART, WebNLG). Models are open-source (Pythia, OPT, BLOOM). Evaluation metrics are available as open-source implementations. Exact code for the scaling law fitting pipeline is not provided.

📊 Experiments & Results

Evaluation Setup

Fine-tune models of varying sizes on D2T datasets, measure factual inconsistency, and fit scaling laws.

Benchmarks:

E2E (MR-to-text, i.e. meaning representation to text; restaurant domain)
ViGGO (MR-to-text; video game domain)
WikiTableText (table-to-text; open domain)
DART (graph-to-text; open domain)
WebNLG (RDF-to-text; open domain)

Metrics:

AlignScore
QAFactEval
SummaC-conv
UniEval-fact

Statistical methodology: A three-stage framework: (1) predictive performance (5-fold cross-validated Huber loss), (2) goodness-of-fit (F-test, p < 0.05), (3) comparative analysis (Vuong's likelihood-ratio test, p < 0.005)
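
The comparative stage can be sketched as follows. This is a simplified illustration of Vuong's test for two non-nested models with the same number of parameters, assuming Gaussian residuals; the residual values are synthetic, and the helper names `gaussian_loglik` and `vuong_test` are hypothetical rather than taken from the paper.

```python
import math
from statistics import NormalDist, mean, pstdev

def gaussian_loglik(residuals):
    """Per-point Gaussian log-likelihoods, with the variance set to its MLE."""
    n = len(residuals)
    var = sum(r * r for r in residuals) / n
    return [-0.5 * (math.log(2 * math.pi * var) + r * r / var) for r in residuals]

def vuong_test(loglik_a, loglik_b):
    """Return (z, two-sided p). Positive z favors model A over model B."""
    diffs = [a - b for a, b in zip(loglik_a, loglik_b)]
    z = math.sqrt(len(diffs)) * mean(diffs) / pstdev(diffs)
    p = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p

# Synthetic residuals: the exponential fit is assumed to be much tighter.
resid_exp = [0.01, -0.02, 0.015, -0.01, 0.02, -0.015, 0.01, -0.005]
resid_pow = [0.05, -0.08, 0.06, -0.07, 0.09, -0.06, 0.05, -0.04]
z, p = vuong_test(gaussian_loglik(resid_exp), gaussian_loglik(resid_pow))
exponential_preferred = z > 0 and p < 0.005
```

Because both scaling models have three parameters, no complexity correction term is included here. With only a handful of model sizes per family, the normal approximation behind the test is rough, so the sketch shows the mechanics rather than a production-grade procedure.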

Key Results

Comparative analysis results using Vuong's test across datasets and metrics consistently favor the Exponential model over the Power Law model.

Benchmark	Test (Metric)	Power Law (Baseline)	Exponential (This Paper)	Outcome
E2E (Pythia Family)	Vuong's Test Decision (AlignScore)	Rejected	Selected	Exponential Preferred
ViGGO (OPT Family)	Vuong's Test Decision (AlignScore)	Rejected	Selected	Exponential Preferred
DART (Pythia Family)	Vuong's Test Decision (QAFactEval)	Rejected	Selected	Exponential Preferred
E2E (BLOOM Family)	F-Test Status (AlignScore)	Not reported in the paper	Not Qualified	Failed

Experiment Figures

Fitted curves for Power Law vs. Exponential Scaling on AlignScore across all datasets and models.

Fitted curves for Power Law vs. Exponential Scaling on QAFactEval.

Main Takeaways

Factual inconsistency in D2T generally follows an exponential scaling law with respect to model size, rather than the power law observed for perplexity.
This trend holds across different metrics (AlignScore, QAFactEval, etc.) and is most consistent for the Pythia and OPT model families, with high statistical significance.
The BLOOM model family exhibits more irregular behavior, often failing goodness-of-fit tests for both scaling laws on certain datasets (e.g., E2E).
Predictive performance (low Huber loss) is necessary but not sufficient for validating a scaling law; formal goodness-of-fit tests are crucial.
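
A quick diagnostic helps build intuition for how the two laws differ. Ignoring the offset terms (an assumption made only for this sketch), an exponential law is a straight line when log-score is plotted against size, while a power law is a straight line when log-score is plotted against log-size. The scores below are synthetic and generated from an exponential curve; the size grid mirrors the Pythia family in billions of parameters.

```python
import math

def linear_r2(xs, ys):
    """R^2 of an ordinary least-squares line fitted to (xs, ys)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = sxy / sxx
    ss_res = sum((y - (mean_y + slope * (x - mean_x))) ** 2 for x, y in zip(xs, ys))
    return 1.0 - ss_res / syy

sizes_b = [0.07, 0.16, 0.41, 1.0, 1.4, 2.8, 6.9, 12.0]  # billions of parameters
scores = [0.5 * math.exp(-0.3 * x) for x in sizes_b]    # synthetic, not paper data
log_scores = [math.log(s) for s in scores]

r2_exponential = linear_r2(sizes_b, log_scores)                        # semi-log fit
r2_power_law = linear_r2([math.log(x) for x in sizes_b], log_scores)   # log-log fit
```

On this synthetic data the semi-log fit is essentially perfect and the log-log fit is not. A visual check of this kind corresponds only to the predictive-performance stage, which is why the takeaway above stresses that formal goodness-of-fit and comparative tests are still needed.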

📚 Prerequisite Knowledge

Prerequisites

Scaling Laws (Power Law vs. Exponential)
Data-to-Text (D2T) Generation tasks
Statistical Hypothesis Testing (F-test, Vuong's test)

Key Terms

D2T: Data-to-Text generation—converting structured data (tables, graphs) into natural language

Factual Inconsistency: The failure of generated text to be supported (entailed) by the input data, often appearing as hallucinations; measured as 1 - Consistency Score

Huber Loss: A loss function used in robust regression that is less sensitive to outliers in data than squared error loss

Vuong's test: A statistical likelihood-ratio test used to compare non-nested models (like power law vs. exponential) to see which fits the data better

QLoRA: Quantized Low-Rank Adapter, a parameter-efficient fine-tuning technique that freezes a 4-bit quantized base model and trains only small low-rank adapter weights, reducing memory usage

AlignScore: A metric measuring factual consistency based on information alignment between source and generation

QAFactEval: A metric assessing consistency via question generation and answering (QG-QA) pipelines

SummaC-conv: A consistency metric leveraging natural language inference (NLI) models

UniEval-fact: A unified multi-dimensional evaluator where the 'fact' dimension measures factual consistency

Goodness-of-fit: A measure of how well a model's predicted values match the observed data; assessed here with an F-test

Pythia: A suite of decoder-only autoregressive language models designed for research on training dynamics

OPT: Open Pre-trained Transformer—a suite of open-source decoder-only models similar to GPT-3

BLOOM: BigScience Large Open-science Open-access Multilingual Language Model

Nucleus Sampling: A decoding strategy where the next token is sampled from the smallest set of top tokens whose cumulative probability exceeds a threshold p
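
The nucleus sampling definition above can be sketched in plain Python. This is a minimal illustration over a toy probability list, not the decoding code used in the paper; the function names are hypothetical, and the threshold check uses "reaches or exceeds p", which is one common convention.

```python
import random

def nucleus_indices(probs, p):
    """Indices of the smallest top-probability set whose mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return kept

def sample_top_p(probs, p, rng=random):
    """Sample one token index from the nucleus, weighted by probability."""
    kept = nucleus_indices(probs, p)
    weights = [probs[i] for i in kept]  # choices() renormalizes the weights
    return rng.choices(kept, weights=weights, k=1)[0]

toy_probs = [0.5, 0.3, 0.15, 0.05]       # toy next-token distribution
nucleus = nucleus_indices(toy_probs, 0.9)
token = sample_top_p(toy_probs, 0.9)
```

With p = 0.9, the lowest-probability token in the toy distribution is excluded from sampling, which is how the method trims the unreliable tail while keeping some diversity.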