Understanding LLM Behaviors via Compression: Data Generation, Knowledge Acquisition and Scaling Laws

📝 Paper Summary

Theoretical Understanding of LLMs, Scaling Laws, Data Compression and Prediction

The paper explains LLM scaling laws, knowledge acquisition order, and hallucinations by modeling training as a compression process where models learn ubiquitous syntax first and rare knowledge later, bounded by capacity.

Core Problem

Despite the empirical success of Large Language Models (LLMs), there is a lack of principled theoretical explanations for their scaling laws, the emergence of hallucinations, and the specific order in which they acquire different types of information (syntax vs. knowledge).

Why it matters:

Understanding *why* scaling laws hold is crucial for predicting future model performance and resource requirements.
Identifying the mechanism behind hallucinations (compression artifacts) can inform better training strategies to mitigate them.
Traditional learning theory frameworks fail to fully explain phenomena like in-context learning or the specific dynamics of knowledge acquisition.

Concrete Example: When an LLM is trained on limited data or with limited capacity, it might perfectly generate grammatically correct sentences (syntax) but hallucinate facts about rare entities (knowledge). Current theories don't quantitatively explain why the syntax is learned before the rare facts, or why the model confidently generates the wrong fact instead of abstaining.

Key Novelty

Syntax-Knowledge Model via Kolmogorov Compression

Views LLM training as a two-part code optimization: minimizing the model description length plus the data's compressed length (log-likelihood).
Proposes a hierarchical generative model separating 'Syntax' (parametric, learned quickly) from 'Knowledge' (non-parametric, power-law distributed), reflecting Zipf's and Heaps' laws.
Demonstrates that hallucinations are inevitable compression artifacts when model capacity is insufficient to encode rare knowledge tail events.

Architecture

A schematic of the Kolmogorov Structure Function (a) and the proposed Syntax-Knowledge Data Generation Model (b).

Evaluation Highlights

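Theoretically derives an upper bound for data scaling laws: O(1/N^(1-α) + 1/N) + H, where N is the training-data size, α is the discount parameter of the Pitman-Yor process (which sets the Zipfian power-law tail), and H is the irreducible entropy term. The power-law form is consistent with empirically observed scaling.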
Experimentally validates that models learn syntax patterns significantly faster than factual knowledge, which is acquired strictly according to frequency rank.
Shows that 'hallucinations' occur on knowledge elements that are frequent enough to be attempted but too rare to be perfectly memorized under capacity constraints.
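To make the shape of the data-scaling bound concrete, here is a minimal Python sketch that evaluates C_knw/N^(1-α) + C_syn/N + H at a few data sizes. The function name `loss_bound`, the constants, and the chosen α are illustrative assumptions, not values from the paper; the O(·) bound only fixes the exponents.

```python
def loss_bound(n, alpha, c_knw=1.0, c_syn=1.0, entropy=0.0):
    """Evaluate C_knw / N^(1 - alpha) + C_syn / N + H.

    All constants are hypothetical placeholders; only the exponents
    come from the bound stated in the summary.
    Returns (total, knowledge_term, syntax_term).
    """
    knowledge_term = c_knw / n ** (1.0 - alpha)
    syntax_term = c_syn / n
    return knowledge_term + syntax_term + entropy, knowledge_term, syntax_term


if __name__ == "__main__":
    alpha = 0.5  # assumed discount parameter, 0 < alpha < 1
    for n in (10**2, 10**4, 10**6):
        total, knw, syn = loss_bound(n, alpha)
        print(f"N={n:>8}  knowledge={knw:.6f}  syntax={syn:.6f}  total={total:.6f}")
```

With equal constants, the ratio of the knowledge term to the syntax term is N^α, so for any α between 0 and 1 the knowledge term increasingly dominates the excess loss as N grows.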

Breakthrough Assessment

8/10

Provides a strong theoretical grounding for empirical scaling laws using Kolmogorov complexity. The Syntax-Knowledge decomposition offers a clean, intuitive explanation for why LLMs hallucinate and how they learn.

⚙️ Technical Details

Problem Definition

Setting: Lossless compression of a data sequence X drawn from a source distribution P using a predictive model M.

Inputs: Training corpus X_{1:N} (sequence of tokens).

Outputs: A predictive distribution P_M that minimizes the expected code length (equivalent to minimizing cross-entropy loss).
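The link between code length and cross-entropy can be sketched in a few lines of Python: an ideal arithmetic coder driven by a predictive model spends about -log2 p bits on each token, so the total code length of a sequence is its summed negative log-likelihood. The function `code_length_bits` and the two toy models below are hypothetical, chosen only for illustration.

```python
import math


def code_length_bits(tokens, predict):
    """Ideal code length in bits: sum of -log2 p(token | prefix).

    `predict(prefix)` is an assumed interface returning a dict that maps
    each possible next token to its probability.
    """
    total = 0.0
    for i, token in enumerate(tokens):
        probability = predict(tokens[:i])[token]
        total += -math.log2(probability)
    return total


def uniform_model(prefix):
    # Ignores the prefix and treats both symbols as equally likely.
    return {"a": 0.5, "b": 0.5}


def skewed_model(prefix):
    # Ignores the prefix but matches the symbol frequencies of the toy data.
    return {"a": 0.75, "b": 0.25}


if __name__ == "__main__":
    sequence = list("aaab")
    for name, model in (("uniform", uniform_model), ("skewed", skewed_model)):
        bits = code_length_bits(sequence, model)
        print(f"{name}: {bits:.3f} bits total, {bits / len(sequence):.3f} bits/token")
```

Dividing the total by the sequence length gives the empirical cross-entropy in bits per token, which is why minimizing code length and minimizing cross-entropy loss select the same model.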

Pipeline Flow

Syntax Model Generation (Parametric)
Knowledge Model Generation (Non-parametric / Pitman-Yor)
Compression/Prediction (Minimizing Redundancy)

System Modules

Syntax Model (Data Generation / Theoretical Component)

Captures the structural/grammatical regularities of language that are ubiquitous and frequent.

Model or implementation: Parametric probabilistic model

Knowledge Model (Data Generation / Theoretical Component)

Encodes factual information (entities, facts) which follows a power-law distribution.

Model or implementation: Pitman-Yor Chinese Restaurant Process
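For readers unfamiliar with the Pitman-Yor Chinese Restaurant Process, the following Python sketch simulates its standard seating rule: with n customers seated at K tables, the next customer opens a new table with probability (θ + αK)/(n + θ) and otherwise joins table k with probability proportional to (n_k - α). Reading tables as distinct knowledge elements is an interpretive assumption here, and the concentration parameter θ (`theta`) is not mentioned in the summary; this does not reproduce the paper's full hierarchical construction.

```python
import random


def pitman_yor_crp(n_customers, alpha, theta=1.0, seed=0):
    """Simulate Pitman-Yor CRP seating; returns the customer count per table.

    alpha: discount in [0, 1); theta: concentration, assumed > 0 here.
    """
    rng = random.Random(seed)
    counts = []  # counts[k] = number of customers at table k
    for n in range(n_customers):
        k = len(counts)
        p_new = (theta + alpha * k) / (n + theta)
        if rng.random() < p_new:
            counts.append(1)  # open a new table (a new distinct element)
        else:
            # Join an existing table with weight (count - alpha).
            weights = [c - alpha for c in counts]
            table = rng.choices(range(k), weights=weights)[0]
            counts[table] += 1
    return counts


if __name__ == "__main__":
    for alpha in (0.2, 0.5, 0.8):
        for n in (1000, 4000):
            tables = pitman_yor_crp(n, alpha)
            print(f"alpha={alpha}  N={n}  distinct elements={len(tables)}")
```

For α > 0 the number of occupied tables is known to grow on the order of N^α, a Heaps-style vocabulary growth, and the table sizes follow a power law, which is the Zipf-like behavior the Knowledge Model relies on.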

Novel Architectural Elements

Theoretical separation of Syntax (parametric) and Knowledge (non-parametric Pitman-Yor) within a unified Bayesian coding framework to derive scaling bounds.

Modeling

Base Model: Syntax-Knowledge Model (Theoretical Framework)

Training Method: Theoretical analysis of Redundancy Minimization (equivalent to Perplexity Minimization)

Objective Functions:

Purpose: Minimize the total description length of the data.

Formally: min_M [ L(M) + L(X|M) ] where L(M) is model code length and L(X|M) is data code length.
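The two-part objective can be illustrated with a deliberately small Python sketch. Here the "model" is a single Bernoulli parameter quantized to b bits, so L(M) = b and L(X|M) is the negative log-likelihood of a binary sequence in bits. This coding scheme, and the names `data_bits` and `best_two_part_code`, are assumptions made for illustration and are not the paper's construction.

```python
import math


def data_bits(xs, p):
    """L(X|M): code length of binary data xs under Bernoulli(p), in bits."""
    ones = sum(xs)
    zeros = len(xs) - ones
    return -(ones * math.log2(p) + zeros * math.log2(1.0 - p))


def best_two_part_code(xs, max_bits=8):
    """Search over precision b and quantized p to minimize L(M) + L(X|M).

    Assumed coding scheme: describing p to b bits costs exactly b bits.
    Returns (total_bits, b, p).
    """
    best = None
    for b in range(1, max_bits + 1):
        levels = 2 ** b
        for j in range(levels):
            p = (j + 0.5) / levels  # cell midpoints keep p strictly in (0, 1)
            total = b + data_bits(xs, p)
            if best is None or total < best[0]:
                best = (total, b, p)
    return best


if __name__ == "__main__":
    xs = [1] * 30 + [0] * 10
    total, b, p = best_two_part_code(xs)
    print(f"best precision b={b}, p={p}, total={total:.3f} bits")
```

The search makes the trade-off visible: a finer parameter grid can lower L(X|M), but each extra bit of precision is charged to L(M), so the minimizer stops refining once the data no longer pays for the added model description.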

Key Hyperparameters:

α (alpha): Discount parameter of the Pitman-Yor process (controls the power-law tail)
N: Size of training data
C_knw: Complexity constant for the knowledge model
C_syn: Complexity constant for the syntax model

Comparison to Prior Work

vs. Kaplan/Hoffmann: Provides a *theoretical derivation* for the scaling exponents based on data distribution properties (Zipf's and Heaps' laws) rather than just empirical curve fitting.
vs. Allen-Zhu & Li: Adopts a similar synthetic setup for validation but focuses on the information-theoretic bound (Kolmogorov complexity) rather than specific gradient descent dynamics [not cited in paper, but conceptually related].

Limitations

The Syntax-Knowledge model is a simplified abstraction of real natural language.
Theoretical bounds rely on Bayesian coding assumptions which may not perfectly map to SGD-trained neural networks.
Quantitative results (constants) depend on specific parameters of the Pitman-Yor process which are hard to estimate for real text.

📊 Experiments & Results

Evaluation Setup

Theoretical derivation backed by synthetic data experiments and analysis of existing scaling laws.

Benchmarks:

Synthetic Syntax-Knowledge Tasks (Next-token prediction on hierarchically generated data) [New]

Metrics:

Perplexity (PPL)
Cross-Entropy Loss
Redundancy (Expected Regret)
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Theoretical bounds derived for data scaling laws.
N/A (Theoretical)	Redundancy Bound	Not reported in the paper	Not reported in the paper	Not reported in the paper
Model scaling behaviors regarding knowledge acquisition.
Synthetic Data	Knowledge Accuracy	Not reported in the paper	Not reported in the paper	Not reported in the paper

Experiment Figures

Comparison of theoretical predictions vs. experimental results for model scaling behaviors.

Accuracy of knowledge elements sorted by their frequency in the training data.

Main Takeaways

Data Scaling: The loss decays as a power law O(1/N^(1-α)) dominated by the knowledge component, while the syntax component decays faster, as O(1/N).
Model Scaling: Increasing model size allows the capture of the 'tail' of the knowledge distribution. Small models are forced to treat rare knowledge as noise.
Hallucinations: Interpreted as the model using a lower-complexity approximation for the tail of the distribution. The model predicts the 'average' or 'most likely' outcome for a broad class of rare events rather than the specific true fact.
Fine-tuning: Primarily updates the syntax model (instruction following) while retaining the knowledge base acquired during pre-training.
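The model-scaling and hallucination takeaways can be pictured with a toy Python calculation. It is a simplification assumed for illustration, not the paper's model: knowledge elements have Zipf-distributed frequencies, a model with capacity k stores the k most frequent elements exactly, and for every other element it falls back to a uniform guess over `num_answers` candidate answers. All names and numbers are hypothetical.

```python
import math


def zipf_probs(num_items, exponent):
    """Normalized Zipf frequencies: p_i proportional to rank^(-exponent)."""
    weights = [rank ** -exponent for rank in range(1, num_items + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def expected_loss_bits(probs, capacity, num_answers):
    """Expected excess loss when only the top-`capacity` elements are stored.

    Stored elements cost 0 bits; each unstored (tail) element costs
    log2(num_answers) bits because the toy model guesses uniformly.
    """
    tail_mass = sum(probs[capacity:])
    return tail_mass * math.log2(num_answers)


if __name__ == "__main__":
    probs = zipf_probs(num_items=10_000, exponent=1.2)
    for capacity in (10, 100, 1_000, 10_000):
        loss = expected_loss_bits(probs, capacity, num_answers=16)
        print(f"capacity={capacity:>6}  expected excess loss={loss:.4f} bits")
```

In this toy setting the excess loss equals the probability mass of the unstored tail times the cost of a guess, so growing capacity removes loss in order of frequency rank, while queries about tail elements still receive a confident-looking but generic answer.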

📚 Prerequisite Knowledge

Prerequisites

Kolmogorov Complexity and Structure Functions
Shannon Information Theory (Entropy, KL Divergence)
Bayesian Nonparametrics (Pitman-Yor Process)
Arithmetic Coding

Key Terms

Kolmogorov Structure Function: A function characterizing the compressibility of data given a constraint on the model's complexity (size).

Two-part Code: A compression scheme describing data by first describing the model, then describing the data using that model.

Redundancy: The difference between the expected code length of a model and the fundamental entropy of the data source.

Pitman-Yor Process: A stochastic process used in Bayesian nonparametrics to generate power-law distributed data (like word frequencies in natural language).

Zipf's Law: An empirical law stating that the frequency of a token is inversely proportional to its rank.

Heaps' Law: An empirical law describing how the number of distinct vocabulary items grows with the size of the document collection.

Minimal Sufficient Statistics: The simplest model that captures all regularities in the data, leaving only random noise as residual.

Scaling Laws: Empirical relationships predicting model performance (loss) as a power-law function of compute, data size, or parameter count.