How Do Language Models Acquire Character-Level Information?

📝 Paper Summary

Interpretability, Language Model Pre-training

By manipulating pre-training data and tokenizers, this study reveals that language models acquire character-level knowledge through two distinct mechanisms: statistical artifacts of subword merge rules and semantic associations between word forms and meanings.

Core Problem

Language models are trained on subword tokens and never receive explicit character-level supervision, yet they still learn spelling and character composition.

Why it matters:

Understanding this mechanism is crucial for explaining how models handle tasks like rhyming, punning, or spell-checking without explicit character input.
Prior work demonstrated *that* models learn characters but failed to explain *how*, leaving a gap in understanding the role of tokenization versus semantic learning.

Concrete Example: A model knows the token 'apple' contains the character 'p', even though 'apple' is a single atomic integer ID to the model. It is unclear if this is due to seeing 'ap' + 'ple' elsewhere (tokenization artifacts) or associating the concept of apple with its spelling (semantic association).

Key Novelty

Causal Disentanglement of Character Acquisition Factors

Isolates factors by pre-training LMs on transformed corpora: 'WordSub' (random strings replacing words to remove semantic links) and 'CharPert' (random character replacement to remove orthography).
Designs a 'controlled tokenizer' with explicit merge rules to mathematically prove how merge priority orders leak character adjacency information into subword statistics.
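
The two corpus transformations can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the lowercase-only alphabet, whitespace word splitting, and the choice to keep the original length in WordSub are all assumptions made for illustration.

```python
import random
import string

def word_sub(words, seed=0):
    """WordSub sketch: map each distinct word to one fixed random string.

    Assumption: the replacement keeps the original length and uses
    lowercase ASCII letters only.
    """
    rng = random.Random(seed)
    mapping, used, out = {}, set(), []
    for w in words:
        if w not in mapping:
            candidate = "".join(rng.choice(string.ascii_lowercase) for _ in w)
            while candidate in used:  # keep the mapping one-to-one
                candidate = "".join(rng.choice(string.ascii_lowercase) for _ in w)
            mapping[w] = candidate
            used.add(candidate)
        out.append(mapping[w])
    return out, mapping

def char_pert(words, seed=0):
    """CharPert sketch: replace every character independently at random."""
    rng = random.Random(seed)
    return ["".join(rng.choice(string.ascii_lowercase) for _ in w) for w in words]

corpus = "the cat saw the other cat".split()
subbed, mapping = word_sub(corpus)
perturbed = char_pert(corpus)
```

In this sketch, repeated words in `subbed` share the same random string, so co-occurrence statistics survive while the link between spelling and meaning does not. In `perturbed`, even repeated words generally get different strings, so only length is preserved.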

Architecture

The probing task pipeline: Tokenizer -> Embedding -> MLP -> Classification.

Evaluation Highlights

Merge rules alone allow models to predict character inclusion with 58.2% accuracy (vs 50% random guess) even when all linguistic meaning is stripped.
Orthographic constraints (like 'q' implies 'u') contribute significantly: accuracy drops ~30 points when within-word constraints are removed.
Semantic association is critical: replacing meaningful words with random strings lowers character probing accuracy by roughly 11-13 percentage points.

Breakthrough Assessment

7/10

Provides the first systematic, causal explanation for a well-known phenomenon. While the models are small (nanoGPT), the experimental design is rigorous and offers fundamental insights into LM internal mechanics.

⚙️ Technical Details

Problem Definition

Setting: Binary classification probing of pre-trained token embeddings

Inputs: Embedding vector of a single token $w$

Outputs: Binary label $y$ indicating presence/absence of a specific character $\alpha$ in token $w$
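
A small sketch of how the binary labels $y$ could be derived from a vocabulary is shown here. The helper name `char_presence_labels`, the toy vocabulary, and the lowercase alphabet are assumptions for illustration; the probe itself would then be trained on pairs of (frozen embedding of $w$, label).

```python
import string

def char_presence_labels(vocab, alphabet=string.ascii_lowercase):
    """Return {char: [0/1 per token]}: 1 if the token contains the character."""
    return {a: [int(a in tok.lower()) for tok in vocab] for a in alphabet}

vocab = ["apple", "the", "ing", "qu"]  # toy vocabulary, not from the paper
labels = char_presence_labels(vocab)
```

Each character $\alpha$ defines its own binary task over the same set of token embeddings.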

Pipeline Flow

Data Manipulation (Original, WordSub, CharPert)
Tokenizer Training (BPE, WordPiece, or Controlled)
LM Pre-training (BERT-Tiny or nanoGPT)
Probing Classifier Training (MLP on frozen embeddings)

System Modules

Tokenizer

Segment text into subwords based on specific rules (BPE/WordPiece/Controlled)

Model or implementation: Hugging Face Tokenizers (BPE/WordPiece)

Language Model

Learn contextual representations from token sequences

Model or implementation: nanoGPT (12 layers, 768 hidden) or BERT-Tiny

Probing Classifier

Predict if a token contains a specific character

Model or implementation: 2-layer MLP (SELU activation, Tanh)

Novel Architectural Elements

Controlled Tokenizer: A tokenizer where merge rules are manually specified (e.g., specific priority for 'ab' vs 'bc') to trace how merge priority influences embedding geometry
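
To make the role of merge priority concrete, the following is a minimal sketch of applying a hand-specified, ordered merge list to a word, in the style of standard BPE encoding. The function `apply_merges` and the two toy rule lists are hypothetical; the paper's controlled tokenizer may differ in its details.

```python
def apply_merges(word, merges):
    """Segment `word` using ordered merge rules (earlier in the list = higher priority)."""
    rank = {pair: i for i, pair in enumerate(merges)}
    symbols = list(word)
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        # Pick the adjacent pair whose rule has the highest priority (lowest rank).
        best = min(pairs, key=lambda p: rank.get(p, float("inf")))
        if best not in rank:
            break  # no applicable rule left
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

ab_first = apply_merges("abc", [("a", "b"), ("b", "c")])
bc_first = apply_merges("abc", [("b", "c"), ("a", "b")])
```

With the same two rules, swapping their order changes the segmentation of "abc" from `['ab', 'c']` to `['a', 'bc']`, which is the kind of priority effect the controlled tokenizer is designed to isolate.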

Modeling

Base Model: nanoGPT (approx. 124M parameters, 12 layers, 12 heads, 768 dim)

Training Method: Pre-training from scratch

Objective Functions:

Purpose: Next-token prediction (standard causal language modeling).

Formally: Maximize log P(token_t | tokens_{<t})
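
Written out as a loss, and assuming the common convention of averaging over the $T$ positions of a training sequence $w_1, \dots, w_T$ (the normalization is an assumption, not stated in the summary), the objective is the negative log-likelihood:

$$
\mathcal{L}(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \log P_\theta\left(w_t \mid w_{<t}\right)
$$

Minimizing $\mathcal{L}(\theta)$ is equivalent to maximizing the log-probability of each token given its preceding tokens.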

Training Data:

FineWeb 10B token sample
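WordSub variant: Replaces each token with a consistent random string, removing form-meaning links
CharPert variant: Randomly replaces every character, keeping token length but removing orthographic regularities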

Key Hyperparameters:

batch_size: 1280 (effective)
sequence_length: 1024
learning_rate: 6e-4
weight_decay: 1e-1
warmup_iterations: 2000
total_iterations: 10000
optimizer: AdamW
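
As a rough sanity check on the training budget, the listed values can be combined as below. This is only arithmetic on the reported hyperparameters, under the assumption that the effective batch size counts sequences rather than tokens.

```python
config = {
    "batch_size": 1280,          # effective; assumed to count sequences
    "sequence_length": 1024,
    "learning_rate": 6e-4,
    "weight_decay": 1e-1,
    "warmup_iterations": 2000,
    "total_iterations": 10000,
}

tokens_per_iteration = config["batch_size"] * config["sequence_length"]
total_tokens = tokens_per_iteration * config["total_iterations"]
warmup_fraction = config["warmup_iterations"] / config["total_iterations"]
```

Under that assumption, each iteration processes 1,310,720 tokens, the full run sees about 13.1B tokens (somewhat more than one pass over a 10B-token sample), and warmup covers the first 20% of training.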

Compute: 24 hours on a single A100 GPU (nanoGPT)

Comparison to Prior Work

vs. Kaushal and Mahowald: Uses length-balanced probing so that token length cannot act as a confound
vs. Edman et al.: Focuses on *causal mechanisms* via pre-training interventions rather than just evaluating existing models
vs. Itzhak and Levy: Investigates the impact of tokenization algorithms (BPE vs Word) directly
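
One plausible way to build a length-balanced probing set is sketched below. The summary does not describe the exact matching procedure, so the function `length_balanced_sample` and its strategy (equal numbers of positive and negative tokens within each token length) are assumptions for illustration.

```python
import random
from collections import defaultdict

def length_balanced_sample(vocab, char, seed=0):
    """Sample (token, label) pairs with equal positives and negatives per token length."""
    rng = random.Random(seed)
    by_len = defaultdict(lambda: ([], []))  # length -> (negatives, positives)
    for tok in vocab:
        by_len[len(tok)][int(char in tok)].append(tok)
    sample = []
    for length in sorted(by_len):
        neg, pos = by_len[length]
        k = min(len(neg), len(pos))  # drop the surplus class at this length
        sample += [(t, 1) for t in rng.sample(pos, k)]
        sample += [(t, 0) for t in rng.sample(neg, k)]
    return sample

vocab = ["at", "an", "on", "it", "the", "and", "cat", "dog", "ing", "a"]  # toy data
sample = length_balanced_sample(vocab, "a")
```

Because longer tokens are more likely to contain any given character, a probe could otherwise score above chance by detecting length alone. Balancing within each length keeps the chance baseline at 50%.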

Limitations

Analysis restricted to small-scale models (nanoGPT, BERT-Tiny) due to compute constraints
Focuses primarily on English (FineWeb dataset)
Probing accuracy is a proxy for 'information acquisition' but doesn't guarantee the model uses this info for downstream tasks
Binary classification of character presence is a coarse measure of character-level knowledge

Reproducibility

The paper does not explicitly state whether code is available. The dataset (FineWeb) is public. Experimental details (hyperparameters, probe architecture) are fully described, and the tokenizer construction logic is detailed.

📊 Experiments & Results

Evaluation Setup

Probing task: Binary classification of character presence in tokens from pre-trained embeddings.

Benchmarks:

Character Probing (Matched) (Binary Classification) [New]

Metrics:

Accuracy (Micro-averaged across all alphabet characters)
Statistical methodology: Not explicitly reported in the paper
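
For readers unfamiliar with the distinction, the sketch below contrasts micro-averaging (pooling all predictions across characters) with macro-averaging (averaging per-character accuracies). The counts are invented for illustration and are not results from the paper.

```python
def micro_accuracy(results):
    """results: {char: (num_correct, num_examples)}; pool counts over all characters."""
    correct = sum(c for c, _ in results.values())
    total = sum(n for _, n in results.values())
    return correct / total

def macro_accuracy(results):
    """Average the per-character accuracies, weighting every character equally."""
    return sum(c / n for c, n in results.values()) / len(results)

results = {"e": (700, 1000), "q": (9, 10)}  # hypothetical counts
micro = micro_accuracy(results)
macro = macro_accuracy(results)
```

With these toy counts, micro-averaging is dominated by the frequent character 'e', while macro-averaging gives the rare character 'q' equal weight.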

Key Results

Comparison of character probing accuracy across different pre-training data transformations to isolate semantic vs. tokenization factors.
Benchmark	Metric	Baseline (%)	This Paper (%)	Δ (points)
Character Probing (Matched)	Accuracy	53.8	66.7	+12.9
Character Probing (Matched)	Accuracy	58.2	66.7	+8.5
Character Probing (Matched)	Accuracy	61.2	66.7	+5.5

Controlled tokenizer experiments isolating the effect of orthographic constraints and merge rules.
Benchmark	Metric	Baseline (%)	This Paper (%)	Δ (points)
Character Probing (Matched)	Accuracy	50.1	62.4	+12.3
Character Probing (Matched)	Accuracy	50.1	80.3	+30.2

Experiment Figures

Conceptual diagram categorizing factors of character acquisition: Tokenization-dependent (Merge Rules, Orthographic Constraints) vs. Independent (Semantics, Syntax).

Correlation between merge rule strength (ID in merge table) and frequency of character pairs at subword boundaries.
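
The boundary statistic in this figure can be computed along the following lines. This is a sketch with a hypothetical helper, `boundary_pairs`, and a hand-made segmentation; it counts the character pair that straddles each boundary between adjacent subwords of the same word.

```python
from collections import Counter

def boundary_pairs(segmented_words):
    """Count (last char of left subword + first char of right subword) per boundary."""
    counts = Counter()
    for pieces in segmented_words:
        for left, right in zip(pieces, pieces[1:]):
            counts[left[-1] + right[0]] += 1
    return counts

# Toy segmentation in which 't' + 'h' has always been merged.
segmented = [["th", "e"], ["th", "at"], ["th", "en"], ["a", "th", "l", "ete"], ["cat"]]
counts = boundary_pairs(segmented)
```

In this toy data the pair "th" never appears across a boundary because it is always merged, whereas "he" appears twice. Relating such counts to each pair's position in the merge table gives the kind of correlation the figure reports.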

Main Takeaways

Merge rules act as a statistical encoding mechanism: Frequent merges (e.g., 'th' + 'e') create embedding geometries that implicitly encode the constituent characters.
Orthographic constraints are a massive factor: Knowing the previous subword allows the model to guess the characters of the next subword with high accuracy (e.g., 'q' -> 'u').
Semantic associations (form-meaning links) contribute roughly 11-13 percentage points of probing accuracy; the remaining above-chance accuracy is attributed to structural factors tied to tokenization.
Models encode character information more reliably for shorter subwords than for longer ones.
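
The 'q' -> 'u' example of an orthographic constraint can be made concrete by estimating next-character probabilities from a word list. The sketch below uses a tiny invented word list and a hypothetical helper, `next_char_distribution`; it only illustrates the statistic, not the paper's analysis.

```python
from collections import Counter, defaultdict

def next_char_distribution(words):
    """Estimate P(next char | current char) from within-word character bigrams."""
    follow = defaultdict(Counter)
    for w in words:
        for a, b in zip(w, w[1:]):
            follow[a][b] += 1
    return {
        a: {b: n / sum(c.values()) for b, n in c.items()}
        for a, c in follow.items()
    }

words = ["queen", "quick", "quote", "equal", "tree", "tent"]  # toy data
dist = next_char_distribution(words)
```

Here 'q' is followed by 'u' every time, so a model that knows the previous subword ends in 'q' can predict that the next one starts with 'u' without any direct character-level signal. Less constrained characters such as 't' spread their probability over several successors.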

📚 Prerequisite Knowledge

Prerequisites

Subword tokenization algorithms (BPE, WordPiece)
Language model pre-training objectives (MLM, Next-token prediction)
Probing classifiers for interpretability

Key Terms

BPE: Byte-Pair Encoding, a tokenization algorithm that builds subwords by iteratively merging the most frequent pair of adjacent symbols

WordPiece: A tokenization algorithm similar to BPE but based on likelihood maximization, commonly used in BERT

probing classifier: A simple model (usually a linear layer or MLP) trained on top of a frozen model's representations to test if specific information is encoded

FineWeb: A high-quality web text dataset used for pre-training

nanoGPT: A small-scale implementation of the GPT architecture, used here for efficient experimentation

WordSub: A data transformation where every vocabulary token is replaced by a consistent random string, preserving syntax but destroying form-meaning correlations

CharPert: A data transformation where every character is randomly replaced, destroying orthographic regularities while keeping token length

controlled tokenizer: A custom tokenizer with manually defined merge rules designed to test specific hypotheses about character adjacency statistics

orthographic constraints: Rules governing allowed character sequences in a language (e.g., 'ck' rarely starts a word)

MLP: Multilayer Perceptron—a basic feedforward neural network used here as the probe