Lost in the Middle at Birth: An Exact Theory of Transformer Position Bias

📝 Paper Summary

Architectural Inductive Bias · Position Bias in LLMs · Transformer Theory

The U-shaped performance curve in LLMs is a geometric property of causal decoder architectures with residual connections. It is already present at initialization and is not merely a learned artifact or a side effect of positional encoding.

Core Problem

LLMs exhibit a 'Lost in the Middle' U-shaped performance curve where retrieval is poor in the middle of the context, but the mechanistic cause is debated between learned artifacts and positional encoding decay.

Why it matters:

Engineering efforts to fix this often target symptoms (e.g., modifying RoPE) rather than the root cause
Current theories reason circularly: they use parameters from trained networks and therefore cannot isolate architectural priors
The performance degradation significantly limits the effective utility of long-context windows in modern LLMs

Concrete Example: In a multi-document QA task, a model retrieves information perfectly if the relevant document is at the start or end of the prompt, but fails if the exact same document is placed in the middle, regardless of content relevance.

Key Novelty

Geometric Birthright Theory of Position Bias

Models multi-layer causal attention as iterated powers of the Cesàro matrix to derive a closed-form influence density at initialization (Step 0)
Proves causal masking mathematically forces a logarithmic Primacy Tail (attention sinks), while residual connections force an isolated Recency Anchor (delta spike)
Demonstrates that the 'middle' is a factorial dead zone where gradient influence is diluted by causal mixing, creating a structural valley before training begins
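
To make the terms "logarithmic" and "factorial" concrete, here is a hedged sketch of where they can come from. It assumes each layer acts as I + M (residual plus uniform causal attention) and replaces M by the continuous Cesàro (Hardy) averaging operator C; the symbols C, \rho, and the normalized position t are introduced here for illustration, and the paper's exact closed form and normalization may differ.

```latex
% Assumption: one layer = I + M, and I commutes with M, so
(I + M)^{H} = \sum_{k=0}^{H} \binom{H}{k} M^{k}.

% Continuous analogue of M and its k-th power (standard identity, k >= 1):
(Cf)(x) = \frac{1}{x}\int_{0}^{x} f(t)\,dt,
\qquad
(C^{k}f)(x) = \frac{1}{x}\int_{0}^{x} \frac{\left[\ln(x/t)\right]^{k-1}}{(k-1)!}\, f(t)\,dt.

% Influence of normalized position t in (0, 1] on the final position x = 1:
\rho(t) = \underbrace{\delta(1 - t)}_{k = 0:\ \text{residual path}}
        + \sum_{k=1}^{H} \binom{H}{k}\, \frac{\left[\ln(1/t)\right]^{k-1}}{(k-1)!}.
```

Under these assumptions the k = 0 term is an isolated spike at the last position, while the remaining terms grow like powers of ln(1/t) as t approaches 0, each weighted by 1/(k-1)!.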

Evaluation Highlights

Theoretical continuous density equations achieve Spearman correlation of 0.99 with the empirical Jacobian of untrained 24-layer Qwen2 models
Wasserstein distance between theoretical prediction and empirical initialization is 0.02, confirming the closed-form Cesàro operator captures discrete topology
Empirical Jacobian shape is nearly identical (Spearman correlation 0.99) with or without RoPE at initialization, ruling out RoPE as the root cause

Breakthrough Assessment

9/10

Provides the first exact, closed-form proof that 'Lost in the Middle' is an architectural initialization property, resolving conflicting theories and fundamentally shifting the understanding of transformer geometry.

⚙️ Technical Details

Problem Definition

Setting: Theoretical analysis of gradient flow (Jacobian sensitivity) in deep linear causal transformers at initialization

Inputs: Sequence of input tokens of length L

Outputs: Influence density (Jacobian norm) of the final hidden state with respect to input positions
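
As a hedged illustration of the "Jacobian norm" output defined above, the sketch below uses a toy one-layer linear model Y = A X W, not the paper's setup or Qwen2. The names A, W, final_state, influence, and influence_fd are hypothetical. It checks the closed-form influence of input position j on the final hidden state against a finite-difference Jacobian.

```python
import numpy as np

rng = np.random.default_rng(0)
L, d = 6, 4
A = np.tril(rng.random((L, L)))    # hypothetical causal (lower-triangular) mixing matrix
W = rng.standard_normal((d, d))    # hypothetical channel-mixing weights

def final_state(X):
    """Final-position hidden state of the toy linear model Y = A X W."""
    return (A @ X @ W)[-1]

def influence(j):
    """Frobenius norm of d y_L / d x_j; for this model it is |A[L, j]| * ||W||_F."""
    return abs(A[-1, j]) * np.linalg.norm(W)

def influence_fd(j, eps=1e-6):
    """Same quantity estimated with central finite differences."""
    X = rng.standard_normal((L, d))
    J = np.zeros((d, d))
    for c in range(d):
        dX = np.zeros((L, d))
        dX[j, c] = eps
        J[:, c] = (final_state(X + dX) - final_state(X - dX)) / (2 * eps)
    return np.linalg.norm(J)

profile = np.array([influence(j) for j in range(L)])   # influence per input position
```

In a real network the Jacobian would normally come from automatic differentiation; the toy model only shows that, in the linear case, the influence profile is the last row of the position-mixing matrix up to a constant factor.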

Pipeline Flow

Input Sequence
Linear Causal Attention (Cesàro) + Residual Connections
Iterated Layer Applications (H layers)
Output Hidden State

System Modules

Causal Attention (Routing)

Mixes information from past tokens; modeled as the Cesàro matrix M with M_ij = 1/i for j <= i and 0 otherwise, so row i is a uniform average over positions 1 through i

Model or implementation: Theoretical Operator
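
The Cesàro operator described above can be sketched in a few lines of NumPy. The helper name cesaro_matrix is hypothetical; the code only illustrates that applying M to a sequence yields its running mean, which is what uniform causal attention computes.

```python
import numpy as np

def cesaro_matrix(L: int) -> np.ndarray:
    """Lower-triangular matrix with entry 1/i in row i (1-based) for columns j <= i."""
    rows = np.arange(1, L + 1, dtype=float)[:, None]   # 1-based row index as a column
    return np.tril(np.ones((L, L))) / rows

M = cesaro_matrix(5)
x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

# Each output position is the mean of all inputs up to and including it.
mixed = M @ x
running_mean = np.cumsum(x) / np.arange(1, len(x) + 1)
print(mixed)
print(running_mean)
```

Every row of M sums to 1 and nothing above the diagonal is non-zero, so no position can draw on later tokens.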

Residual Connection (Routing)

Preserves current token information; modeled as Identity matrix I

Model or implementation: Theoretical Operator

Novel Architectural Elements

Theoretical decomposition of the transformer into a linear 'Cesàro + Residual' operator to isolate topological priors
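
A minimal numerical sketch of this decomposition, assuming each layer is exactly I + M with no normalization (the paper's operator may be scaled differently). The function names and the small sizes L = 512, H = 4 are hypothetical choices for illustration. The last row of the layer product gives the weight of every input position in the final token's output.

```python
import numpy as np

def cesaro_matrix(L):
    """Uniform causal attention: row i averages positions 1..i."""
    rows = np.arange(1, L + 1, dtype=float)[:, None]
    return np.tril(np.ones((L, L))) / rows

def influence_profile(L, H):
    """Last row of (I + M)^H: weight of each input position in the final output."""
    layer = np.eye(L) + cesaro_matrix(L)     # residual path + causal averaging
    return np.linalg.matrix_power(layer, H)[-1]

L, H = 512, 4                                # hypothetical small sizes
profile = influence_profile(L, H)

first = profile[0]          # primacy end
middle = profile[L // 2]    # mid-context
before_last = profile[-2]   # neighbor of the final token
last = profile[-1]          # recency anchor (residual path)
print(first, middle, before_last, last)
```

In this toy setting the weight decays monotonically from the first position toward the end and then jumps at the final position, which is contributed almost entirely by the identity term. The relative heights of the three regions depend on L and H.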

Modeling

Base Model: Qwen2-0.5B (H=24 layers, hidden size 896)

📊 Experiments & Results

Evaluation Setup

Analysis of Input-Output Jacobian norms at initialization and after pretraining

Benchmarks:

NaturalQuestions (NQ): multi-document QA, used for the pretrained-context analysis

Metrics:

Jacobian Norm (Influence Density)
Spearman Rank Correlation
Wasserstein Distance
Statistical methodology: Spearman rank correlation and Wasserstein distance to compare theoretical vs. empirical distributions
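
The two comparison metrics can be sketched with NumPy alone. This is an illustration under assumptions, not the paper's evaluation code: the profiles below are synthetic stand-ins, ties in the rank computation are ignored, and the Wasserstein distance treats each profile as a distribution over positions normalized to [0, 1]. SciPy's scipy.stats.spearmanr and scipy.stats.wasserstein_distance are the usual library routines for these quantities.

```python
import numpy as np

def spearman(a, b):
    """Spearman rank correlation as the Pearson correlation of ranks (assumes no ties)."""
    rank_a = np.argsort(np.argsort(a))
    rank_b = np.argsort(np.argsort(b))
    return float(np.corrcoef(rank_a, rank_b)[0, 1])

def wasserstein_1d(p, q):
    """Wasserstein-1 distance between two profiles on the same grid over [0, 1],
    computed as the integral of |CDF_p - CDF_q|."""
    p = np.asarray(p, dtype=float) / np.sum(p)
    q = np.asarray(q, dtype=float) / np.sum(q)
    dx = 1.0 / len(p)
    return float(np.sum(np.abs(np.cumsum(p) - np.cumsum(q))) * dx)

# Synthetic stand-ins for a theoretical and an empirical profile (not data from the paper).
positions = np.arange(1, 101)
theory = 1.0 / positions
empirical = theory ** 1.1

rho = spearman(theory, empirical)       # monotone transform, so the ranks agree
w1 = wasserstein_1d(theory, empirical)  # small but non-zero: the shapes differ slightly
print(rho, w1)
```

Spearman correlation only checks that the two curves order positions the same way, while the Wasserstein distance also penalizes how far mass has to move, so the two metrics are complementary.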

Key Results

Validation of the theoretical model against untrained networks demonstrates near-perfect fit and independence from RoPE.
Benchmark	Metric	Baseline	This Paper	Δ
Qwen2-0.5B (Untrained)	Spearman Correlation	1.0	0.99	-0.01
Qwen2-0.5B (Untrained)	Wasserstein Distance	0.0	0.02	+0.02
Qwen2-0.5B (Untrained)	Spearman Correlation	1.0	0.99	-0.01
NaturalQuestions	Peak-to-Trough Ratio (Log Scale)	100	1000	+900

Experiment Figures

Log-scale plot comparing the Theoretical Continuous Prediction vs. Empirical Qwen2 Jacobian (Step 0) vs. Qwen2 No-RoPE Jacobian.

Jacobian norms for Initialized vs. Pretrained models on NaturalQuestions, including a 'chunked' condition with no separators.

Main Takeaways

The U-shape is an inherent geometric property of causal decoders with residuals, present at Step 0.
RoPE is mathematically irrelevant to the attention distribution at initialization due to rotational symmetry of isotropic Gaussians.
Standard pretraining does not overcome the topological valley; it learns localized spikes (content detection) but the macroscopic dead zone persists.
The 'middle' is a structural dead zone where gradient influence is factorially suppressed relative to the primacy and recency extremes.
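
The rotational-symmetry point about RoPE rests on a simple fact that can be checked numerically: a RoPE rotation is orthogonal, so applying it to an isotropic Gaussian vector leaves the distribution unchanged. The sketch below assumes the standard adjacent-pair form of RoPE with base 10000; the function name rope_rotation is hypothetical, and real implementations may pair dimensions differently. It illustrates the underlying fact and is not the paper's proof.

```python
import numpy as np

def rope_rotation(position: int, dim: int, base: float = 10000.0) -> np.ndarray:
    """Block-diagonal rotation applied by RoPE at one position (adjacent-pair form)."""
    R = np.zeros((dim, dim))
    for i in range(dim // 2):
        angle = position * base ** (-2.0 * i / dim)
        c, s = np.cos(angle), np.sin(angle)
        R[2 * i:2 * i + 2, 2 * i:2 * i + 2] = [[c, -s], [s, c]]
    return R

dim = 8
R = rope_rotation(position=17, dim=dim)

# If k ~ N(0, I), then R k has covariance R I R^T, which equals I when R is orthogonal.
cov_after = R @ np.eye(dim) @ R.T

rng = np.random.default_rng(0)
k = rng.standard_normal(dim)
norm_before = np.linalg.norm(k)
norm_after = np.linalg.norm(R @ k)   # rotations preserve length
print(np.allclose(cov_after, np.eye(dim)), np.isclose(norm_before, norm_after))
```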

📚 Prerequisite Knowledge

Prerequisites

Transformer architecture (Attention, Residuals, MLP)
Matrix calculus and Jacobian matrices
Linear algebra (Cesàro matrices)

Key Terms

Cesàro matrix: A lower-triangular matrix where non-zero entries in row i are 1/i, representing a cumulative average; models uniform causal attention at initialization

Primacy bias: The tendency of models to attend heavily to the first few tokens; mathematically proven here to be a logarithmic divergence caused by causal masking

Recency bias: The tendency of models to attend to the most recent tokens; proven here to be an isolated delta spike caused by residual connections

RoPE: Rotary Position Embeddings, a method of encoding position by rotating query/key vectors; shown here to be irrelevant to the initialization topology due to rotational symmetry

Jacobian: A matrix of all first-order partial derivatives of a vector-valued function; its norm measures how much the output changes given a change in input

Isotropic Gaussian: A distribution where variance is the same in all directions; used to model random weight initialization

SwiGLU: A gated activation function (Swish-gated linear unit) used in the MLP blocks of modern LLMs such as Qwen2 and Llama

RMSNorm: Root Mean Square Normalization, a normalization technique used in transformers to stabilize training

Spearman correlation: A statistical measure of rank correlation (monotonic relationship) between two variables

Wasserstein distance: A distance measure between probability distributions, also known as Earth Mover's Distance