From Data Statistics to Feature Geometry: How Correlations Shape Superposition

📝 Paper Summary

Mechanistic Interpretability, Feature Superposition, Neural Network Geometry

Correlated features in neural networks organize into geometric structures that utilize interference constructively to aid reconstruction, rather than arranging into regular polytopes to suppress interference as noise.

Core Problem

Existing theories of superposition assume features are sparse and uncorrelated, predicting 'interference filtering' geometries (regular polytopes) that fail to explain the semantic clusters and circular structures observed in real language models.

Why it matters:

Current geometric theories cannot account for structures found in real LLMs (e.g., ordered circles of features like months, anisotropic clusters)
Understanding feature geometry is critical for dictionary learning (Sparse Autoencoders), knowledge editing, and adversarial robustness
Treating all interference as noise ignores the efficiency gains available from exploiting data correlations

Concrete Example: Standard superposition theory predicts features should form a regular polytope to minimize dot products (noise). However, real LLMs represent 'months of the year' as a circle. Standard theory cannot explain why this structure is optimal, as it increases dot products between adjacent months.
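To make the example concrete, the sketch below places 12 hypothetical "month" directions evenly on a unit circle and inspects their pairwise dot products. The 2D embedding and the variable names are illustrative assumptions, not weights from any real model.

```python
import numpy as np

# Illustrative assumption: 12 "month" features as unit vectors evenly spaced on a circle.
n = 12
angles = 2 * np.pi * np.arange(n) / n
W = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # shape (12, 2), one row per month

gram = W @ W.T            # pairwise dot products between feature directions

adjacent = gram[0, 1]     # e.g. January vs February: cos(30 degrees)
opposite = gram[0, 6]     # e.g. January vs July: cos(180 degrees)
```

Adjacent months overlap strongly (a dot product of about 0.87), which an interference-minimizing arrangement would avoid. That overlap is the quantity the uncorrelated-feature theory treats as noise.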

Key Novelty

Constructive Interference in Linear Superposition

Demonstrates that when data has low-rank covariance, interference from correlated features can align with the target signal rather than acting as noise
Introduces Bag-of-Words Superposition (BOWS), a framework using real text data to study how realistic correlations drive feature geometry beyond idealized independent setups
Identifies that weight decay and tight bottlenecks favor these 'constructive' low-rank solutions over the sparse, interference-filtering solutions predicted by prior work
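A minimal numerical sketch of the first point, under stated assumptions: features are generated from m latent factors (so the covariance has rank m), and the tied-weight linear autoencoder uses the top-m principal directions as W. The data-generating process here is hypothetical and chosen only to make the covariance exactly low-rank.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, n = 12, 2, 5000

# Assumption: each feature vector f is a linear mix of m latent factors,
# so the feature covariance has rank m.
A = rng.normal(size=(d, m))
Z = rng.normal(size=(n, m))
F = Z @ A.T                           # (n, d); each row is one feature vector f

# Tied-weight linear autoencoder with W = top-m principal directions, shape (m, d).
_, _, Vt = np.linalg.svd(F, full_matrices=False)
W = Vt[:m]

G = W.T @ W                           # (d, d) dot products between feature directions
F_hat = F @ G                         # f_hat = W^T W f, applied row-wise (G is symmetric)

# Split the reconstruction of feature i into its own contribution and the
# contribution of all other features (the interference).
i = 0
self_term = G[i, i] * F[:, i]
interference = F_hat[:, i] - self_term
residual = F[:, i] - self_term        # what the self term alone fails to recover

max_err = np.abs(F_hat - F).max()
corr = np.corrcoef(interference, residual)[0, 1]
```

In this idealized setting the interference term supplies exactly the part of the target that the feature's own direction misses, so `corr` is essentially 1 and the reconstruction error is at floating-point level. Real data is only approximately low-rank, so the alignment would be partial.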

Architecture

The Bag-of-Words Superposition (BOWS) data pipeline and Autoencoder setup

Evaluation Highlights

Synthetic experiments show ReLU Autoencoders abandon interference-filtering (antipodal) structures for constructive interference (circular) structures when bottlenecks are tight (m < 6 for 12 features)
Formalizes 'linear superposition' where linear decoders can recover features typically thought to require non-linear filtering
Demonstrates that weight decay biases models toward low-rank solutions ($||W||^2_F \approx m$) over sparse solutions ($||W||^2_F \approx d$)
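The norm comparison in the last point can be checked directly. The sketch assumes that a low-rank solution has m orthonormal rows and that an interference-filtering solution gives each of the d features a roughly unit-norm direction; both matrices below are random stand-ins, not trained weights.

```python
import numpy as np

d, m = 12, 4
rng = np.random.default_rng(0)

# Low-rank (PCA-like) stand-in: W has m orthonormal rows, so ||W||_F^2 = m.
Q, _ = np.linalg.qr(rng.normal(size=(d, m)))   # Q has orthonormal columns, shape (d, m)
W_low_rank = Q.T                                # shape (m, d)

# Interference-filtering stand-in (assumption): every feature column has unit norm,
# so ||W||_F^2 = d regardless of how the columns are arranged.
W_unit = rng.normal(size=(m, d))
W_unit /= np.linalg.norm(W_unit, axis=0, keepdims=True)

norm_low_rank = np.sum(W_low_rank ** 2)
norm_unit = np.sum(W_unit ** 2)
```

With d = 12 and m = 4 the squared Frobenius norms are 4 and 12, so an L2 penalty on W charges the unit-norm arrangement three times as much.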

Breakthrough Assessment

7/10

Provides a significant theoretical correction to the standard view of superposition, explaining empirically observed structures (circles, clusters) that prior theories could not. The BOWS framework offers a valuable testbed.

⚙️ Technical Details

Problem Definition

Setting: Reconstructing a high-dimensional feature vector f from a lower-dimensional projection, considering feature correlations

Inputs: Feature vector f in R^d (e.g., binary bag-of-words)

Outputs: Reconstructed feature vector f_hat in R^d

Pipeline Flow

Input Feature Vector f
Encoder (projects to lower dimension m)
Decoder (Linear or ReLU reconstruction)
Output Reconstruction f_hat

System Modules

Encoder

Compress high-dimensional features into a lower-dimensional bottleneck

Model or implementation: Linear transformation W (m x d, with d = V for BOWS), giving h = W f

Decoder

Reconstruct the original features from the compressed representation

Model or implementation: ReLU(W^T h + b) for the ReLU AE or W^T h + b for the Linear AE, where h = W f is the encoder output

Novel Architectural Elements

Integration of Bag-of-Words (BOWS) data generation pipeline to strictly control ground-truth features while retaining realistic covariance statistics

Modeling

Base Model: Autoencoder with tied weights

Training Method: Self-supervised reconstruction training (the input features serve as the target)

Objective Functions:

Purpose: Minimize reconstruction error between input features and output.

Formally: $L = ||f - \mathrm{ReLU}(W^T W f + b)||^2_2$ (for ReLU AE) or $L = ||f - W^T W f - b||^2_2$ (for Linear AE)
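The two objectives can be sketched in a few lines of NumPy. The function names and toy sizes are illustrative assumptions; only the tied-weight form and the squared-error loss come from the description above.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def reconstruct(W, b, f, use_relu=True):
    """Tied-weight autoencoder: encode h = W f, decode with W^T and bias b."""
    h = W @ f                  # bottleneck activations, shape (m,)
    pre = W.T @ h + b          # pre-activation reconstruction, shape (d,)
    return relu(pre) if use_relu else pre

def loss(W, b, f, use_relu=True):
    """Squared L2 reconstruction error for a single feature vector f."""
    diff = f - reconstruct(W, b, f, use_relu)
    return float(diff @ diff)

# Toy sizes (hypothetical): d = 6 features squeezed through m = 2 dimensions.
d, m = 6, 2
rng = np.random.default_rng(0)
W = 0.1 * rng.normal(size=(m, d))
b = np.zeros(d)
f = np.array([1.0, 0.0, 0.0, 1.0, 0.0, 0.0])

relu_loss = loss(W, b, f, use_relu=True)
linear_loss = loss(W, b, f, use_relu=False)
```

In practice the loss would be averaged over a batch and minimized with a gradient-based optimizer, optionally with weight decay on W; that training loop is omitted here.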

Training Data:

WikiText-103
Tokenized, keeping only the most frequent words (V = 10,000)
Binary bag-of-words vectors constructed from chunks of context size c=20
Training split: 1,621,198 samples; Validation split: 180,133 samples
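The data construction above might look roughly like the following. Whitespace tokenization, non-overlapping chunks, and the helper name `build_bows` are assumptions made for illustration; the paper's repository may differ on each point.

```python
from collections import Counter
import numpy as np

def build_bows(tokens, vocab_size, context_size):
    """Binary bag-of-words vectors from consecutive chunks of `context_size` tokens.

    Assumptions: chunks do not overlap, a trailing partial chunk is dropped,
    and out-of-vocabulary tokens are ignored.
    """
    counts = Counter(tokens)
    vocab = [w for w, _ in counts.most_common(vocab_size)]
    index = {w: i for i, w in enumerate(vocab)}

    n_chunks = len(tokens) // context_size
    X = np.zeros((n_chunks, len(vocab)), dtype=np.float32)
    for k in range(n_chunks):
        chunk = tokens[k * context_size:(k + 1) * context_size]
        for w in chunk:
            j = index.get(w)
            if j is not None:
                X[k, j] = 1.0   # presence only; repeated words do not add up
    return X, vocab

# Tiny stand-in corpus; the real setup uses WikiText-103 with V = 10,000 and c = 20.
tokens = "the cat sat on the mat the dog sat on the rug".split()
X, vocab = build_bows(tokens, vocab_size=4, context_size=4)
```

Each row of `X` is one feature vector f. Because words that co-occur in text tend to share chunks, the columns of `X` inherit realistic correlations while the ground-truth features stay known.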

Key Hyperparameters:

context_size_c: 20
vocab_size_V: 10000

Compute: Not reported in the paper

Comparison to Prior Work

vs. Elhage et al. (2022): Uses realistically correlated features (BOWS) rather than IID or pairwise-correlated synthetic data, and accounts for constructive interference in addition to interference filtering
vs. Standard SAE interpretation: Argues that 'interference' is not always noise to be removed but can be signal to be used

Limitations

Focuses on autoencoder toy models and small synthetic setups; the extension to full LLMs is argued theoretically rather than demonstrated
Relies on the Linear Representation Hypothesis (LRH) holding true for the features of interest
BOWS is a simplified proxy for real language representations (binary bag-of-words ignores word order)

Reproducibility

Code: https://github.com/LucasPrietoAl/correlations-feature-geometry

Code is publicly available on GitHub. Dataset construction (WikiText-103 BOWS) is fully described with specific parameters (V=10k, c=20). Synthetic data generation details (circular covariance) are referenced in Appendix A.

📊 Experiments & Results

Evaluation Setup

Reconstruction of feature vectors under bottleneck constraints

Benchmarks:

Synthetic Circular Data (Autoencoder Reconstruction) [New]
WikiText-103 BOWS (Bag-of-Words Reconstruction) [New]

Metrics:

Reconstruction Loss (MSE)
Feature Geometry (qualitative analysis of weight arrangements)
Statistical methodology: Not explicitly reported in the paper

Key Results

Synthetic experiments with 12 features having circular covariance structure show how bottleneck size (m) dictates geometry.
Benchmark	Metric	Baseline	This Paper	Δ
Synthetic Circular Data	Geometry Type	Circular (Principal Subspace)	Circular (Constructive Interference)	-
Synthetic Circular Data	Geometry Type	Circular (Principal Subspace)	Antipodal Pairs	-
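To see why a circular covariance yields a circular principal subspace, the sketch below builds a hypothetical 12 x 12 covariance that decays with circular distance and projects the features onto its leading non-constant eigenvectors. The Gaussian kernel and its width are assumptions for illustration and are not the recipe from the paper's Appendix A.

```python
import numpy as np

n = 12
idx = np.arange(n)

# Circular distance between features i and j (e.g. December is adjacent to January).
dist = np.abs(idx[:, None] - idx[None, :])
dist = np.minimum(dist, n - dist)

# Assumption: covariance decays smoothly with circular distance (hypothetical kernel).
C = np.exp(-(dist / 2.0) ** 2)

# eigh returns eigenvalues in ascending order; re-sort to descending.
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# For this kernel the top eigenvector is constant; the next two share one eigenvalue
# and span the frequency-1 sine/cosine pair. Use them as 2D feature coordinates.
coords = eigvecs[:, 1:3]                  # shape (12, 2), one row per feature
radii = np.linalg.norm(coords, axis=1)
```

All 12 rows of `coords` have the same radius, so the features sit on a circle in that plane. This is the PCA-like geometry a linear autoencoder with a small bottleneck would be expected to pick up under this covariance.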

Experiment Figures

Comparison of geometries learned by Linear vs ReLU AEs on 12-feature circular data across different bottleneck sizes m

Main Takeaways

Constructive interference dominates when bottlenecks are tight or weight decay is high, as it allows lower-norm solutions ($||W||^2_F \approx m$) than interference filtering ($||W||^2_F \approx d$)
ReLU Autoencoders are not strictly forced to use non-linear filtering; they can and do revert to linear, PCA-like solutions when correlations make interference useful
Real-world feature correlations (like months or semantic clusters) naturally encourage geometries that look like circles or clusters, rather than the regular polytopes predicted by uncorrelated feature theories

📚 Prerequisite Knowledge

Prerequisites

Mechanistic Interpretability (Superposition, Sparse Autoencoders)
Linear Algebra (PCA, Rank, Covariance Matrices)
Autoencoder architectures

Key Terms

Superposition: The phenomenon where neural networks represent more features than they have dimensions by storing them non-orthogonally

Interference: The noise introduced to a feature's reconstruction by the activation of other non-orthogonal features sharing the same subspace

Constructive Interference: A regime where interference from correlated features is positively correlated with the target signal, aiding reconstruction instead of degrading it

BOWS: Bag-of-Words Superposition—a dataset and framework using binary word occurrence vectors from text to study superposition with realistic correlations

Linear Superposition: A regime where superposed features can be recovered with high accuracy using a linear decoder, typically due to low-rank data structure

ReLU: Rectified Linear Unit, the activation function ReLU(x) = max(0, x); in this setting it zeroes out negative interference in the reconstruction

SAE: Sparse Autoencoder—a model used to disentangle superposed representations into interpretable features

Antipodal pairs: A geometric arrangement where two features share a dimension but point in opposite directions (1 and -1), allowing a ReLU to separate them
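A minimal sketch of an antipodal pair, assuming two features share a single bottleneck dimension with directions +1 and -1 and a zero bias. The function names are illustrative.

```python
import numpy as np

# Two features (d = 2) stored in one dimension (m = 1), pointing in opposite directions.
W = np.array([[1.0, -1.0]])
b = np.zeros(2)

def relu_ae(f):
    """Tied-weight ReLU autoencoder: ReLU(W^T W f + b)."""
    return np.maximum(W.T @ (W @ f) + b, 0.0)

def linear_ae(f):
    """Same weights without the ReLU: W^T W f + b."""
    return W.T @ (W @ f) + b

f1 = np.array([1.0, 0.0])     # only feature 1 active
f2 = np.array([0.0, 1.0])     # only feature 2 active
both = np.array([1.0, 1.0])   # both active at once
```

With one feature active, `relu_ae` returns the input exactly, whereas `linear_ae(f1)` gives [1, -1]: the ReLU removes the negative interference written onto the other feature. With both active, the bottleneck cancels to zero and `relu_ae(both)` returns [0, 0], which is why this arrangement relies on the two features rarely co-occurring.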