Minor First, Major Last: A Depth-Induced Implicit Bias of Sharpness-Aware Minimization

📝 Paper Summary

Implicit Bias in Deep Learning | Optimization Dynamics | Sharpness-Aware Minimization (SAM)

While SAM converges to the same asymptotic direction as Gradient Descent in linear models, in depth-2 networks it exhibits a distinct 'sequential feature amplification' bias where minor features are learned before major ones.

Core Problem

Existing analyses of SAM's implicit bias focus mostly on infinite-time limits or squared loss, failing to explain finite-time behaviors where SAM's trajectory deviates significantly from Gradient Descent (GD).

Why it matters:

Understanding implicit bias is crucial for explaining why over-parameterized networks generalize well
Current infinite-time theories incorrectly suggest SAM and GD always share the same bias in certain settings, missing critical finite-time differences
The observed bias toward minor features (background/noise) in SAM could explain empirical phenomena like robustness or failure modes in real-world vision tasks

Concrete Example: In a 2-layer linear diagonal network trained on a single data point with features (1, 2), Gradient Descent immediately aligns with the major feature (2). However, L2-SAM initially amplifies the minor feature (1) and only shifts to the major feature later, or sometimes never if initialization is small.
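The example above can be explored numerically. The following is a minimal sketch, not the paper's experiment: it uses discrete steps rather than the continuous-time flow, and the values alpha = 0.5, rho = 1.0, eta = 0.1, and 50 steps are illustrative assumptions. Setting rho = 0 reduces the update to plain GD.

```python
import math

MU = (1.0, 2.0)  # the single training example (mu, +1) from the text

def dloss(z):
    """Derivative of the logistic loss log(1 + exp(-z)), computed stably."""
    if z < 0:
        return -1.0 / (1.0 + math.exp(z))
    e = math.exp(-z)
    return -e / (1.0 + e)

def grad(w1, w2):
    """Gradient of the loss w.r.t. both layers of a depth-2 diagonal network."""
    z = sum(a * b * m for a, b, m in zip(w1, w2, MU))
    dl = dloss(z)
    g1 = [dl * b * m for b, m in zip(w2, MU)]
    g2 = [dl * a * m for a, m in zip(w1, MU)]
    return g1, g2

def step(w1, w2, eta, rho):
    """One update: plain GD if rho == 0, otherwise an L2-SAM step."""
    g1, g2 = grad(w1, w2)
    norm = math.sqrt(sum(g * g for g in g1 + g2))
    if rho > 0 and norm > 0:
        # Ascent perturbation of radius rho along the normalized gradient.
        w1p = [w + rho * g / norm for w, g in zip(w1, g1)]
        w2p = [w + rho * g / norm for w, g in zip(w2, g2)]
        g1, g2 = grad(w1p, w2p)  # gradient evaluated at the perturbed point
    new_w1 = [w - eta * g for w, g in zip(w1, g1)]
    new_w2 = [w - eta * g for w, g in zip(w2, g2)]
    return new_w1, new_w2

def dominant_index(w1, w2):
    """Index of the largest-magnitude coordinate of beta = w1 * w2."""
    beta = [abs(a * b) for a, b in zip(w1, w2)]
    return max(range(len(beta)), key=beta.__getitem__)

def run(alpha, rho, eta=0.1, steps=50):
    """Train from uniform initialization and record the dominant index per step."""
    w1, w2 = [alpha] * len(MU), [alpha] * len(MU)
    history = []
    for _ in range(steps):
        w1, w2 = step(w1, w2, eta, rho)
        history.append(dominant_index(w1, w2))
    return history

gd_history = run(alpha=0.5, rho=0.0)   # index 1 is the major feature (mu = 2)
sam_history = run(alpha=0.5, rho=1.0)  # index 0 is the minor feature (mu = 1)
```

With these settings the GD run is dominated by the major coordinate from the first step, while the first L2-SAM step favors the minor coordinate. How long that lasts depends on alpha and rho, so the history lists are best inspected rather than assumed.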

Key Novelty

Sequential Feature Amplification in L2-SAM

Identifies a phenomenon where SAM initially relies on minor coordinates (weak features) and gradually shifts to larger ones as training proceeds or initialization increases
Proves that the gradient normalization factor in L2-SAM's perturbation term suppresses major features early in training, allowing minor ones to dominate initially
Demonstrates that for depth L=2, SAM's implicit bias is time-dependent and initialization-dependent, unlike GD which monotonically favors major features
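The suppression mechanism can be seen in a short worked calculation at initialization. This is an illustrative sketch, not the paper's proof, and it assumes depth L = 2, a single example (mu, +1), uniform initialization w^(1) = w^(2) = alpha, and the first-order L2-SAM perturbation rho * grad / ||grad||_2.

```latex
% Illustrative derivation under the assumptions stated above.
\begin{align*}
\nabla_{w^{(1)}} \mathcal{L}(\theta) &= \ell'(z)\, w^{(2)} \odot \mu,
  \qquad z = \langle w^{(1)} \odot w^{(2)}, \mu \rangle, \qquad \ell'(z) < 0 \\
\|\nabla \mathcal{L}(\theta)\|_2 &= |\ell'(z)|\, \alpha \sqrt{2}\, \|\mu\|_2
  \qquad \text{(at } w^{(1)} = w^{(2)} = \alpha \mathbf{1}\text{)} \\
\hat{w}_j &= \alpha - \frac{\rho\, \mu_j}{\sqrt{2}\, \|\mu\|_2}
  \qquad \text{(perturbed weight, both layers)} \\
\dot{w}^{(1)}_j &= |\ell'(\hat{z})|\, \mu_j \left( \alpha - \frac{\rho\, \mu_j}{\sqrt{2}\, \|\mu\|_2} \right)
  \quad \text{(L2-SAM)},
  \qquad
\dot{w}^{(1)}_j = |\ell'(z)|\, \alpha\, \mu_j \quad \text{(GD)}
\end{align*}
```

Under GD the initial growth rate is proportional to mu_j, so the major coordinate always leads. Under L2-SAM the normalized perturbation subtracts a term proportional to mu_j, so larger features are penalized more. For mu = (1, 2), where ||mu||_2 = sqrt(5), the minor coordinate grows faster at initialization whenever alpha < 3 * rho / sqrt(10).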

Architecture

Figure: trajectories of the predictor beta(t) for GD, L-infinity SAM, and L2-SAM on a 2D toy dataset, comparing depth 1 with depth 2.

Evaluation Highlights

L-infinity SAM converges to minor features (standard basis vectors) instead of the major feature for a wide range of initializations, unlike GD
L2-SAM on depth-2 networks exhibits three distinct dynamic regimes based on initialization scale: convergence to zero, sequential feature amplification (minor -> major), or immediate major feature alignment
Theoretical lower bounds show minor features can grow to be >10x larger than major features during the transient phase of L2-SAM training

Breakthrough Assessment

8/10

Provides a rigorous theoretical counter-example to the common assumption that infinite-time bias characterizes optimization behavior. The discovery of 'Sequential Feature Amplification' offers a novel, finite-time perspective on SAM.

⚙️ Technical Details

Problem Definition

Setting: Binary classification on linearly separable datasets using L-layer linear diagonal networks

Inputs: Dataset {(x_i, y_i)} with logistic loss

Outputs: Linear predictor beta(theta) defined as the element-wise product of layer weights

Pipeline Flow

Input Data (Linearly Separable)
L-layer Linear Diagonal Network
Logistic Loss Calculation
SAM Perturbation Step (L2 or L-infinity)
Parameter Update (Gradient Descent on Perturbed Loss)

System Modules

Linear Diagonal Network

Model architecture for theoretical analysis

Model or implementation: f(x) = <w(1) ⊙ ... ⊙ w(L), x>
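As a minimal sketch of this model, the predictor is the element-wise product of the layer vectors and the output is its inner product with the input. The function names `predictor` and `forward` are illustrative, not taken from the paper.

```python
def predictor(layers):
    """Return beta = w(1) * ... * w(L), multiplied coordinate by coordinate."""
    beta = [1.0] * len(layers[0])
    for w in layers:
        beta = [b * wj for b, wj in zip(beta, w)]
    return beta

def forward(layers, x):
    """Return f(x) = <beta, x> for the linear diagonal network."""
    return sum(b * xj for b, xj in zip(predictor(layers), x))

# Depth-2 network with uniform initialization 0.5 on a 2D input.
layers = [[0.5, 0.5], [0.5, 0.5]]
output = forward(layers, [1.0, 2.0])  # 0.25 * 1 + 0.25 * 2
```

With depth L = 1 the predictor is just the single weight vector, which recovers an ordinary linear model.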

SAM Optimizer

Update rule

Model or implementation: Perturbation epsilon_p(theta) added to weights before gradient calculation
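One way to sketch this update for a linear diagonal network with logistic loss is shown below. It assumes the standard first-order form of the perturbation (the normalized gradient for L2, the sign of the gradient for L-infinity) and a single training example. The function names and the discrete step size eta are illustrative choices, not the paper's code.

```python
import math

def logistic_loss_grad(layers, x, y):
    """Gradient of log(1 + exp(-y * f(x))) with respect to every layer."""
    depth, dim = len(layers), len(x)
    beta = [math.prod(layers[l][j] for l in range(depth)) for j in range(dim)]
    z = y * sum(beta[j] * x[j] for j in range(dim))
    dl = -1.0 / (1.0 + math.exp(z))  # derivative of log(1 + exp(-z))
    grads = []
    for l in range(depth):
        layer_grad = []
        for j in range(dim):
            # Product of the other layers' weights at coordinate j (1 if depth == 1).
            others = math.prod(layers[k][j] for k in range(depth) if k != l)
            layer_grad.append(dl * y * x[j] * others)
        grads.append(layer_grad)
    return grads

def sam_perturbation(grads, rho, p="l2"):
    """First-order worst-case perturbation inside the rho-ball of the given norm."""
    if p == "l2":
        norm = math.sqrt(sum(g * g for layer in grads for g in layer))
        if norm == 0.0:
            return [[0.0 for _ in layer] for layer in grads]
        return [[rho * g / norm for g in layer] for layer in grads]
    if p == "linf":
        sign = lambda g: math.copysign(1.0, g) if g != 0 else 0.0
        return [[rho * sign(g) for g in layer] for layer in grads]
    raise ValueError("p must be 'l2' or 'linf'")

def sam_step(layers, x, y, rho, eta, p="l2"):
    """Perturb the weights, take the gradient there, then update the original weights."""
    eps = sam_perturbation(logistic_loss_grad(layers, x, y), rho, p)
    perturbed = [[w + e for w, e in zip(wl, el)] for wl, el in zip(layers, eps)]
    g_hat = logistic_loss_grad(perturbed, x, y)
    return [[w - eta * g for w, g in zip(wl, gl)] for wl, gl in zip(layers, g_hat)]
```

Note that the gradient is evaluated at the perturbed weights but applied to the unperturbed ones, which is what distinguishes the step from plain GD.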

Modeling

Base Model: Linear Diagonal Networks (Depth L=1, L=2, L=3+)

Training Method: Sharpness-Aware Minimization (SAM) Flow

Objective Functions:

Purpose: Minimize worst-case logistic loss in a neighborhood.

Formally: min_theta max_{||epsilon||_p <= rho} L(theta + epsilon)
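The inner maximization is usually handled by linearizing the loss around theta, which gives a closed-form perturbation through the dual norm. The block below states that standard first-order approximation for the two norms discussed here. It is offered as a sketch of the usual construction, and the paper's exact definition of epsilon_p(theta) may differ in details.

```latex
% First-order approximation of the inner maximization (standard construction).
\begin{align*}
\mathcal{L}(\theta + \epsilon) &\approx \mathcal{L}(\theta) + \langle \nabla \mathcal{L}(\theta), \epsilon \rangle \\
\epsilon_2(\theta) &= \rho\, \frac{\nabla \mathcal{L}(\theta)}{\|\nabla \mathcal{L}(\theta)\|_2}
  \qquad (p = 2) \\
\epsilon_\infty(\theta) &= \rho\, \operatorname{sign}\!\big(\nabla \mathcal{L}(\theta)\big)
  \qquad (p = \infty) \\
\dot{\theta}(t) &= -\nabla \mathcal{L}\big(\theta(t) + \epsilon_p(\theta(t))\big)
  \qquad \text{(SAM flow)}
\end{align*}
```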

Training Data:

Single-example dataset {(mu, +1)} for tractable analysis
Synthetic banded data for experiments
MNIST for real-world verification

Key Hyperparameters:

rho: Perturbation radius (e.g., 0.05, 0.1, 1.0)
alpha: Initialization scale (uniform across coordinates)
eta: Step size (for discrete experiments)
L: Network depth

Compute: Not reported in the paper

Comparison to Prior Work

vs. GD (Depth 1): Both SAM and GD converge to L2 max-margin.
vs. GD (Depth 2+): GD consistently converges to L1 max-margin (major features). L-infinity SAM can converge to minor features depending on initialization. L2-SAM shows transient amplification of minor features before asymptotic convergence.

Limitations

Theoretical analysis primarily focuses on single-data-point settings (though experiments suggest extension to multi-point)
Analysis relies on continuous-time flows rather than discrete updates (though shown to be close)
Restricted to linear diagonal networks, which are simpler than practical deep non-linear networks

Reproducibility

The paper is primarily theoretical with proofs in appendices. Synthetic experiments are fully specified (dataset parameters, initialization, depth). Real-world MNIST experiment details (model architecture, hyperparameters) are provided in Section E.4.

📊 Experiments & Results

Evaluation Setup

Analysis of weight trajectories on synthetic linearly separable data and MNIST classification

Benchmarks:

Single-example Toy Dataset (Binary Classification) [New]
Synthetic Banded Dataset (Binary Classification) [New]
MNIST (Image Classification)

Metrics:

Directional convergence (cosine similarity)
Dominant coordinate index
Amplification ratio (magnitude of a given feature's coordinate relative to the major feature's coordinate)
Test Accuracy
Statistical methodology: Not explicitly reported in the paper
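The trajectory metrics above are simple functions of the predictor beta. The sketch below shows one plausible way to compute them. The exact definitions used in the paper are not given here, so the choice of absolute values and the function names are assumptions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v (directional convergence check)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def dominant_index(beta):
    """Index of the coordinate of beta with the largest magnitude."""
    return max(range(len(beta)), key=lambda j: abs(beta[j]))

def amplification_ratio(beta, j, major):
    """Assumed definition: |beta_j| divided by |beta_major|."""
    return abs(beta[j]) / abs(beta[major])
```

For example, `cosine_similarity(beta, e_major)` close to 1 would indicate alignment with the major feature, while `amplification_ratio(beta, j, major)` above 1 would indicate that coordinate j currently outweighs it.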

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
L-infinity SAM behavior on single-example data (mu = (1, 2)) shows strong dependence on initialization compared to GD.
Single-example (L=2)	Limit Direction	e_2 (Major feature)	e_1 (Minor feature)	-
L2-SAM exhibits Sequential Feature Amplification on Depth-2 networks.
Single-example (L=2)	Dominant Index over Time	Index 5 (Major)	Index 1 -> 2 -> 3 -> 4 -> 5	-
Single-example (L=2)	Amplification Lower Bound (Ratio)	1.0	> 10.0	-
Real-world verification on MNIST using Grad-CAM.
MNIST	Grad-CAM Heatmap Focus	Dominant digit pixels	Background / Minor regions	-

Experiment Figures

Heatmap of the dominant coordinate index for L2-SAM across time (x-axis) and initialization scale (y-axis).

Grad-CAM visualizations for CNNs trained on MNIST using GD vs L2-SAM.

Main Takeaways

Depth-1: SAM and GD share identical implicit bias (L2 max-margin).
Depth-2 (L-infinity SAM): Convergence direction depends critically on initialization relative to perturbation radius; can converge to minor features.
Depth-2 (L2-SAM): Asymptotically aligns with GD (L1 max-margin) if loss vanishes, but exhibits 'Sequential Feature Amplification' in finite time.
Sequential Feature Amplification means the model learns minor features first, then intermediate, then major. Lower initialization scale exacerbates this effect.
Finite-time analysis is essential for SAM because the infinite-time limit hides the distinct trajectory that likely affects generalization in practice.

📚 Prerequisite Knowledge

Prerequisites

Gradient Descent (GD) dynamics
Implicit Bias / Regularization
Linear Diagonal Networks
Logistic Loss / Max-margin classifiers

Key Terms

SAM: Sharpness-Aware Minimization, an optimization algorithm that minimizes the worst-case loss within a local neighborhood of the parameters in order to improve generalization

Linear Diagonal Network: A simplified neural network architecture where the predictor is the element-wise product of weight vectors from L layers (beta = w(1) ⊙ ... ⊙ w(L))

Implicit Bias: The tendency of an optimization algorithm (like GD or SAM) to converge to a specific solution (e.g., minimum norm) among many possible solutions that fit the data equally well

L2-SAM: A variant of SAM where the local neighborhood perturbation is constrained by the L2 norm

L-infinity SAM: A variant of SAM where the local neighborhood perturbation is constrained by the L-infinity norm

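Max-margin classifier: A classifier that maximizes the distance (margin) to the nearest data point; the L2 max-margin solution minimizes the L2 norm of the weights subject to every point being classified with margin at least 1, and the L1 max-margin solution minimizes the L1 norm under the same constraint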
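Written out, the two max-margin problems differ only in the norm being minimized. The block below gives the standard formulation and works it through for the single example (mu, +1) with mu = (1, 2) used elsewhere in this summary.

```latex
% Standard max-margin formulations; the worked case uses mu = (1, 2), y = +1.
\begin{align*}
\beta^{\star}_{\ell_2} &= \arg\min_{\beta} \|\beta\|_2
  \quad \text{s.t.} \quad y_i \langle \beta, x_i \rangle \ge 1 \ \ \forall i \\
\beta^{\star}_{\ell_1} &= \arg\min_{\beta} \|\beta\|_1
  \quad \text{s.t.} \quad y_i \langle \beta, x_i \rangle \ge 1 \ \ \forall i \\[4pt]
\text{For } \mu = (1, 2):\quad
\beta^{\star}_{\ell_2} &= \frac{\mu}{\|\mu\|_2^2} = \left(\tfrac{1}{5}, \tfrac{2}{5}\right),
\qquad
\beta^{\star}_{\ell_1} = \left(0, \tfrac{1}{2}\right) \propto e_2
\end{align*}
```

The L1 solution puts all of its weight on the coordinate with the largest feature value, which is why the L1 max-margin direction corresponds to the major feature in this setting.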

Sequential Feature Amplification: The phenomenon observed in this paper where L2-SAM amplifies minor input features early in training before eventually shifting focus to major features

Rescaled flow: A time-reparameterized continuous-time formulation of the optimization dynamics that simplifies analysis by removing the scalar loss derivative term
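For a single example (mu, +1), the idea behind such a rescaling can be sketched as follows. This is an illustrative reconstruction under the first-order perturbation assumption, and the paper's exact reparameterization may differ.

```latex
% Sketch: time change that absorbs the scalar loss derivative (single example, y = +1).
\begin{align*}
\dot{\theta}(t) &= -\ell'\!\big(\langle \beta(\hat{\theta}), \mu \rangle\big)\,
  \nabla_{\theta} \langle \beta(\theta), \mu \rangle \Big|_{\theta = \hat{\theta}},
  \qquad \hat{\theta} = \theta + \epsilon_p(\theta) \\
\tau(t) &= \int_0^t -\ell'\!\big(\langle \beta(\hat{\theta}(s)), \mu \rangle\big)\, ds,
  \qquad -\ell' > 0 \ \Rightarrow\ \tau \text{ strictly increasing} \\
\frac{d\theta}{d\tau} &= \nabla_{\theta} \langle \beta(\theta), \mu \rangle \Big|_{\theta = \hat{\theta}}
\end{align*}
```

Because the logistic loss derivative is a strictly negative scalar, it changes only the speed along the trajectory and not its path. With one example, the normalized or sign-based perturbation also does not depend on the magnitude of that scalar, so the rescaled dynamics are fully determined by the network's geometry.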

Directional Convergence: When the parameter vector aligns with a specific direction as its magnitude grows to infinity