Evaluation Setup
Theoretical analysis of optimization dynamics, validated on synthetic and real binary classification tasks.
Benchmarks:
- Synthetic Data (Binary Classification with Optimal Tokens) [New]
Metrics:
- Parameter convergence rate (distance to max-margin solution)
- Attention map sparsification rate
- Training loss convergence rate
- Statistical methodology: mathematical proof (formal convergence guarantees rather than empirical significance tests)
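The first two metrics can be computed directly from a learned attention parameter. A minimal sketch, assuming a parameter vector `p`, a reference max-margin direction `p_mm`, and a token matrix `X` (all names are illustrative, not from the paper): convergence to the max-margin solution is measured in direction, since the parameter norm diverges, and sparsification is measured as the entropy of the softmax attention map, which tends to 0 as attention collapses onto a single token.

```python
import numpy as np

def direction_distance(p, p_mm):
    # Distance between unit directions; max-margin convergence is directional,
    # so compare p / ||p|| against the reference direction p_mm / ||p_mm||.
    return float(np.linalg.norm(p / np.linalg.norm(p) - p_mm / np.linalg.norm(p_mm)))

def softmax(z):
    z = z - z.max()          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def attention_entropy(X, p):
    # Entropy of the softmax attention map over tokens (rows of X);
    # approaches 0 as the map sparsifies onto one 'optimal' token.
    a = softmax(X @ p)
    return float(-(a * np.log(a + 1e-12)).sum())
```

Tracking `direction_distance` and `attention_entropy` over training steps yields the convergence and sparsification curves listed above.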
Main Takeaways
- Normalized Gradient Descent (NGD) enables global convergence to the max-margin solution for self-attention, removing the dependence on initialization seen in standard GD.
- The attention mechanism exhibits a strong implicit bias towards sparsity, selecting single 'optimal' tokens at a rate exponential in training time.
- Jointly training the decoder and attention weights still converges globally, but at a slower (logarithmic) rate owing to the non-smooth objective landscape.
- The theoretical rates align with the intuition that adaptive step sizes accelerate convergence in non-convex transformer landscapes.
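The first two takeaways can be illustrated end to end. Below is a minimal runnable sketch, not the paper's construction: it assumes a planted synthetic task where token 0 of each sequence carries a key feature `q` (which attention should select) and a value feature `label * v` (which a fixed decoder `v` reads out), logistic loss, and normalized gradient descent on the attention parameter `p`. The fixed-norm step is what makes progress independent of the (small, random) initialization.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Synthetic data (illustrative assumption): token 0 is the 'optimal' token,
# containing key direction q plus label-aligned value direction v; the
# remaining tokens are small-noise distractors.
d, T, n = 6, 5, 24
q = np.eye(d)[0]                      # key direction attention should learn
v = np.eye(d)[1]                      # value direction read by the fixed decoder
X, y = [], []
for _ in range(n):
    label = rng.choice([-1.0, 1.0])
    toks = rng.normal(scale=0.1, size=(T, d))
    toks[0] += q + label * v          # plant the optimal token
    X.append(toks)
    y.append(label)

def loss_and_grad(p):
    """Logistic loss of f(X) = v . (X^T softmax(X p)) and its gradient in p."""
    loss, grad = 0.0, np.zeros(d)
    for toks, label in zip(X, y):
        a = softmax(toks @ p)         # attention map over tokens
        s = toks @ v                  # per-token value scores
        out = a @ s                   # attention-weighted prediction
        m = label * out
        loss += np.log1p(np.exp(-m))
        # d(out)/dp = X^T (a * (s - out)): softmax Jacobian applied to s
        grad += (-label / (1.0 + np.exp(m))) * (toks.T @ (a * (s - out)))
    return loss / n, grad / n

p = 0.01 * rng.normal(size=d)         # small random initialization
eta = 0.5
loss0, _ = loss_and_grad(p)
for _ in range(300):
    _, g = loss_and_grad(p)
    p -= eta * g / (np.linalg.norm(g) + 1e-12)   # normalized GD step

# Attention should have sparsified onto the planted token in every sequence.
mean_attn0 = float(np.mean([softmax(toks @ p)[0] for toks in X]))
```

Because the NGD step has fixed parameter-space length, the attention logits grow steadily and the softmax map saturates onto the planted token, mirroring the exponential sparsification claim; with standard GD the vanishing gradient near saturation would slow this growth.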