Spectral Optimization: Optimization methods that constrain update steps based on the spectral norm (largest singular value) of the update matrix, rather than element-wise norms
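A minimal sketch of the idea, assuming NumPy: the spectral norm is the largest singular value, and an update can be rescaled to cap that norm without changing its direction (`tau` is an illustrative threshold, not a recommended value):

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((64, 32))   # stand-in for an update matrix

spec = np.linalg.norm(M, 2)         # spectral norm: largest singular value
frob = np.linalg.norm(M)            # element-wise (Frobenius) norm
assert spec <= frob                 # spectral norm never exceeds Frobenius

# Cap the update's spectral norm at tau, leaving its direction unchanged.
tau = 1.0
if spec > tau:
    M = M * (tau / spec)
```

Constraining the spectral norm bounds how much the update can stretch any single input direction, which element-wise norms cannot guarantee.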
Newton-Schulz iteration: A matrix iteration method used to approximate the polar decomposition (or matrix sign function) of a matrix, projecting it onto the Stiefel manifold
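The classic cubic variant of this iteration can be sketched as follows (assuming NumPy; the step count and Frobenius normalization are illustrative, and practical optimizers such as Muon use tuned higher-order polynomials instead):

```python
import numpy as np

def newton_schulz(A, steps=30):
    """Approximate A's orthogonal polar factor via cubic Newton-Schulz."""
    X = A / np.linalg.norm(A)            # normalize so all singular values lie in (0, 1]
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X  # maps each singular value s -> 1.5s - 0.5s^3
    return X

rng = np.random.default_rng(0)
Q = newton_schulz(rng.standard_normal((32, 16)))
```

Each step pushes every singular value toward 1 while preserving the singular vectors, so the result has (approximately) orthonormal columns, i.e. lies on the Stiefel manifold.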
Shampoo: A second-order-style optimizer that approximates the full-matrix AdaGrad preconditioner (a curvature proxy built from accumulated gradient second-moment statistics, not the Hessian itself) using Kronecker-factored statistics (tensor products of smaller matrices) to capture parameter correlations
Whitening: A linear transformation that decorrelates data and normalizes its variance; here, transforming the gradient space so the local curvature becomes spherical
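A minimal PCA-whitening sketch, assuming NumPy (the data shape and random correlating matrix are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
# Correlated data: 1000 samples of a 3-D Gaussian with non-spherical covariance.
A = rng.standard_normal((3, 3))
X = rng.standard_normal((1000, 3)) @ A.T

# Whitening transform W = Lambda^(-1/2) Q^T from the covariance eigendecomposition.
X = X - X.mean(axis=0)
cov = X.T @ X / len(X)
w, Q = np.linalg.eigh(cov)
W = np.diag(w ** -0.5) @ Q.T
Z = X @ W.T                      # whitened: sample covariance is the identity
```

After the transform, every direction has unit variance and zero correlation, which is exactly the "spherical curvature" picture in the definition above.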
Isotropic vs. Anisotropic: Isotropic means properties are uniform in all directions; anisotropic means they vary by direction (e.g., curvature in neural nets is highly anisotropic)
Stiefel Manifold: The set of matrices with orthonormal columns; constraining updates here ensures directional stability
Kronecker-factored statistics: Approximating a large matrix (like the Hessian) as the Kronecker product of two smaller matrices to save memory and compute
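The savings come from never materializing the large matrix: the Kronecker identity (A ⊗ B) vec(X) = vec(A X Bᵀ) lets the small factors act directly. A sketch assuming NumPy, whose row-major `reshape(-1)` matches this form of the identity:

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 6, 5
A = rng.standard_normal((m, m))   # small left factor
B = rng.standard_normal((n, n))   # small right factor
X = rng.standard_normal((m, n))   # matrix the product acts on

# Dense Kronecker product: (m*n) x (m*n), quadratically larger than A and B.
K = np.kron(A, B)
assert K.shape == (m * n, m * n)

# vec trick: apply K using only the small factors.
dense    = K @ X.reshape(-1)
factored = (A @ X @ B.T).reshape(-1)
assert np.allclose(dense, factored)
```

For a layer with a 4096 x 4096 weight matrix, the dense matrix would have (4096²)² entries, while the two factors need only 4096² each.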
Muon: A momentum-orthogonalized optimizer that replaces each weight matrix's momentum with (an approximation of) its nearest semi-orthogonal matrix, computed via Newton-Schulz iterations, so every update step satisfies a spectral-norm constraint
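One update step can be sketched as below, assuming NumPy. An exact SVD stands in for the Newton-Schulz iteration purely for readability, and `lr` and `beta` are illustrative values, not the tuned ones:

```python
import numpy as np

def muon_step(W, grad, M, lr=0.02, beta=0.95):
    """One Muon-style step: accumulate momentum, then orthogonalize it."""
    M = beta * M + grad                       # momentum accumulation
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return W - lr * (U @ Vt), M               # U @ Vt: nearest semi-orthogonal matrix to M

rng = np.random.default_rng(1)
W = rng.standard_normal((16, 8))
W_new, M = muon_step(W, rng.standard_normal((16, 8)), np.zeros((16, 8)))
```

Because the applied step is semi-orthogonal, its spectral norm is exactly 1 regardless of the raw momentum's magnitude; only the learning rate sets the step size.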
SOAP: An optimizer that runs Adam (its momentum and adaptive per-coordinate step sizes) in the eigenbasis of Shampoo's Kronecker-factored preconditioner, refreshing that basis only periodically
Pareto frontier: The set of optimal trade-offs where no metric can be improved without degrading another (e.g., training speed vs. final loss)
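Extracting the frontier from a set of runs is a small computation; a sketch with hypothetical (training hours, final loss) pairs, both minimized:

```python
def pareto_frontier(points):
    """Keep points not dominated by any other (lower is better in both coordinates)."""
    return [p for p in points
            if not any(q[0] <= p[0] and q[1] <= p[1] and q != p for q in points)]

# (training hours, final loss): illustrative numbers only.
runs = [(1.0, 3.2), (2.0, 2.9), (2.5, 3.0), (4.0, 2.7)]
print(pareto_frontier(runs))  # -> [(1.0, 3.2), (2.0, 2.9), (4.0, 2.7)]
```

The run (2.5, 3.0) is dropped because (2.0, 2.9) is both faster and reaches a lower loss; every surviving point trades one metric for the other.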