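Optimizer States: Auxiliary per-parameter data stored by algorithms like Adam (e.g., first-moment "momentum" and second-moment "variance" estimates) to guide training; Adam keeps two such values per parameter, so its states take about twice the memory of the weights at the same precision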
LoRA: Low-Rank Adaptation, a technique that freezes the pretrained weights and trains a pair of small low-rank matrices whose product is added to them
SVD: Singular Value Decomposition—a mathematical method to factorize a matrix, used here to find the principal directions of the gradient
BF16: Brain Floating Point 16 (bfloat16), a 16-bit floating-point format that keeps FP32's 8-bit exponent but only 7 mantissa bits; widely used in deep learning
Subspace Learning: Optimizing model weights within a lower-dimensional space rather than the full parameter space
Reversible Networks: Neural network architectures where inputs can be reconstructed from outputs, allowing specific gradient structure analysis
PSD: Positive Semi-Definite, a property of symmetric matrices (covariance matrices are a standard example) whose eigenvalues are all non-negative
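
The entries for Optimizer States, SVD, and Subspace Learning fit together, and a small NumPy sketch may make the link concrete. It assumes the subspace is spanned by the top-r left singular vectors of a gradient matrix, which is one plausible reading of the SVD entry rather than a confirmed recipe. The function names `top_r_projector` and `adam_state_elements`, the matrix sizes, and the rank are all hypothetical choices for illustration.

```python
import numpy as np

def top_r_projector(grad, rank):
    """Return the first `rank` left singular vectors of `grad` (shape m x rank)."""
    u, _, _ = np.linalg.svd(grad, full_matrices=False)
    return u[:, :rank]

def adam_state_elements(shape):
    """Adam keeps two arrays (first and second moment) per parameter array."""
    return 2 * int(np.prod(shape))

# Hypothetical sizes: a 64 x 32 weight matrix and a rank-4 subspace.
rng = np.random.default_rng(0)
m, n, rank = 64, 32, 4
grad = rng.standard_normal((m, n))   # stand-in for a real gradient

P = top_r_projector(grad, rank)      # m x rank, orthonormal columns
low_rank_grad = P.T @ grad           # rank x n: gradient expressed in the subspace
approx = P @ low_rank_grad           # m x n: projected back to the full space

# Optimizer-state element counts if Adam runs on the full vs. projected gradient.
full_states = adam_state_elements((m, n))
subspace_states = adam_state_elements((rank, n))
print(full_states, subspace_states)
```

With these sizes the state count drops by a factor of m / rank, and `approx` is the best rank-4 approximation of `grad` in the Frobenius norm. Real gradients are often much closer to low rank than the random matrix used here, so the approximation error in this toy case is pessimistic.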