GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection

📝 Paper Summary

Memory-Efficient Training, Parameter-Efficient Fine-Tuning (PEFT)

GaLore reduces memory usage by projecting weight gradients into a low-rank form for the optimizer while keeping the main model weights full-rank, enabling 7B model training on consumer GPUs.

Core Problem

Training LLMs requires massive memory for optimizer states and gradients (often 2-3x the model size), which puts pre-training out of reach of consumer hardware. Existing solutions like LoRA restrict the parameter search space to low-rank updates, which can hurt pre-training performance.

Why it matters:

Pre-training a LLaMA 7B model typically requires at least ~58GB of memory (14GB for weights, 42GB for Adam optimizer states and weight gradients, and 2GB for activations), exceeding the 24GB capacity of consumer GPUs like the RTX 4090.
Low-rank adaptation methods (LoRA) change training dynamics and often underperform full-rank training or require a full-rank warm-up phase.

Concrete Example: Training a 7B-parameter model with Adam requires storing first-moment (momentum) and second-moment (variance) tensors, each the same size as the model. On a 24GB NVIDIA RTX 4090 this runs out of memory before training can start, whereas GaLore compresses these states enough to fit.
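The roughly 58GB figure can be reproduced with simple arithmetic. The sketch below assumes exactly 7e9 parameters, BF16 (2 bytes per value) for the weights, the gradients, and both Adam moments, and about 2GB of activations. The function name is hypothetical, and the gradient and activation terms are assumptions chosen to fit the stated total.

```python
BYTES_PER_VALUE_BF16 = 2  # assumption: everything is stored in BF16


def training_memory_gb(n_params, activations_gb=2.0):
    """Rough memory breakdown (in GB) for full-rank Adam training."""
    weights = n_params * BYTES_PER_VALUE_BF16 / 1e9
    gradients = weights          # one gradient value per weight
    adam_states = 2 * weights    # first moment + second moment
    total = weights + gradients + adam_states + activations_gb
    return {
        "weights": weights,
        "gradients": gradients,
        "adam_states": adam_states,
        "activations": activations_gb,
        "total": total,
    }


breakdown = training_memory_gb(7e9)
print(breakdown["weights"])                               # 14.0
print(breakdown["gradients"] + breakdown["adam_states"])  # 42.0
print(breakdown["total"])                                 # 58.0
```

Under these assumptions the optimizer states and gradients account for about three quarters of the total, which is why compressing them is the main lever.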

Key Novelty

Gradient Low-Rank Projection (GaLore)

Instead of restricting the weight matrices to be low-rank (like LoRA), GaLore observes that the *gradients* naturally become low-rank during training.
It projects the gradient matrix into a small low-rank subspace before the optimizer step, drastically reducing the size of optimizer states (momentum/variance), then projects the update back to full rank for the weight update.
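A minimal NumPy sketch of one such step is shown below. It is not the authors' implementation: it projects only from the left (using P and omitting Q), takes P from the top-r left singular vectors of the current gradient, and uses a textbook Adam update. All function names, shapes, and default values are illustrative assumptions.

```python
import numpy as np


def top_r_left_singular_vectors(grad, rank):
    """Return an (m, rank) orthonormal basis for the dominant gradient directions."""
    u, _, _ = np.linalg.svd(grad, full_matrices=False)
    return u[:, :rank]


def galore_adam_step(weight, grad, proj, state, lr=1e-3,
                     betas=(0.9, 0.999), eps=1e-8, scale=1.0):
    """One Adam step whose moments live in the projected (rank x n) space."""
    low_rank_grad = proj.T @ grad                      # (r, n) instead of (m, n)
    state["step"] += 1
    b1, b2 = betas
    state["m"] = b1 * state["m"] + (1 - b1) * low_rank_grad
    state["v"] = b2 * state["v"] + (1 - b2) * low_rank_grad ** 2
    m_hat = state["m"] / (1 - b1 ** state["step"])     # bias correction
    v_hat = state["v"] / (1 - b2 ** state["step"])
    low_rank_update = m_hat / (np.sqrt(v_hat) + eps)
    weight -= lr * scale * (proj @ low_rank_update)    # project back to (m, n)
    return weight


rng = np.random.default_rng(0)
m, n, r = 64, 32, 4
W = rng.standard_normal((m, n))
G = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))  # toy rank-r gradient
P = top_r_left_singular_vectors(G, r)
state = {"step": 0, "m": np.zeros((r, n)), "v": np.zeros((r, n))}
W_new = galore_adam_step(W.copy(), G, P, state)
print(state["m"].size, W.size)  # 128 2048
```

In this toy setting each Adam moment holds 128 values instead of 2048, while the weight matrix itself stays full-size and full-rank.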

Architecture

The step-by-step training loop of GaLore.

Evaluation Highlights

Reduces optimizer state memory by up to 65.5% compared to full-rank BF16 baselines during LLaMA pre-training.
Enables pre-training a LLaMA 7B model on a single 24GB consumer GPU (NVIDIA RTX 4090) without model parallelism or offloading.
Achieves a GLUE score of 85.89 when fine-tuning RoBERTa-Base, outperforming LoRA's score of 85.61.

Breakthrough Assessment

9/10

Significantly lowers the barrier to entry for LLM pre-training by making it feasible on consumer hardware, without the performance compromises common in prior low-rank methods like LoRA.

⚙️ Technical Details

Problem Definition

Setting: LLM Pre-training and Fine-tuning with limited GPU memory

Inputs: Training corpus (e.g., C4) or downstream task data (e.g., GLUE)

Outputs: Updated full-rank model weights W

Pipeline Flow

Standard Transformer Architecture (Inference/Forward Pass is unchanged)

System Modules

Transformer Layer

Standard processing unit of the LLM

Model or implementation: LLaMA (1B/7B) or RoBERTa

Novel Architectural Elements

No architectural changes to the model inference pipeline.
Novel training-time architecture: Dynamic injection of Projection Matrices (P and Q) into the optimizer update step.

Modeling

Base Model: LLaMA 1B, LLaMA 7B, RoBERTa-Base

Training Method: GaLore (Gradient Low-Rank Projection)

Objective Functions:

Purpose: Pre-training language modeling.

Formally: Minimize negative log-likelihood of next token.
Purpose: Optimize weights via projected gradients.

Formally: Update rule involves projecting gradient G into P^T G Q, updating optimizer states on this small matrix, then projecting back.
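Written out, the projected update might look as follows. This is a sketch in assumed notation: G_t is the gradient of the loss with respect to an m×n weight matrix W_t, P_t (m×r) and Q_t (n×r) are the projection matrices, ρ_t stands for the entry-wise optimizer rule (for example Adam's moment normalization), η is the learning rate, and α is the scale factor. The symbols R_t, N_t, ρ_t, and η are introduced here for illustration.

```latex
\begin{aligned}
R_t &= P_t^{\top} G_t \, Q_t            && \text{project the gradient to an } r \times r \text{ matrix} \\
N_t &= \rho_t(R_t)                      && \text{optimizer states are kept only for } R_t \\
\tilde{G}_t &= \alpha \, P_t \, N_t \, Q_t^{\top} && \text{project the update back to } m \times n \\
W_{t+1} &= W_t - \eta \, \tilde{G}_t    && \text{full-rank weight update}
\end{aligned}
```

An implementation may apply only one of the two projections (for example, only P_t when m ≤ n), in which case R_t has shape r×n and Q_t is dropped.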

Training Data:

C4 dataset (up to 19.7B tokens for pre-training)
GLUE benchmark (for fine-tuning)

Key Hyperparameters:

GaLore rank: Specify rank r (e.g., r=4 for RoBERTa fine-tuning)
Update Frequency: Subspace update frequency T (e.g., every 200 iterations)
Scale factor: alpha (similar to LoRA scaling)
Optimizers: AdamW, 8-bit Adam, Adafactor
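To see how the rank r, the update frequency T, and the scale factor alpha interact, the toy loop below minimizes 0.5·||W − W_target||² with projected updates and refreshes the projector every T steps. Plain gradient descent stands in for Adam to keep the sketch short, the projection is one-sided, and the objective, function names, and values are all illustrative assumptions rather than settings from the paper.

```python
import numpy as np


def refresh_projector(grad, rank):
    """Top-`rank` left singular vectors of the current gradient."""
    u, _, _ = np.linalg.svd(grad, full_matrices=False)
    return u[:, :rank]


def train_toy(W0, W_target, rank=2, update_freq=5, alpha=0.5, lr=1.0, steps=30):
    W = W0.copy()
    P = None
    losses = []
    for t in range(steps):
        G = W - W_target                      # gradient of 0.5 * ||W - W_target||_F^2
        losses.append(0.5 * np.sum(G ** 2))
        if t % update_freq == 0:              # switch subspace every `update_freq` steps
            P = refresh_projector(G, rank)
        R = P.T @ G                           # low-rank gradient, shape (rank, n)
        W -= lr * alpha * (P @ R)             # scaled update projected back to (m, n)
    return W, losses


rng = np.random.default_rng(0)
W0 = np.zeros((8, 6))
W_target = rng.standard_normal((8, 6))
W_final, losses = train_toy(W0, W_target)
print(losses[0] > losses[-1])                       # True
print(np.linalg.matrix_rank(W_final - W0) > 2)      # True
```

Each individual update has rank at most r, but because the subspace changes every T steps, the accumulated change to W ends up with rank above r. A smaller T tracks the gradient more closely at the cost of more SVD calls.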

Compute: Feasible on NVIDIA RTX 4090 (24GB) for LLaMA 7B pre-training.

Comparison to Prior Work

vs. LoRA: GaLore updates the full-rank weights directly and can cover the full parameter space over time by periodically switching subspaces, whereas LoRA freezes the base weights and is confined to a low-rank update of fixed rank.
vs. ReLoRA: GaLore does not require a full-rank warm-up phase.
vs. Gradient Checkpointing: GaLore reduces optimizer memory, whereas checkpointing reduces activation memory (orthogonal and compatible techniques).
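The optimizer-state savings can be estimated by counting stored values for a single m×n weight matrix. The sketch below assumes Adam (two moments per trained value), a one-sided GaLore projector of shape m×r with m ≤ n, and LoRA factors B (m×r) and A (r×n). The function names and the 4096×4096, r=128 example are hypothetical.

```python
def full_rank_adam_state_elems(m, n):
    # two Adam moments, each the size of the weight matrix
    return 2 * m * n


def galore_state_elems(m, n, r):
    # projector P (m x r) plus two Adam moments for the projected gradient (r x n each)
    return m * r + 2 * n * r


def lora_state_elems(m, n, r):
    # two Adam moments for each of the adapters B (m x r) and A (r x n)
    return 2 * (m * r + n * r)


m, n, r = 4096, 4096, 128
print(full_rank_adam_state_elems(m, n))  # 33554432
print(galore_state_elems(m, n, r))       # 1572864
print(lora_state_elems(m, n, r))         # 2097152
```

Under these assumptions LoRA additionally stores the adapter weights themselves (m·r + n·r values), whereas GaLore adds no extra trainable parameters.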

Limitations

Updating the projection matrices (P and Q) via SVD incurs a small computational overhead.
Requires selecting the rank hyperparameter r.
Theoretical convergence proofs rely on specific assumptions about gradient structure (e.g., reversible networks).

Reproducibility

The paper states that code is available at a linked repository. It also provides a detailed algorithm listing (Algorithm 1) and hyperparameter settings for reproduction. LLaMA and C4 are public.

📊 Experiments & Results

Evaluation Setup

Pre-training from scratch on C4 and Fine-tuning on GLUE.

Benchmarks:

C4: language modeling (pre-training)
GLUE: natural language understanding (fine-tuning)

Metrics:

Perplexity
GLUE Average Score
Optimizer Memory Usage (GB)
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Fine-tuning results on GLUE: GaLore scores slightly higher on average than the LoRA baseline.
GLUE (RoBERTa-Base)	Average Score	85.61	85.89	+0.28
Memory efficiency results: reductions in optimizer state and total training memory relative to a BF16 full-rank baseline.
LLaMA Pre-training	Optimizer Memory Reduction (%, up to)	0	65.5	+65.5
LLaMA Pre-training	Total Training Memory Reduction (%, 8-bit GaLore)	0	63.3	+63.3

Experiment Figures

Feasibility of training LLaMA 7B on consumer GPUs.

Performance compatibility with different optimizers.

Main Takeaways

GaLore allows full-parameter learning while keeping a memory footprint comparable to that of low-rank methods.
8-bit GaLore combined with layer-wise updates brings optimizer state memory cost down to less than 10% of the baseline.
The method is optimizer-agnostic and works with AdamW, 8-bit Adam, and Adafactor.
Unlike ReLoRA, GaLore maintains low memory usage throughout the entire training process without needing high-memory warm-up phases.

📚 Prerequisite Knowledge

Prerequisites

Backpropagation
Adam Optimizer
Low-Rank Matrix Approximation (SVD)
Gradient Descent

Key Terms

Optimizer States: Auxiliary data stored by algorithms like Adam (e.g., momentum, variance) to guide training; for Adam, two values per parameter, so often larger than the model itself

LoRA: Low-Rank Adaptation—a technique that freezes main weights and trains small rank-decomposition matrices

SVD: Singular Value Decomposition—a mathematical method to factorize a matrix, used here to find the principal directions of the gradient

BF16: Brain Floating Point 16—a reduced-precision numerical format widely used in deep learning

Subspace Learning: Optimizing model weights within a lower-dimensional space rather than the full parameter space

Reversible Networks: Neural network architectures where inputs can be reconstructed from outputs, allowing specific gradient structure analysis

PSD: Positive Semi-Definite, a property of symmetric matrices whose eigenvalues are all non-negative; the paper's analysis of gradient structure assumes certain matrices in the gradient expression are PSD