Riemannian Preconditioned LoRA for Fine-Tuning Foundation Models

📝 Paper Summary

Parameter-Efficient Fine-Tuning (PEFT) Optimization for Large Language Models

The paper introduces a computationally cheap preconditioner for LoRA based on Riemannian geometry that stabilizes training and eliminates the need for manual learning rate tuning between low-rank matrices.

Core Problem

Standard LoRA training can be unstable and requires careful hyperparameter tuning because the two low-rank matrices (A and B) require significantly different learning rates to learn features effectively.

Why it matters:

Tuning learning rates for large foundation models is computationally expensive and time-consuming
Existing heuristics (like setting one learning rate much larger than the other) are brittle and lack theoretical guarantees
Unstable feature learning can lead to poor convergence or degraded performance in downstream tasks

Concrete Example: In standard LoRA, using the same learning rate for matrices A and B can make feature learning inefficient or cause it to vanish. Recent work (LoRA+) suggests setting the learning rate for B to about $16\times$ that of A, but this ratio is a heuristic. This paper's method works with equal learning rates.

Key Novelty

Riemannian Preconditioned LoRA

Treats the optimization of LoRA parameters not as standard Euclidean optimization, but as optimization on a low-rank matrix manifold using a novel Riemannian metric
Derives a specific $r \times r$ preconditioner (where $r$ is the small rank) that scales gradients to correct the imbalance between matrices A and B
The preconditioner acts as a gradient projector, effectively aligning updates with the column space of B and row space of A

Architecture

Pseudocode for the Riemannian Preconditioned LoRA (Scaled AdamW) optimization step.

Evaluation Highlights

Achieves consistent convergence and performance improvement across GPT-2, RoBERTa, and Llama-2-7b fine-tuning tasks compared to standard LoRA and LoRA+
Significantly improves image generation quality and training stability in text-to-image diffusion models compared to unscaled AdamW
Eliminates the need for separate learning rate tuning: achieves stable learning with equal learning rates for A and B, unlike standard LoRA which requires $\eta_B \gg \eta_A$

Breakthrough Assessment

7/10

Provides a solid theoretical grounding (Riemannian geometry) for a practical problem in PEFT. The resulting algorithm is simple, adds only small $r \times r$ matrix operations per step, and is robust to the choice of learning rate, though it refines existing LoRA rather than proposing a new paradigm.

⚙️ Technical Details

Problem Definition

Setting: Fine-tuning a pre-trained weight matrix $W$ of dimension $m \times n$ by learning low-rank updates $A$ and $B$

Inputs: Pre-trained model weights, training data (text or image-text pairs)

Outputs: Fine-tuned low-rank matrices $A$ ($r \times n$) and $B$ ($m \times r$)

Pipeline Flow

Compute Gradients for LoRA matrices A and B
Compute Preconditioners ($r \times r$ matrices)
Scale Gradients using Preconditioners
Update Weights (Standard Optimizer Step)
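
The four steps above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function name `scaled_gd_step`, the damping constant `delta`, the toy least-squares loss, and all sizes are assumptions introduced here, and plain gradient descent stands in for the optimizer step.

```python
import numpy as np


def scaled_gd_step(A, B, grad_A, grad_B, lr=1e-3, delta=1e-6):
    """One preconditioned step for a LoRA pair (B: m x r, A: r x n)."""
    r = A.shape[0]
    damping = delta * np.eye(r)  # assumed small damping term
    # Step 2: the two r x r preconditioners are inverses of these Gram matrices.
    gram_B = B.T @ B + damping
    gram_A = A @ A.T + damping
    # Step 3: scale the gradients; solve() avoids forming explicit inverses.
    scaled_grad_A = np.linalg.solve(gram_B, grad_A)      # (B^T B)^{-1} grad_A
    scaled_grad_B = np.linalg.solve(gram_A, grad_B.T).T  # grad_B (A A^T)^{-1}
    # Step 4: plain gradient-descent update with one shared learning rate.
    return A - lr * scaled_grad_A, B - lr * scaled_grad_B


def toy_loss_and_grads(A, B, target):
    """Step 1 on a toy loss 0.5 * ||B A - target||_F^2 (hypothetical)."""
    residual = B @ A - target
    return 0.5 * np.sum(residual ** 2), B.T @ residual, residual @ A.T


rng = np.random.default_rng(0)
m, n, r = 20, 15, 4
target = rng.normal(size=(m, r)) @ rng.normal(size=(r, n))
A = rng.normal(size=(r, n))
B = rng.normal(size=(m, r))

losses = []
for _ in range(20):
    loss, grad_A, grad_B = toy_loss_and_grads(A, B, target)
    losses.append(loss)
    A, B = scaled_gd_step(A, B, grad_A, grad_B, lr=1e-3)
```

Both factors share the same `lr`; the only difference from an unscaled step is the pair of `solve` calls.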

System Modules

Gradient Computation (Optimization)

Calculate standard gradients of the loss with respect to A and B

Model or implementation: Standard Backpropagation

Preconditioner Calculation (Optimization)

Compute the specific Riemannian scaling matrices

Model or implementation: Linear Algebra Operations

Gradient Scaling (Optimization)

Apply the preconditioners to the gradients

Model or implementation: Matrix Multiplication

Novel Architectural Elements

Injection of $r \times r$ Riemannian preconditioners into the standard SGD/Adam update loop for LoRA parameters
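
A rough sketch of how such a preconditioner might sit inside an AdamW-style update is given below. It assumes the preconditioner is applied to the raw gradient before the moment estimates are updated, which is one plausible placement and should be checked against Algorithm 1 in the paper. The function name `scaled_adamw_step`, the `state` dictionary layout, the `side` argument, and the default hyperparameters are all hypothetical.

```python
import numpy as np


def scaled_adamw_step(param, grad, precond, state, side, lr=1e-3,
                      betas=(0.9, 0.999), eps=1e-8, weight_decay=0.01):
    """AdamW-style step on one LoRA factor; `state` is updated in place.

    side="left":  precond @ grad  (factor A, precond ~ (B^T B + delta I)^{-1})
    side="right": grad @ precond  (factor B, precond ~ (A A^T + delta I)^{-1})
    """
    # Assumed placement: precondition first, then feed Adam's moments.
    scaled = precond @ grad if side == "left" else grad @ precond
    state["t"] += 1
    state["m"] = betas[0] * state["m"] + (1 - betas[0]) * scaled
    state["v"] = betas[1] * state["v"] + (1 - betas[1]) * scaled ** 2
    m_hat = state["m"] / (1 - betas[0] ** state["t"])  # bias correction
    v_hat = state["v"] / (1 - betas[1] ** state["t"])
    param = param * (1 - lr * weight_decay)            # decoupled weight decay
    return param - lr * m_hat / (np.sqrt(v_hat) + eps)


# Hypothetical usage on one LoRA pair (B: m x r, A: r x n).
rng = np.random.default_rng(0)
m, n, r, delta = 6, 5, 2, 1e-6
A, B = rng.normal(size=(r, n)), rng.normal(size=(m, r))
grad_A, grad_B = rng.normal(size=(r, n)), rng.normal(size=(m, r))  # stand-ins
state_A = {"t": 0, "m": np.zeros_like(A), "v": np.zeros_like(A)}
state_B = {"t": 0, "m": np.zeros_like(B), "v": np.zeros_like(B)}
# Both preconditioners are built from the current factors before either moves.
precond_A = np.linalg.inv(B.T @ B + delta * np.eye(r))
precond_B = np.linalg.inv(A @ A.T + delta * np.eye(r))
new_A = scaled_adamw_step(A, grad_A, precond_A, state_A, side="left")
new_B = scaled_adamw_step(B, grad_B, precond_B, state_B, side="right")
A, B = new_A, new_B
```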

Modeling

Base Model: Evaluated on GPT-2 (medium/large), RoBERTa-large, Llama-2-7b, and Stable Diffusion v1-4

Training Method: Riemannian Preconditioned LoRA (RP-LoRA)

Objective Functions:

Purpose: Minimize task-specific loss (e.g., cross-entropy or MSE) while stabilizing updates via preconditioning.

Formally: Update $A$ using direction $(B^TB)^{-1}\nabla_A\mathcal{L}$ and $B$ using direction $\nabla_B\mathcal{L}(AA^T)^{-1}$.
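
A short derivation shows why these directions act like projections. It introduces an auxiliary symbol $G$ for the gradient of the loss with respect to the product $BA$ and a step size $\eta$ applied to both factors, and it ignores damping and any LoRA scaling factor, so it should be read as a sketch under those assumptions. By the chain rule, $\nabla_A\mathcal{L} = B^T G$ and $\nabla_B\mathcal{L} = G A^T$, so the preconditioned steps are

$$
\Delta A = -\eta\,(B^TB)^{-1}B^T G, \qquad \Delta B = -\eta\,G A^T (AA^T)^{-1}.
$$

To first order in $\eta$, the product $BA$ then changes by

$$
B\,\Delta A + \Delta B\,A = -\eta\left[\,B(B^TB)^{-1}B^T\,G + G\,A^T(AA^T)^{-1}A\,\right] = -\eta\,(P_B\,G + G\,P_A),
$$

where $P_B = B(B^TB)^{-1}B^T$ is the orthogonal projector onto the column space of $B$ and $P_A = A^T(AA^T)^{-1}A$ is the orthogonal projector onto the row space of $A$.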

Adaptation: LoRA (Low-Rank Adaptation) with rank $r$ (typically 4 or 8)

Trainable Parameters: Matrices A and B only (rest of model frozen)

Key Hyperparameters:

learning_rate: Varied (robust to choices, e.g., 1e-3, 5e-4)
rank: Typically 4 or 8
batch_size: Varied by task
weight_decay: 0.01 (often)

Compute: Negligible overhead compared to standard LoRA; inversion of $r \times r$ matrix is fast
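
A back-of-the-envelope count makes the overhead claim concrete. The sizes below ($m = n = 4096$, $r = 8$) and the helper names are assumptions chosen for illustration, and the counts are rough multiply-add tallies with constant factors omitted, not measurements from the paper.

```python
def preconditioner_cost(m, n, r):
    """Rough multiply-add count per optimizer step for one LoRA pair."""
    gram = m * r * r + n * r * r    # forming B^T B and A A^T
    invert = 2 * r ** 3             # two r x r inversions, O(r^3) each
    apply = r * r * n + m * r * r   # scaling grad_A (r x n) and grad_B (m x r)
    return gram + invert + apply


def dense_matvec_cost(m, n):
    """Multiply-adds to apply the frozen m x n weight to a single token."""
    return m * n


m = n = 4096
r = 8
extra = preconditioner_cost(m, n, r)   # 1,049,600
per_token = dense_matvec_cost(m, n)    # 16,777,216
ratio = extra / per_token
```

With these assumed sizes, the extra work per optimizer step is about 6% of what it takes to push one token through a single frozen $4096 \times 4096$ matrix, and a training step typically processes many tokens.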

Comparison to Prior Work

vs. LoRA: Introduces dynamic preconditioning matrices $(A A^T)^{-1}$ and $(B^T B)^{-1}$ to scale gradients
vs. LoRA+: Achieves stability via adaptive matrix preconditioning rather than static scalar learning rate ratios; theoretically derived from geometry rather than heuristics
vs. Natural Gradient Descent [not cited in paper]: NGD preconditions with the Fisher Information Matrix (an approximation to the Hessian), which is $d \times d$ for a model with $d$ trainable parameters, whereas this method uses small $r \times r$ preconditioners specific to the low-rank structure

Limitations

Requires inverting matrices $A A^T$ and $B^T B$; if these become singular/ill-conditioned, regularization (damping) is needed
Theoretical analysis assumes infinite-width networks, though empirical results hold for finite models
Evaluation limited to standard NLP and Vision fine-tuning benchmarks; extreme low-rank or high-rank behavior not fully explored
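
The singularity issue in the first limitation shows up immediately under the common LoRA initialization, where $B$ starts at zero. The sketch below is illustrative only: `damped_inverse` and the value of `delta` are assumptions introduced here, not names or settings taken from the paper.

```python
import numpy as np


def damped_inverse(gram, delta=1e-6):
    """Inverse of (gram + delta * I); delta > 0 keeps it well defined."""
    r = gram.shape[0]
    return np.linalg.inv(gram + delta * np.eye(r))


m, r = 16, 4
B = np.zeros((m, r))   # common LoRA initialization: B starts at zero
gram_B = B.T @ B       # all-zero r x r matrix, exactly singular

try:
    np.linalg.inv(gram_B)
    undamped_ok = True
except np.linalg.LinAlgError:
    undamped_ok = False

precond = damped_inverse(gram_B, delta=1e-6)   # equals I / delta here
```

When $B = 0$ the gradient with respect to $A$ is also zero by the chain rule, so the large factor $1/\delta$ multiplies a zero gradient at that point; the choice of `delta` matters more once $B^TB$ is nonzero but still ill-conditioned.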

Reproducibility

Code: https://github.com/pilancilab/Riemannian_Preconditioned_LoRA

The paper provides theoretical proofs in its appendices and implementation details (Algorithm 1) for integrating the preconditioner into AdamW.

📊 Experiments & Results

Evaluation Setup

Fine-tuning Large Language Models and Diffusion Models on downstream tasks

Benchmarks:

E2E: Natural Language Generation (data-to-text)
WebNLG: Natural Language Generation (data-to-text)
DART: Natural Language Generation (data-to-text)
GLUE (MNLI, SST-2, CoLA, QNLI): Natural Language Understanding
Pokemon Blip Captions: Text-to-Image Generation

Metrics:

BLEU
NIST
METEOR
ROUGE-L
CIDEr
Accuracy
Training Loss
FID (Fréchet Inception Distance)
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
GPT-2 Medium fine-tuning on E2E dataset shows RP-LoRA generally matching or exceeding baselines while being more robust.
E2E (GPT-2 Medium)	BLEU	68.2	68.9	+0.7
E2E (GPT-2 Medium)	NIST	8.62	8.74	+0.12
GLUE benchmark results with RoBERTa-large show competitive performance.
MNLI (RoBERTa-large)	Accuracy	90.6	90.6	0.0
SST-2 (RoBERTa-large)	Accuracy	94.8	95.1	+0.3
Diffusion model fine-tuning demonstrates significant convergence speedups.
Pokemon Blip Captions	Training Loss (approx at step 500)	0.15	0.05	-0.10

Experiment Figures

Comparison of image generation quality (Stable Diffusion fine-tuning) between AdamW and Scaled AdamW at different learning rates.

Training loss curves for diffusion model fine-tuning.

Main Takeaways

Preconditioning enables stable feature learning with equal learning rates for A and B, eliminating the need for the $\eta_B \gg \eta_A$ heuristic recommended by LoRA+.
The method is robust to learning rate variations; Figure 1 shows consistent image generation quality across different learning rates where unscaled optimization fails.
Computational overhead is negligible; runtimes for scaled vs unscaled optimizers are nearly identical (e.g., fine-tuning GPT-2).
Gradient scaling effectively projects updates onto the row space of A and column space of B, approximating full fine-tuning geometry.
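
The projection claim in the last takeaway can be checked numerically. In the sketch below, `G` is a random stand-in for the gradient with respect to the product $BA$, damping is omitted, and all sizes and variable names are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 12, 10, 3
B = rng.normal(size=(m, r))
A = rng.normal(size=(r, n))
G = rng.normal(size=(m, n))   # stand-in gradient w.r.t. the product B @ A

# Chain rule gives the factor gradients; then apply the r x r preconditioners.
grad_A = B.T @ G
grad_B = G @ A.T
scaled_grad_A = np.linalg.solve(B.T @ B, grad_A)      # (B^T B)^{-1} grad_A
scaled_grad_B = np.linalg.solve(A @ A.T, grad_B.T).T  # grad_B (A A^T)^{-1}

# Orthogonal projectors onto the column space of B and the row space of A.
P_B = B @ np.linalg.solve(B.T @ B, B.T)
P_A = A.T @ np.linalg.solve(A @ A.T, A)

# First-order change of the product B @ A per unit step size (sign dropped).
first_order_change = B @ scaled_grad_A + scaled_grad_B @ A
matches_projection = np.allclose(first_order_change, P_B @ G + G @ P_A)
```

Here `P_B` and `P_A` are idempotent with trace `r`, so the scaled update only sees the parts of `G` that lie in the two rank-`r` subspaces.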

📚 Prerequisite Knowledge

Prerequisites

Low-Rank Adaptation (LoRA)
Gradient Descent / Adam Optimizer
Basic Matrix Algebra (SVD, rank)
Riemannian Geometry concepts (manifolds, tangent spaces)

Key Terms

LoRA: Low-Rank Adaptation—a technique to fine-tune large models by freezing weights and training small rank-decomposition matrices

Preconditioner: A transformation applied to gradients (usually multiplying by a matrix) to improve the convergence speed and stability of optimization

Riemannian optimization: Optimization techniques that respect the geometry of the underlying curved space (manifold) where parameters live, rather than assuming a flat Euclidean space

Quotient manifold: A type of manifold where points that represent the same object (e.g., due to rotation invariance) are treated as equivalent
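
For a LoRA pair, one natural reading of this equivalence is that $(B, A)$ and $(BQ, Q^{-1}A)$ give the same product for any invertible $r \times r$ matrix $Q$. The sketch below, with hypothetical helper names and a random stand-in gradient `G` with respect to the product, checks that an undamped preconditioned step produces the same updated product for both factorizations while a plain gradient step does not.

```python
import numpy as np


def scaled_step(A, B, G, lr):
    """Preconditioned step; G is the gradient w.r.t. the product B @ A."""
    new_A = A - lr * np.linalg.solve(B.T @ B, B.T @ G)
    new_B = B - lr * np.linalg.solve(A @ A.T, A @ G.T).T
    return new_A, new_B


def plain_step(A, B, G, lr):
    """Unscaled gradient step on both factors."""
    return A - lr * (B.T @ G), B - lr * (G @ A.T)


rng = np.random.default_rng(0)
m, n, r = 8, 6, 2
B = rng.normal(size=(m, r))
A = rng.normal(size=(r, n))
G = rng.normal(size=(m, n))
Q = np.array([[2.0, 1.0], [0.0, 0.5]])   # an arbitrary invertible r x r matrix
B2, A2 = B @ Q, np.linalg.solve(Q, A)    # same product: B2 @ A2 == B @ A

A_s, B_s = scaled_step(A, B, G, lr=0.1)
A2_s, B2_s = scaled_step(A2, B2, G, lr=0.1)
A_p, B_p = plain_step(A, B, G, lr=0.1)
A2_p, B2_p = plain_step(A2, B2, G, lr=0.1)

# Largest entrywise difference between the two updated products.
scaled_gap = np.abs(B_s @ A_s - B2_s @ A2_s).max()
plain_gap = np.abs(B_p @ A_p - B2_p @ A2_p).max()
```

`scaled_gap` stays at floating-point noise, whereas `plain_gap` depends on the choice of `Q`.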

Scaled GD: Gradient Descent where the gradient is scaled by a preconditioner matrix before the update step

Infinite-width NN: A theoretical framework for analyzing neural networks where the number of neurons in hidden layers approaches infinity, used to study convergence properties

Stable feature learning: A regime where neural network updates and outputs stay at a constant order of magnitude (neither growing nor shrinking) as the network width increases, preventing exploding or vanishing signals