Evaluation Setup
Fine-tuning is evaluated on downstream tasks spanning commonsense and mathematical reasoning, machine translation, and subject-driven image generation
Benchmarks:
- Commonsense Reasoning (8 tasks: BoolQ, PIQA, SIQA, etc.)
- GSM8K / MATH (Mathematical Reasoning)
- IWSLT14 (De-En, En-De) (Machine Translation)
- DreamBooth (Subject-driven Image Generation)
Metrics:
- Accuracy
- BLEU
- CLIP Score
- Statistical methodology: Not explicitly reported in the paper
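Of the metrics above, BLEU (used for the IWSLT14 translation results) can be illustrated with a minimal sentence-level sketch. This is toy code for intuition only, not the paper's evaluation pipeline, which presumably uses corpus-level tooling:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypothesis, reference, max_n=4):
    """Sentence-level BLEU: geometric mean of modified n-gram
    precisions (uniform weights) times a brevity penalty."""
    hyp, ref = hypothesis.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_ngrams, ref_ngrams = ngrams(hyp, n), ngrams(ref, n)
        overlap = sum((hyp_ngrams & ref_ngrams).values())  # clipped counts
        total = max(sum(hyp_ngrams.values()), 1)
        if overlap == 0:
            return 0.0
        log_precisions.append(math.log(overlap / total))
    bp = min(1.0, math.exp(1 - len(ref) / len(hyp)))  # brevity penalty
    return bp * math.exp(sum(log_precisions) / max_n)

# A perfect match scores 1.0 (often reported as 100).
print(bleu("the cat sat on the mat", "the cat sat on the mat"))  # -> 1.0
```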
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
| --- | --- | --- | --- | --- |
| GSM8K | Accuracy | 78.2 | 78.6 | +0.4 |
| MATH | Accuracy | 39.1 | 39.5 | +0.4 |
| Commonsense Avg (8 tasks) | Accuracy | 69.1 | 69.6 | +0.5 |
| DreamBooth | CLIP Score | 29.8 | 30.4 | +0.6 |
Notes:
- Mathematical reasoning results (GSM8K, MATH) on Llama-3-8B show HOFT and SHOFT outperforming LoRA and DoRA baselines.
- Commonsense reasoning averages across 8 datasets use Llama-2-7B and show slight improvements.
- Subject-driven generation (DreamBooth) uses Stable Diffusion; HOFT achieves better CLIP alignment.
Main Takeaways
- HOFT and SHOFT consistently match or outperform LoRA and DoRA across reasoning, translation, and generation tasks.
- The use of two orthogonal matrices (HOFT) instead of one (OFT) is validated theoretically and empirically.
- The CWY-based parameterization provides significant speedups over Cayley-based OFT, making orthogonal fine-tuning practical for large models.
- SHOFT (with scaling) generally performs better than pure HOFT, confirming the importance of magnitude updates alongside directional updates.
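The CWY-based parameterization mentioned in the takeaways builds an orthogonal matrix from a product of Householder reflections in one batched expression instead of a sequential loop. Below is a minimal NumPy sketch of the general CWY identity (an illustration under the standard formulation, not the paper's exact implementation):

```python
import numpy as np

def householder_product_naive(U):
    """Sequential product of reflections H_i = I - 2 u_i u_i^T,
    where the unit-norm u_i are the columns of U."""
    n, k = U.shape
    Q = np.eye(n)
    for i in range(k):
        u = U[:, i:i + 1]
        Q = Q @ (np.eye(n) - 2.0 * u @ u.T)
    return Q

def householder_product_cwy(U):
    """CWY identity: H_1 ... H_k = I - U T^{-1} U^T,
    with T = 0.5 * I + strict_upper(U^T U). One solve replaces
    k sequential matrix products."""
    n, k = U.shape
    T = 0.5 * np.eye(k) + np.triu(U.T @ U, k=1)
    return np.eye(n) - U @ np.linalg.solve(T, U.T)

rng = np.random.default_rng(0)
U = rng.standard_normal((8, 3))
U /= np.linalg.norm(U, axis=0)  # normalize Householder vectors

Q = householder_product_cwy(U)
assert np.allclose(Q, householder_product_naive(U))  # same matrix
assert np.allclose(Q.T @ Q, np.eye(8))               # orthogonal
```

Because the product of reflections is expressed as a single low-rank correction to the identity, it parallelizes well on accelerators, which is consistent with the reported speedups over Cayley-based OFT.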