Evaluation Setup
The evaluation fine-tunes pre-trained models on natural language understanding (NLU), natural language generation (NLG), instruction tuning, and image classification tasks.
Benchmarks:
- GLUE (Natural Language Understanding)
- E2E (Natural Language Generation)
- Instruction Tuning (LLaMA-2) (Instruction Following)
- Image Classification (ViT) (Computer Vision)
Metrics:
- Accuracy
- BLEU
- NIST
- ROUGE-L
- METEOR
- Statistical methodology: Not explicitly reported in the paper
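Of the metrics above, ROUGE-L is the one used for the instruction-tuning rows below; it is an F-measure over the longest common subsequence (LCS) between candidate and reference. A minimal sketch (real evaluations typically use a library such as `rouge-score`; the `beta` weighting here is an illustrative assumption):

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, tok_a in enumerate(a):
        for j, tok_b in enumerate(b):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if tok_a == tok_b
                                else max(dp[i][j + 1], dp[i + 1][j]))
    return dp[-1][-1]

def rouge_l(candidate, reference, beta=1.2):
    """ROUGE-L F-score between two whitespace-tokenized strings."""
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return (1 + beta**2) * precision * recall / (recall + beta**2 * precision)
```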
Key Results
| Benchmark | Metric | Baseline (LoRA) | This Paper (FourierFT) | Δ |
|---|---|---|---|---|
| Instruction Tuning (LLaMA-2-7B) | ROUGE-L | 42.0 | 42.4 | +0.4 |
| Instruction Tuning (LLaMA-2-7B) | ROUGE-L | 42.8 | 42.8 | 0.0 |
| GLUE (Avg) | Score | 87.52 | 87.77 | +0.25 |
| Image Classification (ViT-Base) | Accuracy | 0.68 | 0.76 | +0.08 |

Notes:
- Instruction tuning on LLaMA-2-7B: FourierFT matches or exceeds LoRA with drastically fewer parameters.
- GLUE: FourierFT matches LoRA performance with significantly fewer parameters.
- Computer vision: comparable or better accuracy with <10% of the parameters.
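A back-of-the-envelope comparison illustrates where the parameter gap comes from: LoRA stores two low-rank factors per adapted weight matrix, while FourierFT stores only a small set of spectral coefficients. The hidden size, rank `r`, and coefficient count `n` below are illustrative assumptions, not the paper's exact configurations:

```python
# Parameter counts per adapted d x d weight matrix (illustrative values).
d = 4096          # hidden size, e.g. a LLaMA-2-7B-scale model
r = 8             # assumed LoRA rank
n = 1000          # assumed number of FourierFT spectral coefficients

lora_params = 2 * r * d    # low-rank factors A (d x r) and B (r x d)
fourierft_params = n       # learned coefficients at fixed frequencies

print(lora_params, fourierft_params, fourierft_params / lora_params)
```

With these assumed values FourierFT uses roughly 1.5% of LoRA's per-matrix parameters; the paper's reported ratios depend on its actual hyperparameters.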
Main Takeaways
- FourierFT consistently matches or exceeds LoRA performance across NLU, NLG, and CV tasks.
- Achieves extreme compression rates: ~0.2% of LoRA's parameters for instruction tuning and ~6% for GLUE tasks.
- Parameter efficiency advantage increases as model scale grows (e.g., from RoBERTa Base to Large).
- Frequency bias analysis suggests different tasks may benefit from learning coefficients in specific frequency bands (low vs. high).
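The mechanism behind these takeaways can be sketched as follows: FourierFT-style methods treat the weight delta as the inverse 2D Fourier transform of a sparse spectral matrix, so only the few nonzero coefficients are trainable while their frequency positions stay frozen. A minimal NumPy sketch under assumed shapes and a hypothetical scale `alpha` (not the authors' implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
d1, d2, n = 64, 64, 50                 # assumed weight shape, coefficient count

rows = rng.integers(0, d1, size=n)     # fixed (frozen) frequency positions
cols = rng.integers(0, d2, size=n)
coeffs = rng.standard_normal(n)        # the only trainable parameters

spectrum = np.zeros((d1, d2), dtype=complex)
spectrum[rows, cols] = coeffs          # sparse spectral matrix

alpha = 1.0                            # assumed scaling hyperparameter
delta_w = alpha * np.fft.ifft2(spectrum).real   # dense weight update
```

Restricting `rows`/`cols` to low or high frequencies is one way to probe the frequency-bias observation in the last bullet.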