QuanTA: Efficient High-Rank Fine-Tuning of LLMs with Quantum-Informed Tensor Adaptation

📝 Paper Summary

Parameter-Efficient Fine-Tuning (PEFT), High-Rank Adaptation

QuanTA parameterizes weight updates using a sequence of tensors acting on reshaped input axes—inspired by quantum gates—to achieve high-rank adaptation with fewer parameters than low-rank methods.

Core Problem

Existing methods like LoRA rely on a low-rank hypothesis that fails for complex tasks (like reasoning) where weight updates inherently require high-rank structural changes.

Why it matters:

Complex downstream tasks often have high 'intrinsic rank,' causing low-rank approximations to underperform significantly compared to full fine-tuning
Scaling up model sizes makes full fine-tuning computationally prohibitive, necessitating efficient methods that do not sacrifice expressivity

Concrete Example: On the DROP dataset (discrete reasoning), LoRA's gap to full fine-tuning persists even as its rank is increased, and subspace similarity analysis indicates that the task requires updating high-rank components that low-rank decompositions cannot capture.

Key Novelty

Quantum-Informed Tensor Adaptation (QuanTA)

Reshapes the hidden dimension vector into a multi-dimensional tensor (analogous to a multi-qubit quantum state)
Applies a sequence of sparse tensors (analogous to quantum gates) that operate only on specific axes of the reshaped input
Constructs a high-rank weight update matrix by composing these sparse local tensors; the paper's universality theorem shows that such compositions can represent any matrix
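The three steps above can be illustrated on a toy example. The sketch below is a minimal NumPy illustration, not the authors' implementation: the 4×4×4 shape, the helper names `apply_gate` and `full_matrix`, the Gaussian initialization, and the choice of one tensor per pair of axes are all assumptions made for the example.

```python
import itertools
import numpy as np

def apply_gate(x, gate, axes):
    """Apply a two-axis tensor to a reshaped hidden vector.

    x    : array of shape (d_1, ..., d_N)
    gate : array of shape (d_m, d_n, d_m, d_n), indexed (out_m, out_n, in_m, in_n)
    axes : the pair (m, n) of axes the gate acts on
    """
    m, n = axes
    # Contract the gate's two input indices with axes m and n of x.
    y = np.tensordot(gate, x, axes=([2, 3], [m, n]))
    # tensordot places the gate's output axes first; move them back to m and n.
    return np.moveaxis(y, [0, 1], [m, n])

def full_matrix(gates, pairs, dims):
    """Materialize the d x d matrix implemented by the chain of gates."""
    d = int(np.prod(dims))
    columns = []
    for basis_vector in np.eye(d):
        x = basis_vector.reshape(dims)
        for gate, pair in zip(gates, pairs):
            x = apply_gate(x, gate, pair)
        columns.append(x.reshape(d))
    return np.stack(columns, axis=1)

rng = np.random.default_rng(0)
dims = (4, 4, 4)                                   # assumed toy shape, d = 64
pairs = list(itertools.combinations(range(3), 2))  # (0, 1), (0, 2), (1, 2)
gates = [rng.standard_normal((dims[m], dims[n], dims[m], dims[n]))
         for m, n in pairs]

update = full_matrix(gates, pairs, dims)
n_params = sum(g.size for g in gates)
update_rank = np.linalg.matrix_rank(update)
# Largest LoRA rank r with 2 * d * r parameters inside the same budget.
lora_rank_same_budget = n_params // (2 * update.shape[0])

print(f"trainable parameters: {n_params}")
print(f"rank of the composed update: {update_rank} (d = {update.shape[0]})")
print(f"LoRA rank affordable with the same budget: {lora_rank_same_budget}")
```

With generic random tensors the composed matrix is expected to be full rank, while a LoRA update with the same 768-parameter budget on d = 64 would be capped at rank 6.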

Architecture

Conceptual comparison of the LoRA and QuanTA structures: LoRA as the product of two low-rank matrices versus QuanTA as tensor contractions on reshaped inputs.

Evaluation Highlights

+5.1 F1 points on DROP over LoRA (rank=8) with LLaMA2-70B, with ~40% fewer trainable parameters
Outperforms DoRA and LoRA on 8 commonsense reasoning tasks with LLaMA-3-8B (Avg Accuracy: 85.8% vs DoRA's 85.2%)
Matches or exceeds Full Fine-Tuning (FT) on arithmetic reasoning (GSM8K) while training <0.2% of the parameters

Breakthrough Assessment

8/10

Offers a theoretically grounded (universality theorem) alternative to the dominant low-rank paradigm. Successfully addresses the high-rank update problem in reasoning tasks with extreme parameter efficiency.

⚙️ Technical Details

Problem Definition

Setting: Parameter-Efficient Fine-Tuning of a pre-trained weight matrix W₀ ∈ ℝ^(d×d)

Inputs: Input hidden vector x ∈ ℝ^d

Outputs: Adapted output vector y = W_θ x

Pipeline Flow

Input Reshaping (View vector x as tensor with shape d₁ × ... × d_N)
Sequential Tensor Applications (Apply T⁽α⁾ to specific axes m, n)
Residual Connection (Add result to frozen base model output)
Initialization Correction (Subtract frozen branch S so the update is zero, and the layer matches the base model, at the start of training)
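The four steps above can be sketched with `numpy.einsum`. This is an illustrative toy (d = 16 split as 4×2×2), not the reference implementation: the variable names, the Gaussian initialization, and the choice of three tensors covering every pair of axes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d1, d2, d3 = 4, 2, 2          # assumed toy factorization, d = 16
d = d1 * d2 * d3

def chain(x_flat, t01, t02, t12):
    """Reshape x, then apply one tensor per pair of axes in sequence."""
    x = x_flat.reshape(d1, d2, d3)                 # step 1: input reshaping
    x = np.einsum("abij,ijk->abk", t01, x)         # step 2: acts on axes (0, 1)
    x = np.einsum("acik,ijk->ajc", t02, x)         #         acts on axes (0, 2)
    x = np.einsum("bcjk,ijk->ibc", t12, x)         #         acts on axes (1, 2)
    return x.reshape(d)

W0 = rng.standard_normal((d, d))                   # frozen base weight
shapes = [(d1, d2, d1, d2), (d1, d3, d1, d3), (d2, d3, d2, d3)]
T = [rng.standard_normal(s) for s in shapes]       # trainable tensors
S = [t.copy() for t in T]                          # frozen copy of T at init

def forward(x):
    # steps 3 and 4: base output plus the trainable branch, minus the frozen branch
    return W0 @ x + chain(x, *T) - chain(x, *S)

x = rng.standard_normal(d)
y_init = forward(x)

# Stand-in for a training step: only T changes, S and W0 stay frozen.
T[0] = T[0] + 0.1 * rng.standard_normal(T[0].shape)
y_trained = forward(x)

print("matches base model at init:", np.allclose(y_init, W0 @ x))
print("differs after T is updated:", not np.allclose(y_trained, W0 @ x))
```

Each einsum string names the two contracted input axes (`i`, `j`, or `k`) and replaces them with output axes, leaving the third axis untouched.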

System Modules

Base Layer

Compute original features using frozen pre-trained weights

Model or implementation: Frozen Linear Layer W₀

QuanTA Branch (T)

Compute high-rank weight update via tensor contraction

Model or implementation: Sequence of Trainable Tensors {T⁽α⁾}

Subtraction Branch (S)

Cancel out T at initialization to ensure training starts from base model behavior

Model or implementation: Sequence of Frozen Tensors {S⁽α⁾}

Novel Architectural Elements

Tensor-chain parameterization: Parameterizing the update matrix as a product of sparse tensors operating on reshaped input dimensions (inspired by quantum gates)
Initialization strategy: Using a frozen shadow branch S, a copy of T at initialization, so that y = W₀x + Tx - Sx starts at W₀x without forcing T=0 (which would block gradients)
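Why a zero-initialized chain would block gradients can be seen in a scalar analogue. In the toy below, which is an illustration rather than anything taken from the paper, each "tensor" is a single number, the chain is out = t2 * t1 * x, and the derivatives are written out by hand.

```python
def grads_of_chain(t1, t2, x):
    """Gradients of out = t2 * t1 * x with respect to t1 and t2."""
    return (t2 * x, t1 * x)

x = 3.0

# Option A: start every factor of the chain at zero so the update vanishes.
zero_init = grads_of_chain(0.0, 0.0, x)

# Option B: nonzero trainable chain (t1, t2) minus a frozen copy (s1, s2).
t1, t2 = 0.5, -2.0
s1, s2 = t1, t2                            # frozen shadow branch
update_at_init = t2 * t1 * x - s2 * s1 * x
shadow_init = grads_of_chain(t1, t2, x)    # the frozen term adds no gradient

print("zero init   -> update 0.0, grads", zero_init)
print("shadow init -> update", update_at_init, "grads", shadow_init)
```

In both options the update is zero at the start, but only the shadow-branch option leaves nonzero gradients for the trainable factors.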

Modeling

Base Model: LLaMA-2 (7B, 13B, 70B) and LLaMA-3 (8B)

Training Method: Supervised Fine-Tuning with QuanTA reparameterization

Adaptation: QuanTA (Typical config: 16-8-8-X tensor shapes)

Trainable Parameters: ~0.01% to 0.26% of total model parameters (varies by config)

Key Hyperparameters:

tensor_structure: N=3 or N=4 factors (e.g., d = 4096 decomposed into 16×16×16, or d = 8192 into 16×8×8×8)
initialization: Kaiming initialization for tensors

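Compute: No inference overhead, since the trained tensors can be merged into W₀. Training memory: intermediate activations are hidden vectors of size d. Complexity: d · ∑ d_m d_n per input vector, i.e., linear in d for fixed tensor sizes d_m d_n.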
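The complexity expression can be evaluated for the tensor shapes quoted in this summary. The helper below is a back-of-the-envelope sketch: `quanta_counts` is a hypothetical name, and it assumes one tensor per unordered pair of axes, which this summary does not state explicitly.

```python
from itertools import combinations
from math import prod

def quanta_counts(dims):
    """Parameter and multiply counts for one d x d layer.

    Assumes one tensor per unordered pair of axes (an assumption of this
    sketch), each stored as a (d_m * d_n) x (d_m * d_n) matrix.
    """
    d = prod(dims)
    pair_sizes = [dm * dn for dm, dn in combinations(dims, 2)]
    params = sum(size * size for size in pair_sizes)
    multiplies = d * sum(pair_sizes)      # d * sum over pairs of d_m * d_n
    return d, params, multiplies

for dims in [(16, 16, 16), (16, 8, 8, 4), (16, 8, 8, 8)]:
    d, params, multiplies = quanta_counts(dims)
    print(f"{dims}: d={d}, params={params:,}, "
          f"multiplies={multiplies:,} vs dense d^2={d * d:,}")
```

Under that assumption, the 16-8-8-4 shape uses 43,008 parameters per adapted d = 4096 matrix, compared with 2 · 4096 · 8 = 65,536 for a rank-8 LoRA update on the same matrix.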

Comparison to Prior Work

vs. LoRA: QuanTA enables high-rank updates (supported by the paper's rank representation theorem), whereas a LoRA update has rank at most r
vs. KronA: KronA is the special case of QuanTA in which every tensor acts on a distinct axis; QuanTA lets tensors share axes, which couples them (the analogue of entanglement)
vs. LoTA/LoRETTA: These enforce low tensor rank; QuanTA allows high-rank via composition of local full-rank tensors
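The KronA comparison above can be checked numerically on a two-axis toy. The sketch models KronA as a single Kronecker product of two factors, which is a simplifying assumption, and `kronecker_rank` is a hypothetical helper that tests whether a two-axis tensor factorizes that way.

```python
import numpy as np

rng = np.random.default_rng(0)
d1, d2 = 3, 4
A = rng.standard_normal((d1, d1))    # acts on axis 0 only
B = rng.standard_normal((d2, d2))    # acts on axis 1 only
x = rng.standard_normal(d1 * d2)
X = x.reshape(d1, d2)

# Kronecker-structured update applied to the flat vector.
y_kron = np.kron(A, B) @ x

# The same map written as two single-axis tensors on the reshaped input.
y_axes = np.einsum("ai,bj,ij->ab", A, B, X).reshape(d1 * d2)

def kronecker_rank(t):
    """Rank of the rearranged tensor; 1 means t is a single Kronecker product."""
    rearranged = t.transpose(0, 2, 1, 3).reshape(d1 * d1, d2 * d2)
    return np.linalg.matrix_rank(rearranged)

K = np.einsum("ai,bj->abij", A, B)           # A kron B as a 4-index tensor
G = rng.standard_normal((d1, d2, d1, d2))    # one tensor acting on both axes

print("Kronecker product equals distinct-axis tensors:",
      np.allclose(y_kron, y_axes))
print("Kronecker rank of A kron B:", kronecker_rank(K))
print("Kronecker rank of a generic two-axis tensor:", kronecker_rank(G))
```

A Kronecker rank above 1 for the generic tensor indicates that a tensor acting jointly on two axes can express couplings between them that a single Kronecker product cannot.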

Limitations

Sequential application of tensors to hidden vectors may underutilize GPU parallelism compared to large matrix multiplications
Optimal selection of tensor axes (hyperparameters) is not yet fully automated
Requires hidden dimension d to be decomposable into factors (e.g., d = d₁ × ... × d_N)
Experiments limited to LLaMA series models

Reproducibility

Code: https://github.com/quanta-fine-tuning/quanta

The paper provides theoretical proofs for universality and rank representation in the Appendix. Specific hyperparameters for the tensor shapes (e.g., 16-8-8-4) are provided for reported experiments.

📊 Experiments & Results

Evaluation Setup

Fine-tuning on reasoning-heavy datasets and evaluating on downstream tasks

Benchmarks:

DROP (Discrete Reasoning / Reading Comprehension)
Commonsense170k (eval on BoolQ, PIQA, SIQA, HellaSwag, WinoGrande, ARC-e, ARC-c, OBQA) (Commonsense Reasoning)
MATH10K (eval on GSM8K, MAWPS, SVAMP) (Arithmetic Reasoning)

Metrics:

F1 Score
Accuracy
Statistical methodology: Results averaged over multiple experiments (2-4 seeds) with standard deviation reported in figures

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Performance on DROP (a high-intrinsic-rank task), QuanTA vs. LoRA. The first row is the LLaMA2-70B comparison against LoRA (rank=8).
DROP	F1 Score	74.3	79.4	+5.1
DROP	F1 Score	56.2	59.6	+3.4
Commonsense Reasoning benchmarks across multiple LLaMA models. The first row is the LLaMA-3-8B comparison against DoRA.
Commonsense Avg (8 tasks)	Average Accuracy	85.2	85.8	+0.6
Commonsense Avg (8 tasks)	Average Accuracy	82.3	84.8	+2.5
Arithmetic Reasoning (Math) benchmarks showing efficiency.
GSM8K	Accuracy	65.7	67.0	+1.3

Experiment Figures

Subspace similarity heatmaps comparing weight updates between Rank-64 and Rank-128 LoRA experiments for RTE and DROP datasets.

F1 Score on DROP vs. Number of Trainable Parameters (%) for LLaMA2-7B.

Main Takeaways

QuanTA consistently performs comparably to or better than Full Fine-Tuning and LoRA/DoRA, particularly on tasks with high intrinsic rank like DROP.
Parameter Efficiency: QuanTA frequently requires 10x-20x fewer trainable parameters than LoRA/DoRA to achieve superior results.
High-Rank Capability: Unlike LoRA, which saturates or degrades on complex tasks, QuanTA scales effectively, validating the theoretical claim of high-rank representation.
Scalability: Performance gains are consistent across model scales (7B to 70B) and architectures (LLaMA-2, LLaMA-3).

📚 Prerequisite Knowledge

Prerequisites

Linear Algebra (Matrix Rank, SVD, Tensor products)
Basic Quantum Computing concepts (Quantum Gates, Qubits)
Parameter-Efficient Fine-Tuning (LoRA)

Key Terms

LoRA: Low-Rank Adaptation—a PEFT method that approximates weight updates as the product of two low-rank matrices

PEFT: Parameter-Efficient Fine-Tuning—techniques to adapt large models by training only a small subset of parameters

Quantum Circuit: A model of computation where a state is manipulated by a sequence of unitary gates; used here as an analogy for tensor operations

Einsum: Einstein Summation—a notation for expressing tensor operations (contractions) concisely

Universality Theorem: A theorem stating that any matrix can be decomposed into a sequence of two-axis tensors (analogous to quantum gates)

Composition Openness: The property that the composition of two members of a set of matrices may fall outside that set. Matrices of rank at most r lack this property (their products still have rank at most r), whereas QuanTA has it, allowing expressivity to grow with depth

Intrinsic Rank: The minimum rank required to effectively represent the weight update necessary for a specific downstream task

DoRA: Weight-Decomposed Low-Rank Adaptation—a variant of LoRA that decomposes weights into magnitude and direction