Evaluation Setup
Fine-tuning is evaluated on downstream tasks spanning commonsense and mathematical reasoning, machine translation, and subject-driven image generation
Benchmarks:
- Commonsense Reasoning (8 tasks: BoolQ, PIQA, SIQA, etc.)
- GSM8K / MATH (Mathematical Reasoning)
- IWSLT14 (De-En, En-De) (Machine Translation)
- DreamBooth (Subject-driven Image Generation)
Metrics:
- Accuracy
- BLEU
- CLIP Score
- Statistical methodology: Not explicitly reported in the paper
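Of the metrics above, BLEU (used for the IWSLT14 translation results) can be illustrated with a minimal sentence-level sketch. This is toy code for intuition only, not the paper's evaluation pipeline, which presumably uses corpus-level tooling:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypothesis, reference, max_n=4):
    """Sentence-level BLEU: geometric mean of modified n-gram
    precisions (uniform weights) times a brevity penalty."""
    hyp, ref = hypothesis.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_ngrams, ref_ngrams = ngrams(hyp, n), ngrams(ref, n)
        overlap = sum((hyp_ngrams & ref_ngrams).values())  # clipped counts
        total = max(sum(hyp_ngrams.values()), 1)
        if overlap == 0:
            return 0.0
        log_precisions.append(math.log(overlap / total))
    bp = min(1.0, math.exp(1 - len(ref) / len(hyp)))  # brevity penalty
    return bp * math.exp(sum(log_precisions) / max_n)

# A perfect match scores 1.0 (often reported as 100).
print(bleu("the cat sat on the mat", "the cat sat on the mat"))  # -> 1.0
```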
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
| --- | --- | --- | --- | --- |
| GSM8K | Accuracy | 78.2 | 78.6 | +0.4 |
| MATH | Accuracy | 39.1 | 39.5 | +0.4 |
| Commonsense Avg (8 tasks) | Accuracy | 69.1 | 69.6 | +0.5 |
| DreamBooth | CLIP Score | 29.8 | 30.4 | +0.6 |
Notes:
- Mathematical reasoning results (GSM8K, MATH) on Llama-3-8B show HOFT and SHOFT outperforming LoRA and DoRA baselines.
- Commonsense reasoning averages across 8 datasets use Llama-2-7B and show slight improvements.
- Subject-driven generation (DreamBooth) uses Stable Diffusion; HOFT achieves better CLIP alignment.
Main Takeaways
- HOFT and SHOFT consistently match or outperform LoRA and DoRA across reasoning, translation, and generation tasks.
- The use of two orthogonal matrices (HOFT) instead of one (OFT) is validated theoretically and empirically.
- The CWY-based parameterization provides significant speedups over Cayley-based OFT, making orthogonal fine-tuning practical for large models.
- SHOFT (with scaling) generally performs better than pure HOFT, confirming the importance of magnitude updates alongside directional updates.
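The CWY-based parameterization mentioned in the takeaways builds an orthogonal matrix from a product of Householder reflections in one batched expression instead of a sequential loop. Below is a minimal NumPy sketch of the general CWY identity (an illustration under the standard formulation, not the paper's exact implementation):

```python
import numpy as np

def householder_product_naive(U):
    """Sequential product of reflections H_i = I - 2 u_i u_i^T,
    where the unit-norm u_i are the columns of U."""
    n, k = U.shape
    Q = np.eye(n)
    for i in range(k):
        u = U[:, i:i + 1]
        Q = Q @ (np.eye(n) - 2.0 * u @ u.T)
    return Q

def householder_product_cwy(U):
    """CWY identity: H_1 ... H_k = I - U T^{-1} U^T,
    with T = 0.5 * I + strict_upper(U^T U). One solve replaces
    k sequential matrix products."""
    n, k = U.shape
    T = 0.5 * np.eye(k) + np.triu(U.T @ U, k=1)
    return np.eye(n) - U @ np.linalg.solve(T, U.T)

rng = np.random.default_rng(0)
U = rng.standard_normal((8, 3))
U /= np.linalg.norm(U, axis=0)  # normalize Householder vectors

Q = householder_product_cwy(U)
assert np.allclose(Q, householder_product_naive(U))  # same matrix
assert np.allclose(Q.T @ Q, np.eye(8))               # orthogonal
```

Because the product of reflections is expressed as a single low-rank correction to the identity, it parallelizes well on accelerators, which is consistent with the reported speedups over Cayley-based OFT.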