
Rethinking Attention Output Projection: Structured Hadamard Transforms for Efficient Transformers

Shubham Aggarwal, Lokendra Kumar
arXiv (2026)
Pretraining Memory Benchmark

📝 Paper Summary

Efficient Transformers · Model Compression · Sparse Matrices
The paper replaces the dense output projection in multi-head attention with a fixed, parameter-free Walsh-Hadamard Transform followed by learnable scaling, significantly reducing parameters and compute without accuracy loss.
Core Problem
The dense output projection matrix in multi-head attention scales quadratically with model dimension ($d^2$), consuming ~25% of attention parameters and creating a memory/compute bottleneck.
Why it matters:
  • As models scale, the quadratic growth of projection layers contributes disproportionately to parameter bloat and memory footprint
  • Attention heads often exhibit high redundancy, suggesting that fully dense, unconstrained mixing matrices are computationally wasteful
  • Memory-bandwidth bottlenecks in large-scale inference (especially decoding) are exacerbated by loading these massive dense matrices
Concrete Example: In a standard Transformer block, if the model dimension is large (e.g., 4096), the output projection requires a 4096×4096 matrix multiplication. This dense operation must load ~16 million parameters from memory for every single token, slowing down decoding even if the attention heads themselves computed redundant information.
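The cost figures above follow from simple arithmetic; a quick sketch (the `d_model` value and fp16 assumption are illustrative, not taken from the paper's tables):

```python
# Back-of-envelope cost of the dense output projection
# (illustrative numbers, not from the paper's experiments)
d_model = 4096                       # example model dimension from the text
dense_params = d_model * d_model     # d x d output projection matrix
fp16_bytes = dense_params * 2       # 2 bytes per weight in fp16

assert dense_params == 16_777_216   # ~16.8M parameters (the "~16 million" above)
assert fp16_bytes == 33_554_432     # ~32 MiB of weights streamed per decode step
```

During autoregressive decoding this entire weight matrix must be read from memory for each generated token, which is why the bottleneck is bandwidth rather than FLOPs.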
Key Novelty
Hadamard-based Attention Output Projection
  • Replaces the learned dense mixing matrix with a fixed Walsh-Hadamard Transform (WHT), which mixes information across heads using only additions and subtractions (butterfly structure)
  • Applies a lightweight, learnable per-dimension (diagonal) rescaling after the transform to recover expressivity while keeping the mixing operation itself parameter-free
  • Exploits the $O(d \log d)$ complexity of the Fast Walsh-Hadamard Transform compared to the $O(d^2)$ complexity of standard matrix multiplication
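The mechanism above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: `fwht` is a standard iterative (butterfly) Fast Walsh-Hadamard Transform using only additions and subtractions, and `scale` stands in for the learnable diagonal rescaling (both names are hypothetical):

```python
import numpy as np

def fwht(x):
    """Fast Walsh-Hadamard Transform along the last axis, O(d log d).
    Uses only additions/subtractions (butterfly structure); the last
    dimension must be a power of two."""
    x = np.asarray(x, dtype=float)
    d = x.shape[-1]
    h = 1
    while h < d:
        # Group positions into butterfly pairs of stride h and combine.
        y = x.reshape(*x.shape[:-1], d // (2 * h), 2, h)
        a, b = y[..., 0, :], y[..., 1, :]
        x = np.stack((a + b, a - b), axis=-2).reshape(*x.shape[:-1], d)
        h *= 2
    return x

d = 8                                  # toy model dimension (power of two)
rng = np.random.default_rng(0)
x = rng.standard_normal(d)             # concatenated head outputs (toy input)
scale = rng.standard_normal(d)         # learnable diagonal scaling (hypothetical)
out = scale * fwht(x)                  # parameter-free mixing + cheap rescale

# Sanity check: the butterfly matches multiplication by the
# Sylvester-ordered Hadamard matrix H, but without storing any d x d weights.
H = np.array([[1.0]])
while H.shape[0] < d:
    H = np.block([[H, H], [H, -H]])
assert np.allclose(fwht(x), H @ x)
```

The point of the check at the end is that the fixed transform is mathematically equivalent to multiplying by a structured d×d matrix, yet it stores zero mixing parameters; only the d-dimensional `scale` vector is learned.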
Evaluation Highlights
  • Reduces attention parameters by approximately 25% per block compared to standard multi-head attention
  • Achieves 8.9% peak memory savings during inference, enabling larger batch sizes on hardware-constrained devices
  • Improves throughput by 6.6% on XXL-scale models (largest evaluated configuration) due to reduced memory traffic
Breakthrough Assessment
7/10
A mathematically elegant structural replacement for a major Transformer bottleneck. While gains are moderate (6-9%), the removal of $O(d^2)$ parameters without accuracy loss is a significant architectural efficiency finding.