
ShiftAddLLM: Accelerating Pretrained LLMs via Post-Training Multiplication-Less Reparameterization

Haoran You, Yipin Guo, Yichao Fu, Wei Zhou, Huihong Shi, Xiaofan Zhang, Souvik Kundu, Amir Yazdanbakhsh, Y. Lin
Georgia Institute of Technology
Neural Information Processing Systems (2024)
Tags: Pretraining, Memory

📝 Paper Summary

Tags: LLM Efficiency, Model Compression
ShiftAddLLM reparameterizes pretrained LLM weights into binary matrices and power-of-two scaling factors without any retraining, replacing costly multiplications with efficient bitwise shift-and-add operations to cut both memory and latency.
Core Problem
Deploying LLMs on resource-constrained devices is bottlenecked by high memory demands and costly dense multiplication operations, while existing multiplication-free methods require expensive retraining or fine-tuning.
Why it matters:
  • GPT-3 (175B parameters) requires about 350 GB of memory in FP16 and on the order of 10^15 FLOPs per forward pass, putting edge deployment out of reach
  • Standard quantization (e.g., W8A8) still relies on multiplications, which consume significantly more energy and area than shifts and adds
  • Prior multiplication-less methods like ShiftAddNet require training from scratch, which is computationally prohibitive for large foundation models
Concrete Example: In a standard quantized LLM, a weight-activation product involves a costly floating-point multiplication. In ShiftAddLLM, this same operation is approximated by shifting the activation bits (multiplication by power-of-two) and adding selected results, consuming ~88% less energy for an OPT-66B MLP layer.
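The core arithmetic trick behind this example is easy to verify on fixed-point integers, where multiplying by a power of two is exactly a left bit shift (a minimal illustrative sketch, not the paper's GPU kernel):

```python
# On fixed-point (integer) data, multiplying by 2^k is exactly a left shift.
x = 13                                   # a quantized activation value
assert x * 8 == x << 3                   # x * 2^3 via a single bit shift

# A weight decomposed into powers of two (here 10 = 2^3 + 2^1) then needs
# only shifts and one add instead of a general multiplication.
assert x * 10 == (x << 3) + (x << 1)
print("shift-and-add matches multiplication")
```

Hardware shifters and adders are far cheaper in energy and silicon area than multipliers, which is where the reported savings come from.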
Key Novelty
Post-Training Shift-and-Add Reparameterization
  • Decomposes pretrained weight matrices into multiple binary matrices and group-wise power-of-two scaling factors using Binary-Coding Quantization (BCQ)
  • Converts each matrix multiplication into two steps: (1) bitwise shifts of the activations according to the power-of-two scaling factors, and (2) look-up-table (LUT) queries and additions driven by the binary matrices
  • Optimizes the reparameterization with a multi-objective formulation that jointly minimizes weight reconstruction error and output activation error
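The decomposition and the multiplication-free dot product above can be sketched in NumPy: a greedy BCQ-style decomposition of a weight vector into binary vectors with power-of-two scales, then a dot product that uses only sign-selected adds and power-of-two scalings. The function names (`bcq_pow2`, `shiftadd_dot`) and the greedy rounding scheme are illustrative assumptions, not the paper's exact optimizer (which additionally weights the objective by output activation error):

```python
import numpy as np

def bcq_pow2(w, num_bits=3):
    """Greedy binary-coding quantization with power-of-two scales.

    Approximates w ~= sum_i alpha_i * b_i, where each b_i is in {-1,+1}^n
    and each alpha_i is snapped to the nearest power of two so that the
    later scaling can become a bit shift. Illustrative sketch only.
    """
    residual = w.astype(np.float64).copy()
    alphas, bits = [], []
    for _ in range(num_bits):
        b = np.where(residual >= 0, 1.0, -1.0)        # binary basis vector
        scale = np.abs(residual).mean()               # least-squares scale for a +-1 basis
        alpha = 2.0 ** np.round(np.log2(max(scale, 1e-12)))  # snap to power of two
        alphas.append(alpha)
        bits.append(b)
        residual -= alpha * b                         # peel off this component
    return np.array(alphas), np.stack(bits)

def shiftadd_dot(x, alphas, bits):
    """Compute x . w_hat using only sign-selected adds and power-of-two scales."""
    out = 0.0
    for alpha, b in zip(alphas, bits):
        # b is +-1, so (x * b).sum() is just adds/subtracts; alpha is a
        # power of two, so the final scaling is a bit shift in fixed point.
        out += alpha * np.sum(np.where(b > 0, x, -x))
    return out
```

With more binary components the reconstruction error of `w` shrinks, mirroring the accuracy/bit-width trade-off reported at 2-bit vs. 3-bit precision.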
Evaluation Highlights
  • Achieves average perplexity reductions of 5.6 and 22.7 points at 3-bit and 2-bit precision, respectively, over the most competitive quantized LLM baselines (e.g., GPTQ, LUT-GEMM)
  • Reduces energy consumption by >80% compared to original FP16 LLMs across five LLM families
  • Maintains comparable or lower latency than state-of-the-art quantized kernels (LUT-GEMM) while improving accuracy significantly
Breakthrough Assessment
8/10
First post-training multiplication-less reparameterization for LLMs. Successfully bridges the gap between efficient shift-add arithmetic and pretrained foundation models without expensive retraining.