
ShiftAddLLM: Accelerating Pretrained LLMs via Post-Training Multiplication-Less Reparameterization

Haoran You, Yipin Guo, Yichao Fu, Wei Zhou, Huihong Shi, Xiaofan Zhang, Souvik Kundu, Amir Yazdanbakhsh, Y. Lin
Georgia Institute of Technology
Neural Information Processing Systems (2024)
Tags: Pretraining, Memory

📝 Paper Summary

Tags: LLM Efficiency, Model Compression
ShiftAddLLM reparameterizes pretrained LLM weights into binary matrices and power-of-two scaling factors without any retraining, replacing costly multiplications with efficient bitwise shift-and-add operations to cut both memory and latency.
Core Problem
Deploying LLMs on resource-constrained devices is bottlenecked by high memory demands and costly dense multiplication operations, while existing multiplication-free methods require expensive retraining or fine-tuning.
Why it matters:
  • GPT-3 (175B parameters) requires about 350 GB of memory in FP16 and on the order of 10^15 FLOPs per forward pass, putting edge deployment out of reach
  • Standard quantization (e.g., W8A8) still relies on multiplications, which consume significantly more energy and area than shifts and adds
  • Prior multiplication-less methods like ShiftAddNet require training from scratch, which is computationally prohibitive for large foundation models
Concrete Example: In a standard quantized LLM, a weight-activation product involves a costly floating-point multiplication. In ShiftAddLLM, this same operation is approximated by shifting the activation bits (multiplication by power-of-two) and adding selected results, consuming ~88% less energy for an OPT-66B MLP layer.
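The core arithmetic trick behind this example is easy to verify on fixed-point integers, where multiplying by a power of two is exactly a left bit shift (a minimal illustrative sketch, not the paper's GPU kernel):

```python
# On fixed-point (integer) data, multiplying by 2^k is exactly a left shift.
x = 13                                   # a quantized activation value
assert x * 8 == x << 3                   # x * 2^3 via a single bit shift

# A weight decomposed into powers of two (here 10 = 2^3 + 2^1) then needs
# only shifts and one add instead of a general multiplication.
assert x * 10 == (x << 3) + (x << 1)
print("shift-and-add matches multiplication")
```

Hardware shifters and adders are far cheaper in energy and silicon area than multipliers, which is where the reported savings come from.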
Key Novelty
Post-Training Shift-and-Add Reparameterization
  • Decomposes pretrained weight matrices into multiple binary matrices and group-wise power-of-two scaling factors using Binary-Coding Quantization (BCQ)
  • Converts each matrix multiplication into two steps: (1) bitwise shifts of the activations according to the power-of-two scaling factors, and (2) look-up-table (LUT) queries and additions driven by the binary matrices
  • Optimizes the reparameterization with a multi-objective formulation that jointly minimizes weight reconstruction error and output activation error
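The decomposition and the multiplication-free dot product above can be sketched in NumPy: a greedy BCQ-style decomposition of a weight vector into binary vectors with power-of-two scales, then a dot product that uses only sign-selected adds and power-of-two scalings. The function names (`bcq_pow2`, `shiftadd_dot`) and the greedy rounding scheme are illustrative assumptions, not the paper's exact optimizer (which additionally weights the objective by output activation error):

```python
import numpy as np

def bcq_pow2(w, num_bits=3):
    """Greedy binary-coding quantization with power-of-two scales.

    Approximates w ~= sum_i alpha_i * b_i, where each b_i is in {-1,+1}^n
    and each alpha_i is snapped to the nearest power of two so that the
    later scaling can become a bit shift. Illustrative sketch only.
    """
    residual = w.astype(np.float64).copy()
    alphas, bits = [], []
    for _ in range(num_bits):
        b = np.where(residual >= 0, 1.0, -1.0)        # binary basis vector
        scale = np.abs(residual).mean()               # least-squares scale for a +-1 basis
        alpha = 2.0 ** np.round(np.log2(max(scale, 1e-12)))  # snap to power of two
        alphas.append(alpha)
        bits.append(b)
        residual -= alpha * b                         # peel off this component
    return np.array(alphas), np.stack(bits)

def shiftadd_dot(x, alphas, bits):
    """Compute x . w_hat using only sign-selected adds and power-of-two scales."""
    out = 0.0
    for alpha, b in zip(alphas, bits):
        # b is +-1, so (x * b).sum() is just adds/subtracts; alpha is a
        # power of two, so the final scaling is a bit shift in fixed point.
        out += alpha * np.sum(np.where(b > 0, x, -x))
    return out
```

With more binary components the reconstruction error of `w` shrinks, mirroring the accuracy/bit-width trade-off reported at 2-bit vs. 3-bit precision.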
Evaluation Highlights
  • Achieves average perplexity reductions of 5.6 and 22.7 points at 3-bit and 2-bit precision, respectively, over the most competitive quantized LLM baselines (e.g., GPTQ, LUT-GEMM)
  • Reduces energy consumption by >80% compared to original FP16 LLMs across five LLM families
  • Maintains comparable or lower latency than state-of-the-art quantized kernels (LUT-GEMM) while improving accuracy significantly
Breakthrough Assessment
8/10
First post-training multiplication-less reparameterization for LLMs. Successfully bridges the gap between efficient shift-add arithmetic and pretrained foundation models without expensive retraining.