
SHViT: Single-Head Vision Transformer with Memory Efficient Macro Design

Seokju Yun, Youngmin Ro
Machine Intelligence Laboratory, University of Seoul, Korea
Computer Vision and Pattern Recognition (2024)

📝 Paper Summary

Efficient Vision Transformers, Mobile/Edge Computer Vision
SHViT reduces computational redundancy by combining a single-head attention mechanism with a large-stride patchify stem, achieving state-of-the-art speed-accuracy tradeoffs on resource-constrained devices.
Core Problem
Existing efficient Vision Transformers suffer from computational redundancy in both macro design (processing too many tokens in early stages) and micro design (using unnecessary multiple attention heads).
Why it matters:
  • High memory access costs and latency prevent ViTs from running as efficiently as CNNs on mobile and edge devices
  • Multi-head attention incurs quadratic complexity and memory-bound overheads (reshaping, normalization) that bottleneck inference speed
  • Standard 4-stage designs with small patch sizes create a severe speed bottleneck in early stages due to excessive token counts
Concrete Example: In standard designs with a 4x4 patchify stem, the first stage processes 3,136 tokens (56 x 56) for a 224x224 image. The paper finds that switching to a larger-stride 16x16 stem, which yields just 196 tokens (14 x 14), makes inference 3.0x faster on GPU with only a minimal accuracy drop, indicating that the original high token count was largely redundant.
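The token counts in this example follow directly from the patch stride. The short Python sketch below reproduces that arithmetic, assuming non-overlapping square patches; the function name `token_count` is illustrative and not taken from the paper. The final ratio compares the number of entries in a full token-to-token attention map and is only a rough indicator of why fewer tokens help. It is not the measured 3.0x GPU speedup.

```python
def token_count(image_size: int, patch_stride: int) -> int:
    """Number of tokens produced by a non-overlapping patchify stem."""
    side = image_size // patch_stride
    return side * side


# 4x4 patchify stem on a 224x224 image: 56 x 56 tokens.
tokens_stride4 = token_count(224, 4)

# 16x16 patchify stem on the same image: 14 x 14 tokens.
tokens_stride16 = token_count(224, 16)

# How many times fewer tokens the large-stride stem produces.
token_ratio = tokens_stride4 // tokens_stride16

# A full attention map has one entry per token pair, so its size
# shrinks with the square of the token ratio.
attention_map_ratio = (tokens_stride4 ** 2) // (tokens_stride16 ** 2)

print(tokens_stride4, tokens_stride16, token_ratio, attention_map_ratio)
```

The 16x fewer tokens come purely from the stride change; how much of that translates into latency depends on the layers used in each stage and on the hardware.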
Key Novelty
Single-Head Attention with Large-Stride Stem (SHViT)
  • Macro Design: Replaces the standard 4-stage design and its 4x4 patchify stem with a 3-stage design built on a 16x16 patchify stem, aggressively reducing token count and memory access costs early on
  • Micro Design: Introduces Single-Head Self-Attention (SHSA) that applies attention to only a subset of channels (partial channel strategy) to capture global context without the overhead of multi-head mechanisms
  • Combines depthwise convolutions (for local details) and single-head attention (for global context) in parallel within a single block for efficient feature mixing
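The partial-channel idea in the Micro Design bullet can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name, weight names, channel sizes, and random initialization are all hypothetical, and normalization layers and the depthwise convolution are omitted. Single-head attention is applied to the first `c_p` channels only, the remaining channels pass through untouched, and a final projection mixes the two groups.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def single_head_partial_attention(x, w_q, w_k, w_v, w_o):
    """Single-head attention over a subset of channels (illustrative sketch).

    x:   (N, C) token features
    w_q: (C_p, d_qk), w_k: (C_p, d_qk), w_v: (C_p, C_p)
    w_o: (C, C) output projection applied after concatenation
    """
    c_p = w_q.shape[0]
    # Split channels: only x_att goes through attention.
    x_att, x_res = x[:, :c_p], x[:, c_p:]

    q = x_att @ w_q
    k = x_att @ w_k
    v = x_att @ w_v

    # One head: a single (N, N) attention map, no head reshaping.
    scores = (q @ k.T) / np.sqrt(q.shape[-1])
    out_att = softmax(scores, axis=-1) @ v

    # Rejoin the attended channels with the untouched ones, then project.
    return np.concatenate([out_att, x_res], axis=-1) @ w_o


# Hypothetical sizes chosen only for the demonstration.
rng = np.random.default_rng(0)
n_tokens, channels, c_p, d_qk = 196, 64, 16, 8

x = rng.standard_normal((n_tokens, channels))
w_q = rng.standard_normal((c_p, d_qk)) * 0.1
w_k = rng.standard_normal((c_p, d_qk)) * 0.1
w_v = rng.standard_normal((c_p, c_p)) * 0.1
w_o = rng.standard_normal((channels, channels)) * 0.1

y = single_head_partial_attention(x, w_q, w_k, w_v, w_o)

# With an identity output projection, the non-attended channels are unchanged.
y_identity = single_head_partial_attention(x, w_q, w_k, w_v, np.eye(channels))
```

Because only `c_p` of the `channels` features enter the attention computation, the query, key, and value projections are much smaller than in a full-width layer, and the single head avoids the reshape operations that multi-head attention needs.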
Evaluation Highlights
  • SHViT-S4 is 2.4x faster than MobileViTv2-1.0 on iPhone 12 while being 1.3% more accurate on ImageNet-1k
  • Outperforms EfficientNet-B0 by 2.3% in accuracy while being 69.4% faster on an A100 GPU and 90.6% faster on an Intel CPU
  • For object detection on COCO, SHViT-S4 is 3.2x faster on A100 GPU and 8.2x faster on mobile compared to MobileFormer
Breakthrough Assessment
7/10
Strong practical contribution. While it does not introduce a new paradigm, it rigorously analyzes redundancy to create a highly optimized architecture that significantly outperforms existing efficient ViTs and CNNs on speed/accuracy benchmarks.