EfficientViT: Memory Efficient Vision Transformer with Cascaded Group Attention

📝 Paper Summary

Efficient Vision Transformers, Mobile Vision

EfficientViT accelerates vision transformers in two ways: a sandwich layout that replaces most memory-bound attention layers with feed-forward networks, and cascaded group attention, which feeds each attention head a different split of the features to reduce redundancy.

Core Problem

Vision Transformers (ViTs) suffer from slow wall-clock inference speed on edge devices despite low FLOP counts, primarily due to memory access overhead and computational redundancy.

Why it matters:

High theoretical efficiency (low FLOPs) in existing lightweight ViTs often does not translate to actual speedups on GPUs/CPUs due to memory-bound operations
Real-time applications (e.g., mobile) require high throughput, but standard ViTs like Swin or DeiT are hindered by frequent tensor reshaping and element-wise functions
Redundancy in attention maps means compute resources are wasted calculating similar features across different heads

Concrete Example: In standard Multi-Head Self-Attention (MHSA), all heads process the full input feature, requiring expensive reshaping and copying. The paper shows that reducing MHSA layers (which are memory-bound) to just 20-40% of the network and substituting them with FFNs (Feed Forward Networks) reduces runtime significantly without hurting accuracy.
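The gap between FLOPs and runtime can be illustrated with arithmetic intensity (FLOPs per byte moved). The sketch below is a rough cost model, not a measurement from the paper: the `MACHINE_BALANCE` threshold, the 4-byte element size, and the tensor shapes are all hypothetical values chosen for illustration.

```python
def matmul_cost(m: int, k: int, n: int, bytes_per_elem: int = 4):
    """FLOPs and memory traffic for an (m, k) @ (k, n) matrix product."""
    flops = 2 * m * k * n  # one multiply and one add per inner-product term
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)  # read A, B; write C
    return flops, bytes_moved


def elementwise_cost(num_elems: int, bytes_per_elem: int = 4):
    """FLOPs and memory traffic for a unary element-wise op (e.g. an activation)."""
    flops = num_elems  # assume one operation per element
    bytes_moved = bytes_per_elem * 2 * num_elems  # read input, write output
    return flops, bytes_moved


def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """FLOPs performed per byte of memory traffic."""
    return flops / bytes_moved


# Hypothetical hardware balance point in FLOPs/byte (an assumption, not a V100 spec).
MACHINE_BALANCE = 10.0


def is_memory_bound(flops: float, bytes_moved: float,
                    balance: float = MACHINE_BALANCE) -> bool:
    """An op is memory-bound when its intensity falls below the balance point."""
    return arithmetic_intensity(flops, bytes_moved) < balance


# Hypothetical shapes: 196 tokens with 192 channels.
print(is_memory_bound(*matmul_cost(196, 192, 192)))
print(is_memory_bound(*elementwise_cost(196 * 192)))
```

Under this model a pure reshape or copy performs no FLOPs at all, so its intensity is zero regardless of the threshold chosen.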

Key Novelty

Sandwich Layout & Cascaded Group Attention

Sandwich Layout: Places a single memory-expensive self-attention layer between multiple memory-efficient FFN layers, reducing memory access overhead while maintaining channel communication
Cascaded Group Attention (CGA): Splits input features into chunks and feeds a different split to each attention head, then cascades the output of one head to the next, reducing redundancy
Parameter Reallocation: Redistributes parameters based on Taylor pruning analysis, expanding critical value projections while shrinking less important hidden dimensions in FFNs
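The Taylor-based importance estimate behind parameter reallocation can be sketched as a gradient-weight product per channel. This is one common first-order form and is an assumption here; the exact criterion used in the paper may differ. The function names and toy numbers are hypothetical.

```python
def taylor_channel_importance(weights, grads):
    """Per-channel importance as |sum(gradient * weight)|.

    weights, grads: lists of channels, each a list of floats of equal length.
    """
    return [
        abs(sum(w * g for w, g in zip(channel_w, channel_g)))
        for channel_w, channel_g in zip(weights, grads)
    ]


def rank_channels(importance):
    """Channel indices sorted from most to least important."""
    return sorted(range(len(importance)), key=lambda c: importance[c], reverse=True)


# Toy example with three channels of two weights each (made-up values).
weights = [[1.0, 2.0], [0.1, 0.1], [3.0, -1.0]]
grads = [[0.5, 0.5], [0.1, 0.1], [1.0, 1.0]]
scores = taylor_channel_importance(weights, grads)
print(rank_channels(scores))
```

Channels that rank low are candidates for shrinking, while components whose channels rank consistently high are candidates for expansion.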

Architecture

Overview of EfficientViT architecture, the Sandwich Layout block, and Cascaded Group Attention (CGA) module.

Evaluation Highlights

EfficientViT-M5 achieves 77.1% accuracy on ImageNet-1K, surpassing MobileNetV3-Large by 1.9% while running 40.4% faster on V100 GPU
EfficientViT-M2 outperforms MobileViT-XXS by 1.8% in accuracy while running 5.8x faster on V100 GPU and 3.7x faster on Intel Xeon CPU
Converted to ONNX format, EfficientViT-M2 runs 7.4x faster than MobileViT-XXS on CPU

Breakthrough Assessment

7/10

Strong engineering contribution. While the components (group conv concepts applied to attention) are evolutionary, the systematic analysis of memory-bound bottlenecks and the resulting speed/accuracy Pareto frontier shift are significant for practical deployment.

⚙️ Technical Details

Problem Definition

Setting: Image classification and downstream vision tasks (detection, segmentation) on resource-constrained hardware

Inputs: Input image (typically 224x224x3)

Outputs: Class probabilities or feature maps

Pipeline Flow

Input Image -> Overlapping Patch Embedding
Stage 1 (EfficientViT Blocks)
EfficientViT Subsample
Stage 2 (EfficientViT Blocks)
EfficientViT Subsample
Stage 3 (EfficientViT Blocks)
Global Avg Pool -> Classifier
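The pipeline above can be traced as a sequence of feature-map shapes. The sketch below assumes a 16x reduction in the patch embedding and a 2x reduction (rounded up) at each subsample step; these factors and the channel widths are illustrative assumptions and are not stated in this summary.

```python
def trace_shapes(image_size: int = 224, patch_stride: int = 16,
                 widths=(64, 128, 192)):
    """Return (stage, height, width, channels) for each stage.

    patch_stride and widths are hypothetical defaults chosen for illustration.
    """
    side = image_size // patch_stride  # resolution after patch embedding
    shapes = []
    for index, channels in enumerate(widths):
        if index > 0:
            side = (side + 1) // 2  # assumed stride-2 subsample, rounding up
        shapes.append((f"stage{index + 1}", side, side, channels))
    return shapes


for stage in trace_shapes():
    print(stage)
```

With these assumptions, each subsample step roughly quarters the number of tokens the next stage's attention has to process.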

System Modules

EfficientViT Block (Sandwich Layout) (Feature Extraction)

Process spatial and channel information efficiently

Model or implementation: Sandwich structure: FFN -> Token Interaction -> MHSA (or CGA) -> FFN
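The ordering in the sandwich structure can be sketched as a composition of sublayers. This is a framework-agnostic illustration, not the authors' implementation: the residual connection around each sublayer is an assumption, the token-interaction step is omitted, and `sandwich_block` is a hypothetical name.

```python
from typing import Callable, Sequence


def sandwich_block(x, pre_ffns: Sequence[Callable], attention: Callable,
                   post_ffns: Sequence[Callable]):
    """Apply FFN layers, then exactly one attention layer, then more FFN layers.

    Each sublayer is wrapped in a residual add (assumed, as in standard
    transformer blocks). x can be a float or any array type supporting `+`.
    """
    for ffn in pre_ffns:
        x = x + ffn(x)
    x = x + attention(x)  # the single memory-expensive layer in the block
    for ffn in post_ffns:
        x = x + ffn(x)
    return x


# Toy usage with scalar "features" and made-up sublayers.
out = sandwich_block(
    2.0,
    pre_ffns=[lambda v: 0.5 * v],
    attention=lambda v: 1.0,
    post_ffns=[lambda v: 0.0],
)
print(out)
```

Passing the sublayers in as callables keeps the attention layer swappable, so either MHSA or CGA can fill the middle slot.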

Cascaded Group Attention (CGA) (Feature Extraction)

Compute attention with reduced redundancy and enhanced diversity

Model or implementation: Heads fed with feature splits X_{ij}; output of head j added to input of head j+1
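A minimal NumPy sketch of the cascade described above follows. It is not the authors' implementation. It assumes scaled dot-product attention per head, a value width equal to the split width (so the head output can be added to the next split), and a final output projection `wp`. Any token-interaction step is omitted, and all weight names are hypothetical.

```python
import numpy as np


def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def cascaded_group_attention(x, wq, wk, wv, wp):
    """x: (N, C) tokens. wq[j], wk[j]: (C/h, d_k). wv[j]: (C/h, C/h). wp: (C, C)."""
    num_heads = len(wq)
    splits = np.split(x, num_heads, axis=-1)  # one channel chunk per head
    outputs = []
    prev = None
    for j in range(num_heads):
        # Cascade: add the previous head's output to this head's input split.
        head_in = splits[j] if prev is None else splits[j] + prev
        q, k, v = head_in @ wq[j], head_in @ wk[j], head_in @ wv[j]
        attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (N, N) attention map
        prev = attn @ v
        outputs.append(prev)
    # Concatenate head outputs and mix channels (assumed output projection).
    return np.concatenate(outputs, axis=-1) @ wp


rng = np.random.default_rng(0)
N, C, h, d_k = 5, 8, 2, 3  # toy sizes
x = rng.normal(size=(N, C))
wq = [rng.normal(size=(C // h, d_k)) for _ in range(h)]
wk = [rng.normal(size=(C // h, d_k)) for _ in range(h)]
wv = [rng.normal(size=(C // h, C // h)) for _ in range(h)]
out = cascaded_group_attention(x, wq, wk, wv, np.eye(C))
print(out.shape)
```

Because each head projects only a C/h-wide chunk rather than the full feature, the per-head Q/K/V projections are smaller than in standard MHSA, while the cascade lets later heads see information refined by earlier ones.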

Novel Architectural Elements

Sandwich Layout Block: N FFN layers wrapping a single Attention layer (vs. 1:1 ratio in standard ViT)
Cascaded Group Attention: Input features split across heads; cascaded addition of head outputs (Head j output -> Head j+1 input)
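The effect of the sandwich layout on the share of attention layers is simple arithmetic. The sketch below counts layers per block; the FFN counts passed in are hypothetical examples, not the configuration of any specific EfficientViT variant.

```python
def attention_fraction(num_attention: int, num_ffn: int) -> float:
    """Fraction of a block's layers that are attention layers."""
    return num_attention / (num_attention + num_ffn)


# Standard ViT block: one attention layer per FFN layer.
print(attention_fraction(1, 1))
# Sandwich-style blocks: one attention layer among several FFN layers.
print(attention_fraction(1, 2))
print(attention_fraction(1, 4))
```

Moving from a 1:1 ratio to one attention layer among four FFN layers cuts the attention share from one half to one fifth.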

Modeling

Base Model: EfficientViT (M0 to M5 variants)

Training Method: Supervised learning (Image Classification)

Training Data:

ImageNet-1K

Key Hyperparameters:

epochs: 300
batch_size: 2048
learning_rate: 1e-3
weight_decay: 2.5e-2
optimizer: AdamW
augmentation: Mixup, AutoAugment, Random Erasing
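The hyperparameters listed above can be collected into a small configuration object. The class and field names are hypothetical, and the per-GPU batch size assumes an even data-parallel split across the 8 training GPUs, which this summary does not state explicitly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Training settings as reported in this summary."""
    epochs: int = 300
    batch_size: int = 2048
    learning_rate: float = 1e-3
    weight_decay: float = 2.5e-2
    optimizer: str = "AdamW"
    augmentation: tuple = ("Mixup", "AutoAugment", "Random Erasing")
    num_gpus: int = 8  # training hardware: 8 Nvidia V100 GPUs

    @property
    def per_gpu_batch_size(self) -> int:
        # Assumes the global batch is split evenly across GPUs.
        return self.batch_size // self.num_gpus


config = TrainConfig()
print(config.per_gpu_batch_size)
```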

Compute: Training: 8 Nvidia V100 GPUs. Inference throughput measured on V100 GPU and Intel Xeon E5-2690 v4 CPU.

Comparison to Prior Work

vs. MobileNetV3: EfficientViT adds global attention capabilities via CGA while maintaining speed
vs. MobileViT: EfficientViT focuses on wall-clock speed (memory efficiency) rather than just parameter/FLOP reduction
vs. Swin-T: EfficientViT reduces the ratio of memory-bound MHSA layers relative to FFNs
vs. LeViT: EfficientViT uses BN/ReLU and sandwich layout, outperforming LeViT in ONNX speed [cited in paper]

Limitations

Model size is larger than MobileNetV3 (e.g., M5 has 12.4M parameters vs. 5.4M for MobileNetV3-Large, more than twice as many) despite faster inference
Slower than MobileNetV3-Large when converted to ONNX format (11.5% slower), attributed to reshaping operations in self-attention
Performance on 'Cars' fine-grained dataset slightly inferior to CNN baselines

Reproducibility

The paper states that code is 'available at here'; the underlying link was in the original PDF and was not preserved in the extracted text. Training hyperparameters and the architecture variants (M0-M5) are provided in tables, and the hardware used for throughput testing is specified.

📊 Experiments & Results

Evaluation Setup

ImageNet-1K Classification

Benchmarks:

ImageNet-1K (Image Classification)
COCO val2017 (Object Detection)

Metrics:

Top-1 Accuracy (%)
Throughput (images/s) on GPU/CPU
FLOPs (M/G)
Parameters (M)
Statistical methodology: Not explicitly reported in the paper

Key Results

EfficientViT demonstrates superior trade-offs between accuracy and real-world throughput compared to state-of-the-art CNNs and ViTs on ImageNet.

Benchmark	Metric	Baseline	This Paper	Δ
ImageNet-1K	Top-1 Accuracy (%)	75.2	77.1	+1.9
ImageNet-1K	GPU Throughput (imgs/s)	7560	10621	+3061
ImageNet-1K	Top-1 Accuracy (%)	69.0	70.8	+1.8
ImageNet-1K	GPU Throughput (imgs/s)	4456	18218	+13762
ImageNet-1K	Top-1 Accuracy (%)	70.2	71.3	+1.1
ImageNet-1K	ONNX Speed (imgs/s)	87.5	108.6	+21.1
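Absolute throughput deltas are easier to compare once converted to relative gains. The helpers below are a small illustrative sketch with hypothetical names; the worked example uses the first accuracy/throughput pair in the table, which matches the M5 vs. MobileNetV3-Large figures quoted in the highlights.

```python
def speedup(baseline: float, ours: float) -> float:
    """Throughput ratio: values above 1.0 mean `ours` is faster."""
    return ours / baseline


def percent_faster(baseline: float, ours: float) -> float:
    """Relative throughput gain over the baseline, in percent."""
    return 100.0 * (ours - baseline) / baseline


def accuracy_gain(baseline: float, ours: float) -> float:
    """Top-1 difference in percentage points, rounded to one decimal."""
    return round(ours - baseline, 1)


# First pair of rows in the table above.
print(accuracy_gain(75.2, 77.1))
print(f"{speedup(7560, 10621):.2f}x")
print(f"{percent_faster(7560, 10621):.1f}%")
```

The relative gain comes out at roughly 40%, in line with the figure quoted in the evaluation highlights.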

Experiment Figures

Speed (Throughput) vs Accuracy plot on ImageNet-1K for EfficientViT and competitors.

Runtime profiling of Swin-T and DeiT-T showing time spent on memory-bound vs compute-bound operations.

Main Takeaways

Memory-bound operations in MHSA (reshaping, element-wise functions), rather than FLOPs alone, are the primary bottleneck for ViT speed.
Reducing MHSA frequency (Sandwich Layout) significantly improves throughput without compromising accuracy.
Cascaded Group Attention effectively mitigates head redundancy and improves computational efficiency.
Parameter reallocation (expanding Value projections, shrinking FFN expansion) improves parameter efficiency.

📚 Prerequisite Knowledge

Prerequisites

Vision Transformer (ViT) architecture (MHSA, FFN)
Memory-bound vs. Compute-bound operations
Convolutional Neural Networks (Depthwise Conv)

Key Terms

MHSA: Multi-Head Self-Attention—the core component of Transformers that computes relationships between all tokens

FFN: Feed-Forward Network—a simpler layer usually consisting of two linear transformations and an activation

Memory-bound: Operations where execution speed is limited by how fast data can be moved between memory and the processor, rather than calculation speed

Sandwich Layout: A proposed block structure where one attention layer is placed between multiple FFN layers to minimize memory-heavy attention operations

CGA: Cascaded Group Attention—a novel attention mechanism where heads receive different splits of the input feature and outputs are cascaded

ONNX: Open Neural Network Exchange—an open format for representing machine learning models, often used for deployment

Taylor structured pruning: A method to estimate the importance of network channels using gradient-weight products to guide parameter allocation

FLOPs: Floating Point Operations—a theoretical measure of compute cost that is often only loosely correlated with actual latency