PIM-Opt: Demystifying Distributed Optimization Algorithms on a Real-World Processing-In-Memory System

📝 Paper Summary

Processing-In-Memory (PIM) Benchmarking Distributed Optimization Algorithms

PIM-Opt evaluates distributed optimization algorithms on real-world UPMEM PIM hardware, demonstrating that communication-efficient algorithms like ADMM are essential to overcome the host-to-memory bandwidth bottleneck.

Core Problem

Training large-scale ML models on processor-centric architectures (CPUs/GPUs) is bottlenecked by data movement, yet existing Processing-In-Memory (PIM) research lacks evaluation of popular distributed SGD algorithms on real-world PIM hardware.

Why it matters:

Data movement between memory and processors is the primary energy and performance bottleneck for large-scale ML training
Prior PIM works often evaluate Gradient Descent (not the widely used SGD) or rely on simulation rather than real hardware
The UPMEM PIM system resembles a distributed system but lacks direct inter-node communication, creating unique algorithmic constraints

Concrete Example: When training Logistic Regression on the Criteo dataset, standard MA-SGD (Model Averaging SGD) generates 64x more data movement between PIM and the host than ADMM (Alternating Direction Method of Multipliers), causing the host CPU to become a communication bottleneck.

Key Novelty

Communication-Aware Algorithm Selection for PIM

Identifies that UPMEM PIM acts as a distributed system with a strict star topology (host as central node), making communication with the host the primary bottleneck
Demonstrates that algorithms like ADMM (which synchronize less frequently) vastly outperform standard SGD variants (MA-SGD, GA-SGD) on PIM hardware by minimizing host-device data transfer
Implements quantized training pipelines specifically optimized for the integer-only arithmetic units available in UPMEM DPUs
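To make the "synchronize less frequently" point concrete, the following is a minimal sketch of textbook consensus ADMM for L1-regularized logistic regression on a star topology. It is not the paper's implementation: the function names, the SGD inner solver, the toy data, and the values of `rho`, `lam`, and `lr` are all assumptions chosen for illustration. Each worker runs many local steps, and the host is involved only once per round.

```python
import math
import random

def soft_threshold(v, kappa):
    """Proximal operator of kappa * |v|, applied on the host in the z-update."""
    return math.copysign(max(abs(v) - kappa, 0.0), v)

def local_update(x, z, u, partition, rho, lr=0.1, epochs=2):
    """Worker step: approximately minimize the local logistic loss plus
    (rho/2) * ||x - z + u||^2 using plain SGD (an assumed inner solver)."""
    x = list(x)
    for _ in range(epochs):
        for feats, y in partition:  # y is +1 or -1
            margin = y * sum(xi * fi for xi, fi in zip(x, feats))
            # Derivative of log(1 + exp(-y * s)) with respect to s = w . feats.
            g_scale = -y / (1.0 + math.exp(margin))
            x = [xi - lr * (g_scale * fi + rho * (xi - zi + ui))
                 for xi, fi, zi, ui in zip(x, feats, z, u)]
    return x

def admm_round(xs, us, z, partitions, lam, rho):
    """One round: local solves, a single gather at the host, then dual updates."""
    n = len(partitions)
    xs = [local_update(x, z, u, p, rho) for x, u, p in zip(xs, us, partitions)]
    # Host: average (x_i + u_i), then apply the L1 proximal step to get z.
    avg = [sum(x[j] + u[j] for x, u in zip(xs, us)) / n for j in range(len(z))]
    z = [soft_threshold(a, lam / (rho * n)) for a in avg]
    # Dual update (computed here for brevity; each worker could do it locally).
    us = [[uj + xj - zj for uj, xj, zj in zip(u, x, z)] for u, x in zip(us, xs)]
    return xs, us, z

# Toy data: the label is the sign of the first of three features.
rng = random.Random(0)
data = []
for _ in range(200):
    feats = [rng.uniform(-1, 1) for _ in range(3)]
    data.append((feats, 1 if feats[0] > 0 else -1))

num_workers = 4
partitions = [data[i::num_workers] for i in range(num_workers)]
dim = 3
z = [0.0] * dim
xs = [[0.0] * dim for _ in range(num_workers)]
us = [[0.0] * dim for _ in range(num_workers)]
for _ in range(10):
    xs, us, z = admm_round(xs, us, z, partitions, lam=0.01, rho=1.0)
```

In this sketch the host exchanges one model-sized vector per worker per round, regardless of how many local SGD steps each worker takes, which is the property that reduces host-device traffic.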

Architecture

High-level workflow for distributed optimization algorithms on the UPMEM PIM system

Evaluation Highlights

Training SVM with GA-SGD on PIM is 1.94x faster than a dual AMD EPYC 7742 CPU baseline on the YFCC100M dataset
PIM achieves 3.19x speedup over an NVIDIA A100 GPU running mini-batch SGD for SVM training on YFCC100M
ADMM reduces communication overhead significantly, achieving 39.79x speedup over MA-SGD for Logistic Regression on the CPU baseline

Breakthrough Assessment

7/10

First comprehensive evaluation of distributed SGD on real PIM hardware. Provides critical insights into algorithmic suitability and scalability limits, though limited to linear models and specific PIM hardware.

⚙️ Technical Details

Problem Definition

Setting: Distributed convex optimization for binary classification using linear models

Inputs: Large-scale training dataset D = {(x_i, y_i)} partitioned across PIM nodes

Outputs: Optimized model parameters w* minimizing loss L(w)

Pipeline Flow

Host CPU (Partitions Data)
PIM Nodes/DPUs (Local Training)
Host CPU (Synchronization/Aggregation)
PIM Nodes/DPUs (Model Update)
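The four stages above can be sketched as one round of model averaging in plain Python. This is only an illustration under stated assumptions: workers are simulated as function calls rather than real DPUs, the model is a hinge-loss SVM with L2 regularization, and every name and hyperparameter value (`local_training`, `ma_sgd_round`, `lr`, `lam`) is hypothetical.

```python
import random

def hinge_sgd_step(w, x, y, lr, lam):
    """One SGD step on the L2-regularized hinge loss; y is +1 or -1."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    # Subgradient of max(0, 1 - margin) plus the gradient of (lam/2) * ||w||^2.
    return [wi - lr * (lam * wi - (y * xi if margin < 1 else 0.0))
            for wi, xi in zip(w, x)]

def local_training(w, partition, lr=0.1, lam=0.01):
    """Stand-in for a DPU: run SGD over the local partition, starting from w."""
    w = list(w)
    for x, y in partition:
        w = hinge_sgd_step(w, x, y, lr, lam)
    return w

def ma_sgd_round(global_w, partitions):
    """Host view of one round: broadcast, local training, gather, average."""
    local_models = [local_training(global_w, p) for p in partitions]
    n = len(local_models)
    return [sum(ws) / n for ws in zip(*local_models)]

def accuracy(w, data):
    """Fraction of samples whose predicted sign matches the label."""
    return sum(1 for x, y in data
               if y * sum(wi * xi for wi, xi in zip(w, x)) > 0) / len(data)

# Toy linearly separable data: the label is the sign of the first feature.
rng = random.Random(0)
data = []
for _ in range(400):
    x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
    data.append((x, 1 if x[0] > 0 else -1))

# Host partitions the data across workers (round-robin here).
num_workers = 4
partitions = [data[i::num_workers] for i in range(num_workers)]

w = [0.0, 0.0]
for _ in range(5):
    w = ma_sgd_round(w, partitions)
```

Every round moves one model from the host to each worker and one back, so all traffic passes through the central node, matching the star topology described for the UPMEM system.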

System Modules

Host CPU

Parameter Server: partitions data, aggregates models/gradients, broadcasts global model

Model or implementation: 2x Intel Xeon Silver 4215 (Host)

UPMEM DPU

Compute Node: executes mini-batch SGD on local data partition

Model or implementation: UPMEM DPU (32-bit RISC, 350 MHz)

Novel Architectural Elements

Implementation of distributed SGD algorithms (MA-SGD, GA-SGD, ADMM) tailored for UPMEM's star-topology constraints
LUT-based approximation for sigmoid functions to bypass PIM hardware's lack of native transcendental instruction support
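A LUT-based sigmoid of the kind mentioned above might look like the sketch below. The table size, input range, and fixed-point scale are assumptions, not values taken from the paper; the table is built once with floating point (as a host could do) and then queried with integer operations only.

```python
import math

FRAC_BITS = 16            # assumed fixed-point scale (16 fractional bits)
ONE = 1 << FRAC_BITS      # fixed-point representation of 1.0
LUT_RANGE = 8             # table covers x in [0, 8); sigmoid saturates beyond
STEPS_PER_UNIT_SHIFT = 5  # 2**5 = 32 table entries per unit of x

# Built once with floating point, then stored as plain integers.
SIGMOID_LUT = [
    round(ONE / (1.0 + math.exp(-i / (1 << STEPS_PER_UNIT_SHIFT))))
    for i in range(LUT_RANGE << STEPS_PER_UNIT_SHIFT)
]

def sigmoid_fixed(x_fixed):
    """Integer-only sigmoid: input and output are both scaled by ONE."""
    if x_fixed < 0:
        # Symmetry sigmoid(-x) = 1 - sigmoid(x) halves the table size.
        return ONE - sigmoid_fixed(-x_fixed)
    # Drop the low bits of x to get the table index (truncation, no interpolation).
    index = x_fixed >> (FRAC_BITS - STEPS_PER_UNIT_SHIFT)
    if index >= len(SIGMOID_LUT):
        return ONE  # saturate for large inputs
    return SIGMOID_LUT[index]
```

With 32 entries per unit and no interpolation, the worst-case error is bounded by the maximum slope of the sigmoid (0.25) times the step size (1/32), or roughly 0.008. A finer table or linear interpolation would trade memory or instructions for accuracy.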

Modeling

Base Model: linear models, specifically Logistic Regression (LR) and Support Vector Machines (SVM)

Training Method: Distributed SGD (Stochastic Gradient Descent)

Objective Functions:

Purpose: Minimize classification error with regularization.

Formally: L(w) = (1/n) * sum(l(x_i, y_i, w)) + lambda * r(w)

Purpose: Logistic Regression Loss.

Formally: Binary Cross Entropy with L1 (ADMM) or L2 (SGD) regularization

Purpose: SVM Loss.

Formally: Hinge Loss with L2 regularization
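Written out, the objectives above take the following standard forms. Two conventions here are assumptions rather than details from the summary: labels are taken as y_i in {-1, +1} (under which binary cross entropy reduces to the logistic loss shown), and the L2 regularizer carries a factor of 1/2.

```latex
\begin{aligned}
L(w) &= \frac{1}{n}\sum_{i=1}^{n} \ell(x_i, y_i, w) + \lambda\, r(w) \\
\ell_{\mathrm{LR}}(x_i, y_i, w) &= \log\bigl(1 + \exp(-y_i\, w^{\top} x_i)\bigr) \\
\ell_{\mathrm{SVM}}(x_i, y_i, w) &= \max\bigl(0,\; 1 - y_i\, w^{\top} x_i\bigr) \\
r_{\mathrm{L1}}(w) &= \lVert w \rVert_1, \qquad r_{\mathrm{L2}}(w) = \tfrac{1}{2}\lVert w \rVert_2^2
\end{aligned}
```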

Training Data:

YFCC100M-HNfc6: 97M samples, 4096 dense features, 13.96GB (for 256 DPUs)
Criteo 1TB Click Logs: 4.37B samples, 1M sparse features, 8.05GB (for 256 DPUs)

Key Hyperparameters:

batch_size_YFCC: 8, 16, 32, 64 (MA-SGD/ADMM); 4K, 8K, 16K, 32K (GA-SGD)
batch_size_Criteo: 1K, 2K, 4K, 8K (MA-SGD/ADMM); 131K-1M (GA-SGD)
regularization: L2 for SVM/LR (SGD), L1 for LR (ADMM)
quantization: 32-bit fixed-point for data and models
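The 32-bit fixed-point setting can be illustrated with a small sketch. The split between integer and fractional bits is not given here, so `FRAC_BITS = 16` is an assumption, as are all function names. Python integers are unbounded, so the 32-bit range is enforced explicitly when quantizing.

```python
FRAC_BITS = 16                      # assumed number of fractional bits
SCALE = 1 << FRAC_BITS
INT32_MIN, INT32_MAX = -(1 << 31), (1 << 31) - 1

def to_fixed(value):
    """Quantize a float to a fixed-point integer, clamped to the int32 range."""
    q = round(value * SCALE)
    return max(INT32_MIN, min(INT32_MAX, q))

def to_float(q):
    """Convert a fixed-point integer back to a float."""
    return q / SCALE

def fixed_mul(a, b):
    """Multiply two fixed-point values; the raw product carries 2 * FRAC_BITS
    fractional bits, so shift right once to restore the scale."""
    return (a * b) >> FRAC_BITS

def fixed_dot(w, x):
    """Dot product with a wide accumulator and a single final shift, which
    loses less precision than shifting after every multiplication."""
    acc = 0
    for wi, xi in zip(w, x):
        acc += wi * xi
    return acc >> FRAC_BITS

w_q = [to_fixed(v) for v in [0.5, 0.25]]
x_q = [to_fixed(v) for v in [2.0, 4.0]]
result = to_float(fixed_dot(w_q, x_q))
```

Note that `>>` on a negative Python integer rounds toward negative infinity; real hardware or a C implementation may need an explicit rounding policy.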

Compute: Training performed on 2560 UPMEM DPUs (20 modules). Host: 2x Intel Xeon Silver 4215. Baseline: 2x AMD EPYC 7742 or 1x NVIDIA A100.

Comparison to Prior Work

vs. GPU/CPU: PIM offers higher aggregate bandwidth but restricted communication topology
vs. Prior PIM ML works: Evaluates SGD/ADMM on real hardware rather than GD or simulation
vs. TransPIM [not cited in paper]: Focuses on linear models and optimization algorithms rather than Transformers

Limitations

No native floating-point support; relies on quantization and LUTs
No direct inter-DPU communication; all synchronization must go through the host CPU
Strong scaling leads to accuracy degradation due to increased model staleness with more workers
Currently limited to linear models (LR, SVM); deep learning models not evaluated

Reproducibility

Code: https://github.com/CMU-SAFARI/PIM-Opt

The code is publicly available at the repository above. Hyperparameters and dataset preprocessing details are provided. UPMEM hardware is required for replication.

📊 Experiments & Results

Evaluation Setup

Distributed training of linear binary classifiers on large-scale datasets

Benchmarks:

YFCC100M-HNfc6 (Dense feature binary classification)
Criteo 1TB Click Logs (High-dimensional sparse binary classification)

Metrics:

Total Training Time (seconds)
Test Accuracy (%) / AUC Score
Statistical methodology: Not explicitly reported in the paper

Key Results

PIM outperforms CPU baselines significantly for specific dense workloads when using communication-efficient algorithms.

Benchmark	Metric	Baseline	This Paper	Δ
YFCC100M-HNfc6	Speedup vs CPU	1.0	1.94	+0.94x speedup
YFCC100M-HNfc6	Speedup vs GPU	1.0	3.19	+2.19x speedup
YFCC100M-HNfc6	Speedup (PIM internal)	1.0	3.19	+2.19x speedup
YFCC100M-HNfc6	Training Time (s)	3730.55	151.36	-3579.19

Strong scaling experiments reveal a trade-off between training speed and model accuracy due to staleness.

Benchmark	Metric	Baseline	This Paper	Δ
YFCC100M-HNfc6	Test Accuracy (%)	95.46	92.17	-3.29
YFCC100M-HNfc6	Speedup	1.0	7.43	+6.43x speedup

Experiment Figures

Comparison of Total Training Time vs Test Accuracy for different algorithms (MA-SGD, GA-SGD, ADMM) on PIM, CPU, and GPU

Strong scaling results (Time and Accuracy) as number of DPUs increases from 256 to 2048 with fixed dataset size

Main Takeaways

The UPMEM PIM system is a viable alternative to CPUs/GPUs for memory-bound training of small dense models, achieving up to 3.19x speedup over an NVIDIA A100 GPU.
Algorithm selection is critical: ADMM vastly outperforms standard SGD variants on PIM by mitigating the host-communication bottleneck.
PIM systems struggle with large sparse models (like Criteo) where communication overhead dominates, performing 2.43x slower than CPU baselines.
Scalability is limited by 'statistical efficiency': adding more PIM nodes speeds up iterations but degrades convergence/accuracy due to increased staleness.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Stochastic Gradient Descent (SGD) and its distributed variants
Basic computer architecture concepts (memory bandwidth, SIMD/SPMD)
Knowledge of Processing-In-Memory (PIM) concepts

Key Terms

PIM: Processing-In-Memory—computation performed inside or near memory units to reduce data movement

UPMEM: A commercial general-purpose PIM architecture integrating processors (DPUs) directly into DRAM modules

DPU: DRAM Processing Unit—the core processor inside UPMEM modules, possessing its own MRAM (main memory) and WRAM (scratchpad)

MA-SGD: Model Averaging SGD—workers update local models independently and periodically average them at a central server

GA-SGD: Gradient Averaging SGD—workers compute gradients on parts of a batch and synchronize gradients at every step

ADMM: Alternating Direction Method of Multipliers—an optimization algorithm that decomposes problems into sub-problems solved locally, requiring less frequent synchronization

SPMD: Single-Program Multiple-Data—a programming model where multiple threads execute the same program on different data items

MRAM: Main RAM, the 64MB DRAM bank exclusive to each DPU (in UPMEM terminology the acronym does not refer to magnetoresistive memory)

WRAM: Working RAM, the 64KB SRAM scratchpad within each DPU that holds operands; instructions reside in a separate instruction memory (IRAM)

LUT: Look-Up Table—a method to approximate complex functions (like sigmoid) using precomputed values, essential for DPUs lacking floating-point units