Coop: Memory is not a Commodity

📝 Paper Summary

Memory management in Deep Learning
Tensor Rematerialization (Gradient Checkpointing)

Coop co-optimizes tensor rematerialization and memory allocation by using a sliding window to evict contiguous memory blocks and allocating tensors strategically to minimize fragmentation.

Core Problem

Existing rematerialization methods assume all free memory is identical, leading to the eviction of discontiguous tensors that create fragmentation rather than usable contiguous blocks.

Why it matters:

Fragmentation prevents training large models even when total free memory is theoretically sufficient
Inefficient eviction strategies increase compute overhead: evicted tensors must later be recomputed even when freeing them did not help satisfy the new allocation
Current heuristics penalize evicting sequentially produced tensors, which inadvertently worsens fragmentation because such tensors are often adjacent in memory

Concrete Example: In a CNN, DTR might evict two non-adjacent activation tensors (x0 and x2) to free 100MB in total. Because the two holes are separated by x1, the allocator still cannot fit a new 100MB tensor, so DTR is forced to evict x1 as well, and part of the recomputation it incurs is wasted.
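The example can be reproduced with a toy memory layout. This is only an illustrative sketch: the individual sizes (50MB each for x0 and x2, 30MB for x1) and the helper names `total_free` and `largest_free_run` are assumptions, not values or functions from the paper.

```python
from typing import List, Tuple

# Each entry is (name, size_mb, is_free), listed in address order.
Layout = List[Tuple[str, int, bool]]

def total_free(layout: Layout) -> int:
    """Sum of all free memory, regardless of where it sits."""
    return sum(size for _, size, free in layout if free)

def largest_free_run(layout: Layout) -> int:
    """Size of the largest run of adjacent free blocks."""
    best = run = 0
    for _, size, free in layout:
        run = run + size if free else 0
        best = max(best, run)
    return best

# Assumed sizes: x0 and x2 (already evicted) are 50MB each, x1 is 30MB.
layout = [("x0", 50, True), ("x1", 30, False), ("x2", 50, True)]
request_mb = 100

fits_by_total = total_free(layout) >= request_mb            # enough bytes overall
fits_contiguously = largest_free_run(layout) >= request_mb  # enough adjacent bytes

# Evicting x1 as well merges the two holes into one usable region.
layout_after = [(name, size, True) for name, size, _ in layout]

print(fits_by_total, fits_contiguously, largest_free_run(layout_after))
```

The gap between `total_free` and `largest_free_run` is exactly what an address-unaware eviction policy cannot see.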

Key Novelty

Co-optimization of Tensor Allocation and Rematerialization (Coop)

Sliding Window Eviction: Instead of picking individual tensors, search for a contiguous window of tensors to evict that satisfies the size requirement with minimal recompute cost
Cheap Tensor Partitioning: Allocator groups tensors by compute cost (cheap vs. expensive) at opposite ends of memory to create large, low-cost contiguous regions for potential eviction
Recomputable In-Place: Allows in-place mutations on unevictable tensors (like parameters) without copying, preserving the contiguous memory layout

Architecture

Contrast between DTR's fragmented eviction and Coop's structured approach. Shows memory layout with 'cheap', 'expensive', and 'unevictable' tensors.

Evaluation Highlights

Reduces memory fragmentation rate to <5% across 8 representative DNNs (e.g., GPT-3 2.7B, Swin-T), compared to much higher rates in baselines
Achieves up to 2x memory savings compared to standard training and reaches a lower minimum memory budget than DTR and DTE
Lowers compute overhead significantly; e.g., 11% overhead for BERT Large at 50% memory ratio vs. 41% for DTR

Breakthrough Assessment

8/10

Strong conceptual advance by linking memory layout to checkpointing. Addresses a fundamental inefficiency in prior work (fragmentation) with a practical, unified system solution.

⚙️ Technical Details

Problem Definition

Setting: Training dynamic deep neural networks under a fixed GPU memory budget

Inputs: Stream of tensor allocation and deallocation requests during forward/backward passes

Outputs: Decisions on which tensors to evict and where to allocate new tensors in the memory pool

Pipeline Flow

Interceptor receives allocation request
Check if in-place: if yes, use Recomputable In-Place
If standard allocation: try to find free block
If no block: Sliding Window Search finds optimal contiguous tensors to evict
Evict tensors -> Allocate new tensor using Cheap Tensor Partitioning strategy
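The control flow of these steps can be sketched as a small dispatcher. Everything here is hypothetical scaffolding: `Request`, `handle`, and the step labels are invented for illustration, free memory is modelled as a bare list of block sizes, and the search and eviction steps are stubbed out by assuming they free one block of exactly the requested size.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Request:
    size: int
    in_place: bool = False

def handle(req: Request, free_blocks: List[int]) -> List[str]:
    """Return the pipeline steps taken for one allocation request.

    `free_blocks` holds the sizes of the currently free contiguous blocks
    and is updated in place.
    """
    steps = ["intercept"]
    if req.in_place:
        # In-place ops reuse the existing buffer; no new block is needed.
        return steps + ["recomputable_in_place"]

    # Standard allocation: first free block that is large enough.
    fit = next((b for b in free_blocks if b >= req.size), None)
    if fit is None:
        # Stub: pretend the search and eviction freed a block of req.size.
        steps += ["sliding_window_search", "evict"]
        fit = req.size
        free_blocks.append(fit)

    free_blocks.remove(fit)
    if fit > req.size:
        free_blocks.append(fit - req.size)  # keep the unused remainder
    steps.append("allocate_with_partitioning")
    return steps

print(handle(Request(10), [4, 4]))
```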

System Modules

Sliding Window Search

Identifies the optimal contiguous set of tensors to evict to satisfy a specific size request
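A minimal two-pointer sketch of such a search is shown below. It is not the paper's implementation: it assumes that the cost of a window is the sum of non-negative per-tensor heuristic values, that already-free blocks cost zero, and that unevictable tensors simply break a window. The `Block` type and `find_window` name are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Block:
    size: int               # bytes occupied by this block
    cost: float = 0.0       # eviction heuristic value; 0 for a free block
    evictable: bool = True  # False for pinned tensors such as parameters

def find_window(blocks: List[Block], request: int) -> Optional[Tuple[int, int]]:
    """Return inclusive (start, end) indices of the cheapest contiguous window
    whose total size is at least `request`, or None if no window qualifies.
    `blocks` must be in address order and costs must be non-negative."""
    if request <= 0:
        raise ValueError("request must be positive")
    best, best_cost = None, math.inf
    left, size, cost = 0, 0, 0.0
    for right, blk in enumerate(blocks):
        if not blk.evictable:
            # A pinned block cannot be part of any window: restart after it.
            left, size, cost = right + 1, 0, 0.0
            continue
        size += blk.size
        cost += blk.cost
        # Drop blocks from the left while the window stays large enough;
        # with non-negative costs this never makes the window more expensive.
        while size - blocks[left].size >= request:
            size -= blocks[left].size
            cost -= blocks[left].cost
            left += 1
        if size >= request and cost < best_cost:
            best, best_cost = (left, right), cost
    return best

memory = [
    Block(50, 5.0), Block(30, 1.0), Block(50, 1.0),
    Block(40, evictable=False),
    Block(60, 0.0), Block(50, 9.0),
]
print(find_window(memory, 80))
```

Each block enters and leaves the window at most once, so the scan is linear in the number of blocks.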

Cheap Tensor Partitioning

Directs allocation to specific ends of the memory pool based on recomputation cost to cluster similar tensors
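The idea of allocating from opposite ends can be illustrated with a toy two-ended bump allocator. This is a sketch under strong assumptions: it never frees memory, the class name `TwoEndedPool` is hypothetical, and which end receives cheap tensors (the low end here) is an arbitrary choice made for the example.

```python
class TwoEndedPool:
    """Toy pool: cheap tensors grow up from offset 0, expensive tensors
    grow down from the top, so each class stays contiguous."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.low = 0            # next free offset at the low end
        self.high = capacity    # one past the last free offset at the high end

    def allocate(self, size: int, cheap: bool) -> int:
        """Return the start offset of the newly allocated block."""
        if size <= 0:
            raise ValueError("size must be positive")
        if self.high - self.low < size:
            raise MemoryError("pool exhausted")
        if cheap:
            offset = self.low
            self.low += size
        else:
            self.high -= size
            offset = self.high
        return offset

pool = TwoEndedPool(100)
offsets = [
    pool.allocate(10, cheap=True),
    pool.allocate(20, cheap=False),
    pool.allocate(5, cheap=True),
    pool.allocate(30, cheap=False),
]
print(offsets)
```

In this layout the cheap tensors occupy one unbroken range at the low end, which is the kind of region a contiguous eviction search can free at low recomputation cost.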

Recomputable In-Place

Handles in-place operations without allocating new memory or breaking contiguous blocks

Novel Architectural Elements

Integration of memory address information into the eviction search (Sliding Window)
Bi-directional memory pool allocation based on compute cost (Cheap Tensor Partitioning)
Safe in-place mutation logic that preserves recomputability without copying (Recomputable In-Place)

Modeling

Base Model: Implemented in OneFlow framework (PyTorch-aligned APIs)

Training Method: Online dynamic tensor rematerialization during training

Key Hyperparameters:

cost_density_threshold: Used to distinguish linear/sub-linear ops (cheap) from super-linear ops (expensive)
sliding_window_heuristic: h(t) = projected_cost(t) / staleness(t)
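The two quantities above reduce to simple ratios. The sketch below assumes that a tensor counts as cheap when its cost density (compute cost divided by memory size) is at or below the threshold, and that a lower h(t) marks a better eviction candidate; the function names and the sample values are illustrative, not taken from the paper.

```python
def cost_density(compute_cost: float, size_bytes: int) -> float:
    """Compute cost per byte of memory held by the tensor."""
    if size_bytes <= 0:
        raise ValueError("size_bytes must be positive")
    return compute_cost / size_bytes

def is_cheap(compute_cost: float, size_bytes: int,
             cost_density_threshold: float) -> bool:
    # Assumption: at or below the threshold counts as cheap.
    return cost_density(compute_cost, size_bytes) <= cost_density_threshold

def h(projected_cost: float, staleness: float) -> float:
    """h(t) = projected_cost(t) / staleness(t)."""
    if staleness <= 0:
        raise ValueError("staleness must be positive")
    return projected_cost / staleness

# Hypothetical candidates: name -> (projected_cost, staleness).
candidates = {"relu_out": (1.0, 8.0), "matmul_out": (12.0, 2.0)}
best = min(candidates, key=lambda name: h(*candidates[name]))
print(best)
```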

Compute: Tested on NVIDIA A100 (80GB) and RTX 2080 Ti GPUs

Comparison to Prior Work

vs. DTR: DTR ranks eviction candidates with a staleness/cost/size heuristic that ignores memory addresses; Coop also takes addresses into account so that the evicted tensors form one contiguous block.
vs. DTE: DTE considers adjacency but cannot guarantee a single contiguous block; Coop guarantees it via sliding window.
vs. Checkmate: Checkmate is offline and requires static graphs; Coop is online and handles dynamic graphs. Checkmate is discussed in the paper but is not used as a direct baseline.

Limitations

Cannot be used simultaneously with CUDA's built-in stream-ordered memory allocator (requires own memory pool)
Online method may not reach the theoretical optimality of offline solvers like Checkmate
Search latency on very small models (like ResNet-50) can be slightly higher than simple heuristics due to window traversal, though still negligible

Reproducibility

Implemented in OneFlow. Baselines (DTR, DTE, SAR) re-implemented in OneFlow for fair comparison. Code URL not explicitly provided in the text, though OneFlow is open source.

📊 Experiments & Results

Evaluation Setup

Training large DNNs on single GPUs with limited memory budgets

Benchmarks:

GPT-3 Style (2.7B) (Language Modeling)
BERT Large (Language Modeling)
Swin-Transformer (Image Classification)
ResNet-50 / Inception V3 (Image Classification)
BiLSTM / U-Net / SPOS (Sequence Modeling, Segmentation, NAS)

Metrics:

Compute overhead (extra time for recomputation)
Minimum memory budget (lowest memory ratio required to train)
Search latency (time to find tensors to evict)
Memory fragmentation rate
Statistical methodology: Not explicitly reported in the paper
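For concreteness, the first two metrics can be written as ratios. The definitions below are assumptions made for illustration: compute overhead is taken relative to the iteration time without rematerialization, and memory ratio is the budget divided by the peak memory of unconstrained training. The timing and memory numbers are made up.

```python
def compute_overhead(baseline_time: float, remat_time: float) -> float:
    """Extra time spent relative to training without rematerialization."""
    if baseline_time <= 0:
        raise ValueError("baseline_time must be positive")
    return (remat_time - baseline_time) / baseline_time

def memory_ratio(budget_bytes: float, peak_bytes_without_remat: float) -> float:
    """Fraction of the unconstrained peak memory that the budget allows."""
    if peak_bytes_without_remat <= 0:
        raise ValueError("peak_bytes_without_remat must be positive")
    return budget_bytes / peak_bytes_without_remat

# Made-up numbers: a 100 ms iteration that takes 111 ms under a 40 GB budget
# when the unconstrained peak would be 80 GB.
overhead = compute_overhead(100.0, 111.0)
ratio = memory_ratio(40e9, 80e9)
print(f"overhead={overhead:.0%}, memory ratio={ratio:.0%}")
```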

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Compute overhead comparisons show Coop consistently requires less recomputation time than baselines, especially at strict memory budgets.
BERT Large	Compute overhead at 50% memory ratio (%)	41 (DTR)	11	-30
BERT Large	Compute overhead at 50% memory ratio (%)	29	11	-18
Fragmentation analysis demonstrates Coop's ability to maintain a clean memory layout compared to baselines.
All 8 DNNs	Fragmentation rate (%)	Variable (high)	<5	Variable
Search latency results show Coop is faster and more stable, avoiding the multiple-search loops of prior methods.
BiLSTM	Max search latency (µs)	10000	100	-9900

Experiment Figures

Compute overhead vs. Memory Ratio for 8 models. Curves show overhead increasing as memory budget decreases.

Average Memory Fragmentation Rate vs. Memory Ratio.

Main Takeaways

Coop enables training at lower memory budgets (e.g., 25% lower minimum budget for GPT-3 2.7B vs DTR).
Memory fragmentation is the primary bottleneck for DTR/DTE; Coop effectively solves this via contiguous eviction.
Search latency is consistent across models because Coop finds the optimal set in a single O(N) pass over the tensors, whereas DTR may have to repeat its search many times before a large enough contiguous block becomes free.
Cheap Tensor Partitioning and Recomputable In-Place are critical components that prepare the memory layout for efficient eviction.

📚 Prerequisite Knowledge

Prerequisites

Tensor rematerialization (activation checkpointing)
Memory allocators (free lists, fragmentation)
Dynamic Computational Graphs (DCGs)
In-place operations in DL frameworks

Key Terms

Tensor rematerialization: Saving memory by freeing intermediate activations during the forward pass and recomputing them when they are needed in the backward pass

Memory fragmentation: When free memory exists but is divided into small, non-contiguous blocks that cannot satisfy a large allocation request

DTR: Dynamic Tensor Rematerialization, a baseline method that greedily evicts tensors based on a heuristic (staleness, cost, size) without considering memory addresses

DTE: Dynamic Tensor Evicting, a method extending DTR that considers adjacent free blocks but still relies on heuristics that may fail to produce contiguous chunks

Sliding window algorithm: Coop's method to find a contiguous sequence of tensors to evict by moving a window over the memory address list

Cost density: A metric (compute cost / memory size) used to classify tensors as 'cheap' or 'expensive' for partitioning

Recomputable in-place: A mechanism allowing in-place mutation of tensors while maintaining the ability to recompute their original values, avoiding copy-on-write overhead