Performance of Small Language Model Pretraining on FABRIC: An Empirical Study

📝 Paper Summary

Distributed Training, Small Language Models (SLMs)

Pipeshard parallelism significantly outperforms data parallelism for pretraining small language models on geographically distributed academic GPU clusters by effectively masking high network latency.

Core Problem

Pretraining language models typically requires expensive, low-latency clusters; academic users relying on free, geo-distributed testbeds (like FABRIC) face high network latency that degrades standard distributed training performance.

Why it matters:

Academic researchers cannot afford the massive compute budget or low-latency interconnects (like NVLink) used by industry for LLM training
Standard data parallelism fails or becomes prohibitively slow when GPU nodes are separated by wide-area networks (10ms+ latency)
Domain-specific vector databases require custom embeddings from models pretrained on specialized datasets, necessitating accessible pretraining methods for SLMs

Concrete Example: When training GPT-2 on a cluster spanning Utah and Amsterdam (103ms latency), standard Data Parallelism takes 1,375 minutes for 20 epochs due to synchronization overhead, whereas the proposed Pipeshard approach finishes in 100 minutes.

Key Novelty

Empirical Strategy Selection for Geo-Distributed Pretraining

Demonstrates that Alpa's Pipeshard (combining intra-operator and pipeline parallelism) tolerates high network latency (10-100ms) far better than Data Parallelism or ZeRO
Proposes a heuristic algorithm to dynamically select the best parallelization strategy (Data vs. Pipeshard) based on cluster topology and measured throughput
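
This summary does not spell out the paper's heuristic, so the following is only a hypothetical sketch of how a throughput-based selection might look: run a short trial of each candidate strategy, ignore any that fail (for example with an out-of-memory error), and keep the one with the highest measured throughput. The `TrialResult` record and the `select_strategy` function are illustrative names, not taken from the paper or from Alpa.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TrialResult:
    """Outcome of a short benchmark run for one strategy (hypothetical record)."""
    strategy: str            # e.g. "data" or "pipeshard"
    tflops: Optional[float]  # None means the trial failed (e.g. out of memory)


def select_strategy(trials: List[TrialResult]) -> str:
    """Return the strategy with the highest measured throughput.

    Failed trials (tflops is None) are skipped. This is an illustrative rule,
    not the algorithm from the paper.
    """
    completed = [t for t in trials if t.tflops is not None]
    if not completed:
        raise RuntimeError("no strategy completed its trial run")
    return max(completed, key=lambda t: t.tflops).strategy


# Throughput figures reported in this summary for two of the clusters.
utah_gpn = [TrialResult("data", 0.88), TrialResult("pipeshard", 2.49)]
tacc_tacc = [TrialResult("data", 15.74), TrialResult("pipeshard", 12.17)]

print(select_strategy(utah_gpn))   # pipeshard
print(select_strategy(tacc_tacc))  # data
```

Measuring rather than predicting keeps the rule simple: the same function picks Pipeshard on the 20.2ms cluster and Data Parallelism on the single-site cluster without needing an explicit latency threshold.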

Architecture

Conceptual illustration of different parallelization techniques: Data, Intra-Operator, and Inter-Operator (Pipeline) parallelism

Evaluation Highlights

Pipeshard achieves a roughly 13.75x speedup (100 min vs 1,375 min) over Data Parallelism for GPT-2 Medium on a cross-continent cluster (103ms latency)
On a US-based distributed cluster (20ms latency), Pipeshard reaches 2.49 TFLOP/s compared to Data Parallelism's 0.88 TFLOP/s
Pipeshard successfully trains GPT-2 Large on heterogeneous hardware where Data and Shard parallelism fail due to Out-Of-Memory errors
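
As a quick check on the first two highlights, the ratios can be recomputed directly from the reported numbers. The helper name `speedup` is just an illustrative choice.

```python
def speedup(baseline: float, improved: float) -> float:
    """Ratio of a baseline measurement to an improved one."""
    return baseline / improved


# Execution time on the 103ms cluster: Data Parallelism vs Pipeshard (minutes).
time_speedup = speedup(1375, 100)

# Throughput on the 20ms cluster: Pipeshard vs Data Parallelism (TFLOP/s).
throughput_ratio = 2.49 / 0.88

print(f"{time_speedup:.2f}x")      # 13.75x
print(f"{throughput_ratio:.2f}x")  # 2.83x
```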

Breakthrough Assessment

5/10

A solid empirical study offering practical guidelines for academic pretraining on suboptimal hardware. While not an algorithmic breakthrough, it validates existing tools (Alpa) in a novel, resource-constrained environment.

⚙️ Technical Details

Problem Definition

Setting: Distributed pretraining of Transformer-based Small Language Models (SLMs) on commodity GPU clusters connected via high-latency TCP/IP networks

Inputs: Unlabeled text corpus (Wikipedia 20231101.ace)

Outputs: Pretrained Model Weights

Modeling

Base Model: GPT-2 Medium (345M params) and GPT-2 Large (774M params)

Training Method: Distributed Pretraining (Self-Supervised Causal Language Modeling)

Objective Functions:

Purpose: Predict the next token in a sequence.

Formally: Minimize negative log-likelihood of the next token given previous context.
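
To make the objective concrete, the loss for a token sequence x_1..x_T is the sum (or mean) over positions t of -log p(x_t | x_1..x_{t-1}). The sketch below computes the mean of that quantity from raw logits in plain Python; the toy vocabulary size, logits, and function name are made up for illustration and are not from the paper.

```python
import math
from typing import List, Sequence


def next_token_nll(logits_per_step: Sequence[Sequence[float]],
                   targets: Sequence[int]) -> float:
    """Mean negative log-likelihood of the target tokens under softmax(logits)."""
    total = 0.0
    for logits, target in zip(logits_per_step, targets):
        # Numerically stable log-sum-exp: subtract the max before exponentiating.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(v - m) for v in logits))
        # -log softmax(logits)[target] = log_z - logits[target]
        total += log_z - logits[target]
    return total / len(targets)


# Toy example: vocabulary of 3 tokens, 2 prediction steps.
logits: List[List[float]] = [[2.0, 0.5, -1.0], [0.0, 0.0, 0.0]]
targets = [0, 2]
loss = next_token_nll(logits, targets)
print(loss)
```

With uniform logits the per-token loss equals log(vocabulary size), which is a handy sanity check at the start of training.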

Training Data:

Wikipedia dataset (20231101.ace) from HuggingFace
Used for 20 epochs

Key Hyperparameters:

n_ctx: 1024
n_embd: 1024 (Medium) / 1280 (Large)
n_head: 16 (Medium) / 20 (Large)
n_layer: 24 (Medium) / 30 (Large)
epochs: 20

Compute: Clusters of 4 GPUs using NVIDIA RTX 6000 (24GB), T4 (16GB), or A30 (24GB). Training time ranges from ~30 mins to ~90 hours depending on latency and method.

Comparison to Prior Work

vs. Data Parallelism (PyTorch DDP): Pipeshard reduces collective communication across the slow inter-site links, making it viable for high-latency (>20ms) connections
vs. ZeRO2: ZeRO2 is more memory efficient for single-site/low-latency but scales poorly with high network latency compared to Pipeshard
vs. StellaTrain [not cited in paper]: StellaTrain also optimizes for WAN training but focuses on dynamic batching/compression, whereas this work evaluates Alpa's static plan generation on FABRIC
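
The memory advantage of ZeRO2 mentioned above can be estimated with the accounting commonly used for mixed-precision Adam: 2 bytes per parameter for fp16 weights, 2 bytes for fp16 gradients, and 12 bytes for fp32 optimizer state. The sketch below applies it to a 345M-parameter model on 4 GPUs. It assumes mixed-precision Adam and ignores activation memory; this summary does not state the precision or optimizer used in the study, so treat the numbers as illustrative only.

```python
def per_gpu_state_bytes(num_params: int, num_gpus: int, zero_stage: int = 0) -> float:
    """Approximate per-GPU bytes for weights, gradients, and optimizer state.

    Assumes mixed-precision Adam (2 + 2 + 12 bytes per parameter).
    Activations and temporary buffers are not counted.
    """
    params = 2.0 * num_params   # fp16 weights
    grads = 2.0 * num_params    # fp16 gradients
    optim = 12.0 * num_params   # fp32 master weights, momentum, variance
    if zero_stage >= 1:
        optim /= num_gpus       # stage 1 partitions optimizer state
    if zero_stage >= 2:
        grads /= num_gpus       # stage 2 also partitions gradients
    if zero_stage >= 3:
        params /= num_gpus      # stage 3 also partitions weights
    return params + grads + optim


n_params = 345_000_000  # GPT-2 Medium, as listed in this summary
plain_dp = per_gpu_state_bytes(n_params, num_gpus=4, zero_stage=0)
zero2 = per_gpu_state_bytes(n_params, num_gpus=4, zero_stage=2)

print(f"Data Parallelism: {plain_dp / 1e9:.2f} GB per GPU")  # 5.52 GB
print(f"ZeRO stage 2:     {zero2 / 1e9:.2f} GB per GPU")     # 1.90 GB
```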

Limitations

Pipeshard requires more GPU memory than ZeRO2, leading to failures on heterogeneous clusters with smaller GPUs (e.g., T4 mixed with RTX)
Collective communication in NCCL over FABRIC uses TCP/IP, which is significantly slower than NVLink
Study limited to GPT-2 family; scaling to modern Llama/Mistral architectures not explicitly tested
Network latency is the primary bottleneck analyzed; packet loss or jitter effects are not detailed

Reproducibility

Code: https://github.com/whatdhack/mini_llms

The code is publicly available at the repository above. The study uses the open-source Alpa and Ray libraries; the paper describes the specific Alpa setup (Python 3.8.10, CUDA 11.8). Using the FABRIC testbed requires an academic access request.

📊 Experiments & Results

Evaluation Setup

Pretraining GPT-2 models on 5 different GPU cluster configurations with varying inter-node network latencies (0.1ms to 103ms)

Benchmarks:

Wikipedia Pretraining (20 epochs) (Causal Language Modeling)

Metrics:

Training Performance (TFLOP/s)
Total Execution Time (minutes)
Statistical methodology: Not explicitly reported in the paper
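
For readers who want to relate TFLOP/s to wall-clock time, a widely used rule of thumb for dense Transformers is about 6 floating-point operations per parameter per training token (forward plus backward pass). The sketch below applies that approximation. How the paper itself measured TFLOP/s is not described in this summary, and the token count and duration here are hypothetical.

```python
def approx_tflops(num_params: float, tokens_processed: float, seconds: float) -> float:
    """Estimate achieved training throughput in TFLOP/s.

    Uses the ~6 * parameters * tokens approximation for total training FLOPs,
    which ignores attention-specific terms.
    """
    total_flops = 6.0 * num_params * tokens_processed
    return total_flops / seconds / 1e12


# Hypothetical run: 345M parameters, 10 million tokens processed in one hour.
estimate = approx_tflops(345e6, 1e7, 3600)
print(f"{estimate:.2f} TFLOP/s")  # 5.75 TFLOP/s
```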

Key Results

Experiments on geographically distributed clusters demonstrate Pipeshard's resilience to high network latency compared to Data Parallelism.
Benchmark	Metric	Data Parallelism (baseline)	Pipeshard	Δ
UTAH-GPN (20.2ms latency)	Training Performance (TFLOP/s)	0.88	2.49	+1.61
GAT-AMST (103ms latency)	Execution Time (min)	1375	100	-1275
Single-site benchmarks show that standard Data Parallelism remains superior when network latency is negligible.
Benchmark	Metric	Data Parallelism (baseline)	Pipeshard	Δ
TACC-TACC (0.1ms latency)	Training Performance (TFLOP/s)	15.74	12.17	-3.57

Experiment Figures

Bar charts comparing execution time and TFLOP/s for GPT-2 models on the UTAH-GPN cluster (20ms latency)

Performance on the highest latency cluster (GAT-AMST, 103ms)

Main Takeaways

Pipeshard (combining pipeline and shard parallelism) is the fastest strategy tested for training across geographically distributed nodes with high latency (20ms and above); Data Parallelism still completes at 103ms latency but takes 1,375 minutes versus 100
Standard Data Parallelism and ZeRO degrade rapidly as network latency increases because every training step requires gradient synchronization (all-reduce) across all workers
In low-latency (single-site) environments, simple Data Parallelism typically outperforms complex pipeline strategies for SLMs
Hardware heterogeneity (mixing GPU types) causes memory allocation failures for Pipeshard, whereas ZeRO2 is more robust to memory constraints
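
A rough way to see why latency hurts Data Parallelism is to count sequential network hops. Data Parallelism synchronizes gradients with an all-reduce collective, and a ring all-reduce over n workers needs 2(n - 1) sequential communication steps, each paying the link latency at least once. The sketch below computes only that latency term for the three cluster latencies quoted in this summary. It assumes a ring algorithm and that every hop crosses the slow link, which overstates the cost when some GPUs share a node; NCCL may choose a different algorithm, and bandwidth is ignored.

```python
def ring_allreduce_latency_s(num_workers: int, link_latency_ms: float) -> float:
    """Latency-only cost of one ring all-reduce, in seconds.

    A ring all-reduce performs (n - 1) reduce-scatter steps followed by
    (n - 1) all-gather steps, so 2 * (n - 1) sequential hops in total.
    """
    steps = 2 * (num_workers - 1)
    return steps * link_latency_ms / 1000.0


clusters = [("TACC-TACC", 0.1), ("UTAH-GPN", 20.2), ("GAT-AMST", 103.0)]
for name, latency_ms in clusters:
    cost = ring_allreduce_latency_s(4, latency_ms)
    print(f"{name}: {cost:.4f} s per all-reduce (latency term only)")
```

Under these assumptions a single all-reduce across 4 workers at 103ms costs roughly 0.6 seconds before any data moves, and a training step typically issues several such calls, one per gradient bucket.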

📚 Prerequisite Knowledge

Prerequisites

Distributed Deep Learning concepts (Data vs. Model Parallelism)
Transformer architecture basics
Network latency impact on synchronization

Key Terms

FABRIC: An NSF-funded nationwide research infrastructure offering programmable compute and networking for academic users

SLM: Small Language Model—models with fewer parameters (e.g., GPT-2, 125M-700M params) compared to LLMs, suitable for academic budgets

Pipeshard: A parallelism strategy from Alpa that combines pipeline parallelism (inter-operator) with shard parallelism (intra-operator) to optimize communication

Data Parallelism: A training strategy where the model is replicated on every GPU and gradients are synchronized (averaged) after each step

ZeRO: Zero Redundancy Optimizer—a method to reduce memory usage in data parallelism by partitioning optimizer states across GPUs

Shard Parallelism: Intra-operator parallelism where individual tensors/operators are partitioned across devices (similar to Megatron-LM tensor parallelism)

TFLOP/s: Tera Floating Point Operations Per Second—a measure of computer performance/training throughput

L2STS: Layer 2 Site-to-Site Connection Service—a FABRIC network service connecting VMs between different geographic sites

L2Bridge: Layer 2 Bridge Service—a FABRIC network service connecting VMs within a single site

Microbatches: Small chunks of a training batch used in pipeline parallelism to overlap computation and communication
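
To illustrate the microbatch idea, the sketch below splits a batch into equal microbatches and estimates the idle ("bubble") fraction of a simple synchronous pipeline schedule, which for p stages and m microbatches is (p - 1) / (m + p - 1). It assumes every stage takes the same time per microbatch and ignores communication delay; the schedule Alpa actually generates may differ, and the function names are illustrative.

```python
from typing import List, Sequence


def split_microbatches(batch: Sequence[int], num_microbatches: int) -> List[List[int]]:
    """Split a batch into equally sized microbatches."""
    if len(batch) % num_microbatches != 0:
        raise ValueError("batch size must be divisible by num_microbatches")
    size = len(batch) // num_microbatches
    return [list(batch[i * size:(i + 1) * size]) for i in range(num_microbatches)]


def bubble_fraction(num_stages: int, num_microbatches: int) -> float:
    """Fraction of pipeline time spent idle, assuming equal-cost stages."""
    return (num_stages - 1) / (num_microbatches + num_stages - 1)


batch = list(range(8))
print(split_microbatches(batch, 4))  # [[0, 1], [2, 3], [4, 5], [6, 7]]
print(bubble_fraction(2, 1))         # 0.5
print(bubble_fraction(2, 16))        # about 0.059
```

More microbatches shrink the idle fraction, which is one reason pipeline stages can keep computing while activations are still in flight over a slow link.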