FSDP: Fully Sharded Data Parallel—a training method where model parameters, gradients, and optimizer states are sharded across devices to save memory
Collective Communication: Communication patterns involving all devices simultaneously (e.g., All-Reduce, All-Gather) to synchronize data
Parameter Server (PS): A distributed architecture where dedicated servers hold model parameters while workers pull current parameters and push gradient updates; known for tolerating stragglers well, since workers need not proceed in lockstep
ODC: On-Demand Communication—the proposed method replacing collectives with point-to-point operations to decouple device progress
Microbatch: A subset of a minibatch processed in one forward/backward pass to fit in GPU memory; gradients are accumulated across microbatches
SFT: Supervised Fine-Tuning—training a pre-trained model on labeled examples
RL: Reinforcement Learning—training models via rewards rather than fixed targets
GRPO: Group Relative Policy Optimization—an RL algorithm used for reasoning tasks in this paper
RDMA: Remote Direct Memory Access—technology allowing one machine to read or write another machine's memory directly, bypassing the remote host's CPU and operating system
NVSHMEM: NVIDIA's library for high-performance, symmetric memory communication between GPUs
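To make the collective-communication entry concrete, here is a minimal sketch of an All-Reduce (sum) simulated in plain Python. This is purely illustrative: real systems issue the collective through a library such as NCCL or MPI, and the function name here is hypothetical.

```python
# Illustrative simulation of an All-Reduce (sum) collective.
# In a real system each "device" runs concurrently and the reduction
# happens over the interconnect; here we model it as a plain function.

def all_reduce_sum(per_device_grads):
    """Every device ends up with the element-wise sum of all devices' gradients."""
    n_elems = len(per_device_grads[0])
    reduced = [sum(dev[i] for dev in per_device_grads) for i in range(n_elems)]
    # After the collective completes, every participant holds an identical copy.
    return [list(reduced) for _ in per_device_grads]

grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # 3 devices, 2 parameters each
synced = all_reduce_sum(grads)                # every device now sees [9.0, 12.0]
```

The key property this models is the one the glossary entry names: all devices participate simultaneously, and none can finish the step until every device has contributed.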
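The microbatch entry can likewise be sketched in a framework-free way: gradients from each microbatch are accumulated, then a single optimizer step is taken so the update matches the full-minibatch gradient. The function and the toy gradient values below are hypothetical, standing in for a real forward/backward pass.

```python
# Sketch of gradient accumulation across microbatches (illustrative).
# Each microbatch's backward pass yields a gradient vector; we sum them,
# average, and apply one SGD step for the whole minibatch.

def sgd_with_accumulation(params, microbatch_grads, lr=0.1):
    """Accumulate gradients over microbatches, then take one optimizer step."""
    accum = [0.0] * len(params)
    for g in microbatch_grads:
        for i, gi in enumerate(g):
            accum[i] += gi
    n = len(microbatch_grads)
    # Averaging makes the step equivalent to one pass over the full minibatch.
    return [p - lr * a / n for p, a in zip(params, accum)]

params = [1.0, 2.0]
micro_grads = [[0.5, 0.5], [1.5, 1.5]]  # toy gradients from two microbatches
new_params = sgd_with_accumulation(params, micro_grads)
```

Splitting the minibatch this way trades extra forward/backward passes for lower peak activation memory, which is why microbatching pairs naturally with memory-saving schemes like FSDP.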