MM-SP: Multi-Modal Sequence Parallelism, a distributed training design for vision-language models (VLMs) that balances uneven image and text token loads across GPUs and accounts for the bandwidth gap between intra-node and inter-node links
Ring-style SP: A sequence parallelism method in which each GPU holds one chunk of the sequence and passes key/value blocks to its neighbor in a ring topology until every chunk has attended to the full sequence
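DeepSpeed-Ulysses: A sequence parallelism method that shards the input along the sequence dimension, then uses All-to-All communication to re-partition along the attention head dimension so that each GPU computes full-sequence attention for a subset of heads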
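RoPE: Rotary Position Embeddings, a method for encoding positional information in Transformers by rotating query and key vectors through position-dependent angles, so that attention scores depend on relative position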
Modality heterogeneity: The workload imbalance caused by different processing costs and token counts for visual inputs versus text inputs
Networking heterogeneity: The significant difference in bandwidth between intra-node connections (e.g., NVLink) and inter-node connections (e.g., InfiniBand)
SFT: Supervised Fine-Tuning, further training of a pre-trained model on labeled instruction-following data
Needle-in-a-Haystack: An evaluation task where a model must retrieve a specific piece of information hidden inside a very long context window
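
As an illustration of the DeepSpeed-Ulysses entry above, the following is a minimal single-process sketch in NumPy of the layout switch that All-to-All performs. It is not the DeepSpeed implementation: the "ranks" are simulated as list entries, and the helper names (`all_to_all_seq_to_head`, `all_to_all_head_to_seq`, `ulysses_attention`) are hypothetical. It assumes the number of heads and the sequence length are both divisible by the number of ranks.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # q, k, v: [seq, heads, dim] -> output: [seq, heads, dim]
    scores = np.einsum("qhd,khd->hqk", q, k) / np.sqrt(q.shape[-1])
    return np.einsum("hqk,khd->qhd", softmax(scores), v)

def all_to_all_seq_to_head(shards):
    # Simulated All-to-All (hypothetical helper).
    # In:  P arrays of shape [seq/P, heads, dim] (sequence-sharded).
    # Out: P arrays of shape [seq, heads/P, dim] (head-sharded).
    p = len(shards)
    return np.split(np.concatenate(shards, axis=0), p, axis=1)

def all_to_all_head_to_seq(shards):
    # Inverse layout switch: head-sharded back to sequence-sharded.
    p = len(shards)
    return np.split(np.concatenate(shards, axis=1), p, axis=0)

def ulysses_attention(q, k, v, p):
    # Each simulated rank starts with a contiguous slice of the sequence.
    q_s, k_s, v_s = (np.split(t, p, axis=0) for t in (q, k, v))
    # Switch layout so each rank sees the full sequence for heads/P heads.
    q_h, k_h, v_h = (all_to_all_seq_to_head(t) for t in (q_s, k_s, v_s))
    # Attention is independent per head, so each rank computes locally.
    out_h = [attention(q_h[r], k_h[r], v_h[r]) for r in range(p)]
    # Switch back to the sequence-sharded layout and gather for comparison.
    return np.concatenate(all_to_all_head_to_seq(out_h), axis=0)

rng = np.random.default_rng(0)
seq, heads, dim, p = 16, 8, 4, 4
q, k, v = (rng.standard_normal((seq, heads, dim)) for _ in range(3))

# Largest absolute difference between the sharded and unsharded results.
max_err = float(np.abs(ulysses_attention(q, k, v, p) - attention(q, k, v)).max())
print(max_err)
```

Because attention heads do not interact, the head-sharded computation should match unsharded attention up to floating-point rounding. The divisibility requirement on the head count is the usual constraint of this style of sequence parallelism.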