RLHF: Reinforcement Learning from Human Feedback—aligning LLMs to human values using preference data
PPO: Proximal Policy Optimization—the clipped policy-gradient algorithm most commonly used to fine-tune LLMs in RLHF; it limits each update by clipping the probability ratio between the new and old policy
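To make the clipping concrete, here is a minimal, hypothetical sketch of PPO's per-token clipped surrogate loss in plain Python; real RLHF stacks compute this over token log-probabilities with tensor libraries, and the function name and inputs here are illustrative, not from any particular framework.

```python
import math

def ppo_clip_loss(logp_new, logp_old, advantage, eps=0.2):
    """Per-token PPO clipped surrogate loss (to be minimized)."""
    ratio = math.exp(logp_new - logp_old)           # pi_new / pi_old
    unclipped = ratio * advantage
    clipped = max(min(ratio, 1 + eps), 1 - eps) * advantage
    return -min(unclipped, clipped)                 # pessimistic (lower) bound

# If the new policy over-weights a high-advantage token, clipping caps
# the incentive: ratio ~ 1.65 here, but the objective is held at 1.2 * A.
loss = ppo_clip_loss(logp_new=-0.5, logp_old=-1.0, advantage=2.0)
```

The `min` over the clipped and unclipped terms is what keeps each policy update small without an explicit KL constraint.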
3D Parallelism: Combining Data Parallelism (DP), Tensor Parallelism (TP), and Pipeline Parallelism (PP) to train massive models
ZeRO: Zero Redundancy Optimizer—a method to reduce memory footprint by sharding optimizer states, gradients, and parameters across GPUs
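The memory savings from sharding can be sketched with back-of-envelope arithmetic. The stage numbering and byte counts below assume mixed-precision Adam (2 B fp16 params, 2 B fp16 grads, 12 B fp32 optimizer states per parameter); the helper function is illustrative, not part of any library.

```python
def zero_mem_per_gpu_gb(n_params, n_gpus, stage):
    """Approximate per-GPU training memory (GB) under ZeRO stages 0-3."""
    params = 2 * n_params    # fp16 parameters
    grads = 2 * n_params     # fp16 gradients
    opt = 12 * n_params      # fp32 master copy + Adam momentum + variance
    if stage >= 1: opt /= n_gpus      # ZeRO-1: shard optimizer states
    if stage >= 2: grads /= n_gpus    # ZeRO-2: also shard gradients
    if stage >= 3: params /= n_gpus   # ZeRO-3: also shard parameters
    return (params + grads + opt) / 1e9

# A 7B-parameter model on 8 GPUs drops from ~112 GB per GPU (no ZeRO)
# to ~14 GB per GPU at stage 3.
per_gpu = [zero_mem_per_gpu_gb(7e9, 8, s) for s in range(4)]
```

Each stage shards one more of the three state tensors, which is why stage 3 divides almost the entire footprint by the GPU count.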
Actor: The LLM being trained to generate responses
Critic: A value model, typically initialized from an LLM, that estimates the expected return of the Actor's responses to guide its updates
Resharding: Changing the distribution of model parameters across GPUs (e.g., switching from TP=4 to TP=8) between computation stages
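The core resharding operation can be sketched with NumPy: gather the existing shards along the partitioned dimension, then re-split for the new layout. This is a conceptual sketch only (real systems do this with collective communication over GPU memory, not host arrays), and `reshard` is a hypothetical helper.

```python
import numpy as np

def reshard(shards, new_tp, axis=0):
    """Re-partition a TP-sharded weight for a new TP degree."""
    full = np.concatenate(shards, axis=axis)    # all-gather the full weight
    return np.split(full, new_tp, axis=axis)    # re-split for the new layout

weight = np.arange(32, dtype=np.float32).reshape(8, 4)
tp4 = np.split(weight, 4, axis=0)    # TP=4: four shards of shape (2, 4)
tp8 = reshard(tp4, new_tp=8)         # TP=8: eight shards of shape (1, 4)
```

Going from TP=4 to TP=8 here halves each shard, which is exactly the kind of layout change needed when, e.g., generation and training stages prefer different parallelism degrees.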
Ray: A unified framework for scaling AI and Python applications, used here as the backend for the single-controller
Multi-controller: A paradigm where each GPU worker runs its own control loop, common in PyTorch distributed training
Single-controller: A paradigm where a central process dispatches tasks to workers, offering global visibility at the cost of potential dispatch overhead
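The single-controller pattern can be sketched with only the standard library: one driver holds the global view and assigns every task, while workers just execute what they are given. This is a hypothetical stand-in (a real system would dispatch to remote GPU workers via Ray actors rather than local threads).

```python
from concurrent.futures import ThreadPoolExecutor

def worker_task(worker_id, prompt):
    # stand-in for an RPC to a remote GPU worker
    return f"worker {worker_id} handled {prompt!r}"

prompts = ["a", "b", "c", "d"]
with ThreadPoolExecutor(max_workers=4) as pool:
    # The controller sees every task and every worker, so it can
    # reorder, rebalance, or reshard between stages with a global view,
    # at the cost of round-tripping dispatch through one process.
    futures = [pool.submit(worker_task, i, p) for i, p in enumerate(prompts)]
    results = [f.result() for f in futures]
```

In the multi-controller paradigm, by contrast, each worker would run this loop itself with only a local view, which is cheaper but makes stage-level coordination (like resharding) harder to express.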