RLHF: Reinforcement Learning from Human Feedback—a method to align language models with human preferences
PPO: Proximal Policy Optimization—the standard RL algorithm used for training the Actor and Critic models in RLHF
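For reference, the clipped surrogate objective commonly optimized in PPO (the standard formulation from the PPO paper, not a detail specific to this document) is:

```latex
L^{\mathrm{CLIP}}(\theta) =
\mathbb{E}_t\!\left[
\min\!\Big( r_t(\theta)\,\hat{A}_t,\;
\mathrm{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t \Big)
\right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}
```

Here \(\hat{A}_t\) is the advantage estimate (produced by the Critic) and \(\epsilon\) is the clipping range.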
Parameter Reallocation: The process of dynamically moving model weights between GPUs during training to change the parallelization configuration
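A minimal sketch of the re-sharding idea behind parameter reallocation, using plain Python lists to stand in for per-GPU weight shards; the layer shape and the 4-way-to-2-way transition are hypothetical, chosen only for illustration:

```python
# Full weight "matrix" of a hypothetical layer: 16 rows of 8 values.
W = [[float(16 * r + c) for c in range(8)] for r in range(16)]

def split_rows(mat, parts):
    """Row-wise split into equal shards (tensor-parallel style)."""
    n = len(mat) // parts
    return [mat[i * n:(i + 1) * n] for i in range(parts)]

# Source layout: 4-way split; each "GPU" holds 4 rows.
shards_tp4 = split_rows(W, 4)

# Reallocation: gather the shards, then re-split for the new 2-way layout.
gathered = [row for shard in shards_tp4 for row in shard]
shards_tp2 = split_rows(gathered, 2)

assert [len(s) for s in shards_tp2] == [8, 8]
assert gathered == W  # no weights lost or reordered
```

In a real system the gather/re-split would be collective GPU communication rather than list slicing, but the invariant is the same: the logical parameter tensor is unchanged while its physical placement moves.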
Model Function Call: A specific computational task in the RLHF loop (e.g., Actor Generation, Critic Training) treated as a node in the dataflow graph
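To make the dataflow-graph framing concrete, here is a toy graph of model function calls and a naive scheduler; the specific call names and dependency edges are illustrative assumptions, not taken from this document:

```python
# Hypothetical RLHF dataflow: each function call maps to its upstream deps.
dataflow = {
    "actor_generation": [],
    "reward_inference": ["actor_generation"],
    "critic_inference": ["actor_generation"],
    "actor_training":   ["reward_inference", "critic_inference"],
    "critic_training":  ["reward_inference", "critic_inference"],
}

def topo_order(graph):
    """Topological sort: a call is runnable once all its deps have finished."""
    done, order = set(), []
    while len(order) < len(graph):
        for node, deps in graph.items():
            if node not in done and all(d in done for d in deps):
                done.add(node)
                order.append(node)
    return order

print(topo_order(dataflow))
```

A scheduler operating on this graph can run `reward_inference` and `critic_inference` concurrently, since neither depends on the other.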
3D Parallelism: Combining Data Parallelism (DP), Tensor Parallelism (TP), and Pipeline Parallelism (PP) to distribute large models
Device Mesh: A logical grid of GPUs representing the available hardware resources
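The two entries above can be sketched together: a device mesh assigns each GPU rank a coordinate in a (DP, PP, TP) grid, and each parallelism group is a slice of that grid. The 8-GPU size and 2x2x2 degrees below are hypothetical:

```python
from itertools import product

DP, PP, TP = 2, 2, 2            # hypothetical degrees; DP * PP * TP = 8 GPUs
ranks = list(range(DP * PP * TP))

# Device mesh: rank -> (dp, pp, tp) coordinate, row-major order.
mesh = {r: (r // (PP * TP), (r // TP) % PP, r % TP) for r in ranks}

# Tensor-parallel groups: ranks that share the same (dp, pp) coordinates.
tp_groups = [[r for r in ranks if mesh[r][:2] == (d, p)]
             for d, p in product(range(DP), range(PP))]
print(tp_groups)  # [[0, 1], [2, 3], [4, 5], [6, 7]]
```

Data-parallel and pipeline-parallel groups are obtained the same way by fixing the other two coordinates, which is why changing any one degree requires remapping the whole mesh.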
MCMC: Markov Chain Monte Carlo—a search algorithm used here to explore the vast space of possible execution plans
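A minimal Metropolis-style search over a discrete plan space, as a sketch of how MCMC can explore execution plans; the candidate space and the toy cost model are invented for illustration and are not the document's actual objective:

```python
import math
import random

random.seed(0)

# Hypothetical cost model over plans (dp, pp, tp); lower is better.
def cost(plan):
    dp, pp, tp = plan
    return abs(dp * pp * tp - 8) * 10 + pp + tp  # toy objective

candidates = [(d, p, t) for d in (1, 2, 4, 8)
                        for p in (1, 2, 4)
                        for t in (1, 2, 4)]

def mcmc_search(steps=500, temperature=1.0):
    plan = random.choice(candidates)
    best = plan
    for _ in range(steps):
        proposal = random.choice(candidates)  # symmetric proposal
        delta = cost(proposal) - cost(plan)
        # Metropolis rule: always accept improvements; accept worse
        # plans with probability exp(-delta / temperature).
        if delta <= 0 or random.random() < math.exp(-delta / temperature):
            plan = proposal
        if cost(plan) < cost(best):
            best = plan
    return best

print(mcmc_search())
```

The temperature controls how willingly the search accepts worse plans, letting it escape local minima in a cost landscape too large to enumerate.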
DPO: Direct Preference Optimization—an alternative alignment algorithm to PPO
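For reference, the DPO loss (standard form from the DPO paper, not specific to this document), where \(y_w\) and \(y_l\) are the preferred and dispreferred responses:

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) =
-\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
\left[ \log \sigma\!\left(
\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
- \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
\right) \right]
```

Unlike PPO, this requires no separate reward or critic model, which is why its dataflow graph is much simpler.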
RPC: Remote Procedure Call—the communication mechanism between the master worker and the model workers
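A self-contained sketch of the master/worker RPC pattern using Python's standard-library `xmlrpc` (the actual system likely uses a different transport; the function name `run_function_call` and the payload are hypothetical):

```python
import threading
from xmlrpc.client import ServerProxy
from xmlrpc.server import SimpleXMLRPCServer

# "Model worker": exposes a function the master can invoke remotely.
server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)
port = server.server_address[1]
server.register_function(lambda name: f"executed {name}", "run_function_call")
threading.Thread(target=server.serve_forever, daemon=True).start()

# "Master worker": dispatches a model function call over RPC.
proxy = ServerProxy(f"http://127.0.0.1:{port}")
result = proxy.run_function_call("actor_generation")
print(result)  # executed actor_generation

server.shutdown()
```

The master only sees function names and results; where the worker's parameters physically live (and how they were reallocated) is invisible at this layer.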