RLHF: Reinforcement Learning from Human Feedback—a method to align LLMs with human intent using a reward model
PPO: Proximal Policy Optimization—the standard RL algorithm used here, involving an Actor (policy) and Critic (value function)
Pipeline Bubbles: Idle periods in pipeline parallelism during which GPUs wait for activations from earlier stages or gradients from later stages
1F1B: One-Forward-One-Backward—a standard pipeline parallelism schedule that alternates one forward micro-batch with one backward micro-batch per stage to bound activation memory
Data Skewness: The phenomenon where a small percentage of generated responses are significantly longer than average, causing load imbalance
Micro-batches: Small chunks of a data batch processed sequentially in pipeline parallelism to reduce bubble size
Actor: The main LLM being trained to generate responses
Critic: The value model that estimates the expected reward of the Actor's actions
Reference Model: A frozen copy of the Actor used to calculate KL divergence penalties
Reward Model: A frozen model that assigns scores to the Actor's generated responses
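The roles of the four models above can be sketched in code. This is a minimal illustration of one common PPO-RLHF formulation (per-token KL penalty against the Reference Model, with the Reward Model's scalar score added at the final token, and advantages estimated from the Critic's values via GAE); the function names `kl_penalized_rewards` and `advantages` are hypothetical, and details vary across implementations.

```python
def kl_penalized_rewards(actor_logprobs, ref_logprobs, reward_score, beta=0.1):
    """Per-token rewards for PPO in RLHF.

    Each token incurs a KL penalty -beta * (log p_actor - log p_ref)
    against the frozen Reference Model; the Reward Model's scalar
    score is added at the final token of the generated response.
    """
    rewards = [-beta * (a - r) for a, r in zip(actor_logprobs, ref_logprobs)]
    rewards[-1] += reward_score
    return rewards


def advantages(rewards, values, gamma=1.0, lam=0.95):
    """Generalized Advantage Estimation from the Critic's value estimates.

    The Actor (policy) is then updated to raise the probability of
    tokens with positive advantage, subject to PPO's clipped objective.
    """
    adv, running = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        adv[t] = running
    return adv
```

In a full pipeline, `actor_logprobs` and `ref_logprobs` come from forward passes of the Actor and Reference Model over the generated tokens, `reward_score` from the Reward Model, and `values` from the Critic; the PPO loss then consumes the resulting advantages.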