RLHF: Reinforcement Learning from Human Feedback—a technique for aligning LLMs by optimizing them against a reward model trained on human preference data
PPO: Proximal Policy Optimization—a standard RL algorithm used for fine-tuning LLMs
GRPO: Group Relative Policy Optimization—an efficiency-focused variant of PPO that estimates the advantage baseline from the rewards of a group of sampled responses, eliminating the separate critic model
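A toy Python sketch of the group-relative baseline described above (the function name and exact normalization are illustrative, not a reference implementation): rewards for several completions of the same prompt are normalized by the group's mean and standard deviation, so no critic network is needed.

```python
def group_relative_advantages(rewards, eps=1e-8):
    """Compute per-response advantages from group statistics alone.

    Each response's advantage is its reward minus the group mean,
    divided by the group standard deviation (eps avoids division
    by zero when all rewards are identical). This replaces the
    learned value/critic baseline used by standard PPO.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Example: reward-model scores for 4 sampled completions of one prompt.
advs = group_relative_advantages([1.0, 0.5, 0.5, 0.0])
```

Because the baseline is computed per prompt group, advantages sum to (approximately) zero within each group: above-average responses are reinforced, below-average ones are penalized.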
DAG: Directed Acyclic Graph—a graph with directed edges and no cycles, used here to represent the workflow: nodes are tasks and edges are dependencies
FSDP: Fully Sharded Data Parallel—a memory-efficient training strategy that shards model parameters across GPUs
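A minimal pure-Python illustration of the sharding idea behind FSDP (this is a conceptual toy, not the PyTorch FSDP API): a flat parameter vector is split across ranks so each GPU stores only its slice, and the full vector is reassembled by an all-gather before it is needed for compute.

```python
def shard(params, world_size, rank):
    """Return this rank's contiguous slice of the parameter list,
    zero-padded so every rank holds an equal-sized shard."""
    per_rank = -(-len(params) // world_size)  # ceiling division
    padded = params + [0.0] * (per_rank * world_size - len(params))
    return padded[rank * per_rank:(rank + 1) * per_rank]

def all_gather(shards, orig_len):
    """Reassemble the full parameter vector from every rank's shard,
    dropping any padding added during sharding."""
    full = [p for s in shards for p in s]
    return full[:orig_len]

# Example: 5 parameters sharded across 2 "GPUs"; each rank stores
# only ~half the memory, and gathering recovers the original vector.
params = [0.1, 0.2, 0.3, 0.4, 0.5]
shards = [shard(params, 2, r) for r in range(2)]
restored = all_gather(shards, len(params))
```

The memory saving comes from each rank persisting only its shard between uses; the real FSDP additionally shards gradients and optimizer state and overlaps the gather with computation.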
Ray: An open-source unified framework for scaling AI and Python applications, used here for resource management
vLLM: A high-throughput library for LLM inference and serving
SGLang: Structured Generation Language—an inference engine optimized for complex prompting workflows
OOM: Out Of Memory—an error raised when a process exhausts available memory (in this setting, usually GPU memory)
colocated architecture: A system design in which generation and training alternate on the same GPUs rather than running on separate, dedicated clusters