RLHF: Reinforcement Learning from Human Feedback—a method to align LLMs with human intent using preference data.
PPO: Proximal Policy Optimization—an RL algorithm used to update the model policy while keeping each update close to the previous policy via a clipped objective.
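The clipping mentioned above can be sketched for a single action. This is a minimal illustration of PPO's clipped surrogate loss, not a full training loop; the function name and scalar formulation are my own.

```python
def ppo_clip_loss(ratio, advantage, eps=0.2):
    # ratio = pi_new(a|s) / pi_old(a|s); advantage estimates how much
    # better the action was than expected. Taking the minimum of the
    # clipped and unclipped terms removes any incentive to push the
    # ratio outside [1 - eps, 1 + eps], keeping updates conservative.
    unclipped = ratio * advantage
    clipped = max(min(ratio, 1 + eps), 1 - eps) * advantage
    return -min(unclipped, clipped)  # negated because we minimize
```

With `eps=0.2`, a ratio of 2.0 on a positive advantage is clipped to 1.2, so the gradient stops rewarding further policy drift.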
GQA: Grouped-Query Attention—an optimization where query heads are divided into groups, with each group sharing a single key-value head, reducing KV-cache size and memory bandwidth during inference.
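A minimal NumPy sketch of the sharing pattern: the KV heads are simply repeated across each query group before standard attention. Shapes and function name are illustrative, not from any particular library.

```python
import numpy as np

def grouped_query_attention(q, k, v, n_kv_heads):
    # q: (n_q_heads, seq, d); k, v: (n_kv_heads, seq, d).
    # Each group of n_q_heads // n_kv_heads query heads shares one
    # KV head, shrinking the KV cache by the same factor.
    n_q_heads, seq, d = q.shape
    group = n_q_heads // n_kv_heads
    k = np.repeat(k, group, axis=0)  # broadcast KV heads to all query heads
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)
    return weights @ v
```

With 8 query heads and 2 KV heads, only a quarter of the keys and values must be cached and streamed from memory.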
Ghost Attention (GAtt): A fine-tuning technique where instructions are artificially concatenated to all user messages during training to improve multi-turn instruction following.
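The data-preparation side of GAtt can be sketched in a few lines. This only shows the concatenation step described above; the function name and tuple format are hypothetical, and the full technique also involves zeroing the loss on earlier turns.

```python
def ghost_attention_format(instruction, dialogue):
    # During training, the instruction is concatenated to every user
    # turn so the model learns to respect it across the whole
    # conversation, not just the turn where it first appeared.
    return [(instruction + " " + user, assistant)
            for user, assistant in dialogue]
```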
Rejection Sampling: A fine-tuning method where the model generates multiple outputs per prompt, the highest-scoring output is selected by a reward model, and the model is retrained on these 'gold' samples.
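The selection step reduces to best-of-n sampling. A minimal sketch, assuming hypothetical `generate` and `reward_model` callables:

```python
def best_of_n(prompt, generate, reward_model, n=4):
    # Sample n candidate completions and keep the one the reward model
    # scores highest; the winner becomes a new SFT target ("gold"
    # sample) for the next fine-tuning round.
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=reward_model)
```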
SFT: Supervised Fine-Tuning—training the model on high-quality instruction-response pairs.
RoPE: Rotary Positional Embeddings—a method that encodes token positions by rotating query and key vectors, so that attention scores depend on relative position.
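A minimal NumPy sketch of applying the rotation to one vector, using the "rotate-half" pairing (dimension i paired with i + d/2); real implementations vary in how dimensions are paired and typically operate on batched tensors.

```python
import numpy as np

def apply_rope(x, pos, base=10000.0):
    # x: (d,) query or key vector at sequence position pos, d even.
    # Each dimension pair is rotated by a position-dependent angle;
    # because rotations compose, the dot product of a rotated query
    # and key depends only on their relative position.
    d = x.shape[0]
    half = d // 2
    freqs = base ** (-np.arange(half) / half)
    angles = pos * freqs
    x1, x2 = x[:half], x[half:]
    return np.concatenate([x1 * np.cos(angles) - x2 * np.sin(angles),
                           x1 * np.sin(angles) + x2 * np.cos(angles)])
```

At position 0 the rotation is the identity, and rotations preserve vector norms, which is one reason RoPE tends to be numerically well behaved.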
SwiGLU: A gated activation function (Swish-gated linear unit) used in the feed-forward layers of the Transformer.
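A minimal sketch of a SwiGLU feed-forward block in NumPy; weight names are illustrative, and real implementations fold in biases, batching, and chosen hidden sizes.

```python
import numpy as np

def swiglu_ffn(x, W, V, W2):
    # SwiGLU feed-forward block: FFN(x) = (Swish(x @ W) * (x @ V)) @ W2,
    # where Swish(z) = z * sigmoid(z) (also called SiLU). The second
    # projection V acts as a learned gate on the activated path.
    gate = x @ W
    gate = gate * (1.0 / (1.0 + np.exp(-gate)))  # Swish / SiLU
    return (gate * (x @ V)) @ W2
```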
RMSNorm: Root Mean Square Normalization—a normalization technique applied to the inputs of transformer layers.
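RMSNorm is compact enough to write out in full; a minimal sketch in NumPy (function name mine):

```python
import numpy as np

def rmsnorm(x, weight, eps=1e-6):
    # Normalize by the root-mean-square of x along the last axis
    # (no mean subtraction and no bias, unlike LayerNorm), then
    # apply a learned per-dimension scale.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * weight
```

Dropping the mean-centering step makes RMSNorm slightly cheaper than LayerNorm while working comparably well in practice.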
KV cache: Key-Value cache—storing attention keys and values to speed up autoregressive generation.
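The caching idea can be sketched for a single attention head: each decoding step appends one key/value pair, so attention over the full prefix never recomputes earlier projections. The class below is an illustrative toy, not a real framework API.

```python
import numpy as np

class KVCache:
    # Append-only cache for one attention head: at each decoding step
    # the new token's key and value are stored, so earlier tokens'
    # K/V projections are computed exactly once.
    def __init__(self):
        self.keys, self.values = [], []

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)

    def attend(self, q):
        # Attend from the current query over all cached positions.
        K = np.stack(self.keys)            # (t, d)
        V = np.stack(self.values)          # (t, d)
        scores = K @ q / np.sqrt(q.shape[0])
        w = np.exp(scores - scores.max())
        w /= w.sum()
        return w @ V
```

This trades memory for compute: cache size grows linearly with sequence length, which is exactly the cost that GQA (above) reduces by sharing KV heads.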