COOL RLHF: Conditional Online RLHF—a method using conditional reward models to manage conflicting preferences and multi-round PPO to reduce reward hacking
PPO: Proximal Policy Optimization—a reinforcement learning algorithm used to fine-tune the model policy based on reward signals
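To make the PPO entry concrete, here is a minimal numpy sketch of PPO's clipped surrogate loss, the core of the algorithm; this is an illustrative textbook form, not the implementation used for InternLM2, and all names and shapes are assumptions:

```python
import numpy as np

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped surrogate PPO objective (negated, so lower is better)."""
    # Probability ratio between the current and the old (behavior) policy
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    # Clipping the ratio keeps policy updates close to the old policy
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Take the pessimistic bound, then negate for gradient descent
    return -np.minimum(unclipped, clipped).mean()
```

When the new and old policies agree, the ratio is 1 and the loss reduces to the negated mean advantage; large ratios are clipped to `1 ± clip_eps`.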
GQA: Grouped-Query Attention—an attention mechanism in which groups of query heads share a single key/value head, shrinking the KV cache and memory usage during inference; essential for long contexts
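The key/value sharing behind GQA can be sketched in a few lines of numpy; this is a simplified single-batch illustration of the mechanism (hypothetical shapes, no masking or projections), not the model's actual attention code:

```python
import numpy as np

def grouped_query_attention(q, k, v, n_kv_heads):
    """q: (n_heads, seq, d); k, v: (n_kv_heads, seq, d) with n_kv_heads < n_heads."""
    n_heads, seq, d = q.shape
    group = n_heads // n_kv_heads
    # Each group of query heads shares one key/value head, so the KV cache
    # stores only n_kv_heads heads instead of n_heads
    k = np.repeat(k, group, axis=0)
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)
    # Numerically stable softmax over the key dimension
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

With `n_kv_heads == n_heads` this reduces to standard multi-head attention; with `n_kv_heads == 1` it becomes multi-query attention.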
InternEvo: The efficient, in-house training framework used to train InternLM2, supporting 4D parallelism
Needle-in-a-Haystack: A benchmark testing a model's ability to retrieve a specific piece of information ('needle') hidden within a very long context ('haystack')
MFU: Model FLOPs Utilization—a metric measuring hardware efficiency during training as the ratio of achieved model FLOPs throughput to the hardware's theoretical peak
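As a rough illustration of how MFU is computed, the sketch below uses the common `6 * N` FLOPs-per-token approximation for training (forward plus backward pass); the approximation, function name, and any numbers plugged in are assumptions, not figures from the report:

```python
def model_flops_utilization(tokens_per_sec, model_params, peak_flops_per_sec):
    """MFU = achieved model FLOPs throughput / theoretical peak throughput."""
    # Approximate training cost: ~6 FLOPs per parameter per token
    achieved_flops_per_sec = 6 * model_params * tokens_per_sec
    return achieved_flops_per_sec / peak_flops_per_sec
```

An MFU of 0.5 would mean the training run sustains half of the accelerator's peak FLOPs on useful model computation.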
SwiGLU: A gated activation function (Swish-Gated Linear Unit) used in the feed-forward layers of LLaMA and InternLM2 to improve training stability and performance
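The gating in SwiGLU can be written out directly; this is a minimal numpy sketch of the standard formulation, Swish(xW) elementwise-multiplied with a linear branch xV, with hypothetical weight names and no bias terms:

```python
import numpy as np

def swiglu(x, W, V, W2):
    """SwiGLU feed-forward block: (Swish(x @ W) * (x @ V)) @ W2."""
    gate = x @ W
    # Swish with beta = 1 (also called SiLU): z * sigmoid(z)
    swish = gate / (1.0 + np.exp(-gate))
    # Elementwise gate applied to the second linear branch, then project back
    return (swish * (x @ V)) @ W2
```

The gating branch lets the network modulate the feed-forward signal smoothly, which is credited with better training stability than plain ReLU FFNs.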
RMSNorm: Root Mean Square Layer Normalization—a lighter alternative to LayerNorm that normalizes activations by their root mean square without mean centering, used throughout the transformer blocks
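RMSNorm is simple enough to state in full; the sketch below is the standard formulation (scale by the inverse RMS, then a learned gain), not code taken from the model:

```python
import numpy as np

def rmsnorm(x, gain, eps=1e-6):
    """RMSNorm: rescale by the root mean square, no mean subtraction."""
    # Unlike LayerNorm, only second-moment statistics are used
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * gain
```

Dropping the mean-centering step saves computation while matching LayerNorm's quality in practice, which is why it is widely used in LLaMA-style architectures.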