MMLU: Massive Multitask Language Understanding—a benchmark covering 57 subjects like math, history, and law to test general knowledge and problem solving
MMMU: Massive Multi-discipline Multimodal Understanding—a benchmark requiring college-level subject knowledge to answer questions about images
Chain-of-Thought (CoT): A prompting technique where the model generates intermediate reasoning steps before the final answer
RLHF: Reinforcement Learning from Human Feedback—a method to align model behavior with human preferences using reward models
TPU: Tensor Processing Unit—Google's custom application-specific integrated circuit (ASIC) for machine learning
GSM8K: Grade School Math 8K—a dataset of high-quality, linguistically diverse grade school math word problems
BLEURT: A learned evaluation metric for natural language generation tasks (such as translation), trained to correlate with human judgment
USM: Universal Speech Model—a family of speech models used here to encode audio features for Gemini
visual encoding: Converting visual data (images/video) into vector representations the model can process
discrete image tokens: Representing parts of an image as discrete codes drawn from a fixed vocabulary, so the model can generate images token by token, the same way it generates text
context window: The maximum amount of input, measured in tokens, that a model can consider at one time (here, 32k tokens)
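
To make the context window entry concrete, here is a minimal sketch of what a fixed token limit means in practice: input longer than the window has to be cut down before the model can use it. The function name `fit_to_context` is hypothetical, reading "32k" as 32 * 1024 = 32,768 is an assumption, and keeping the most recent tokens is only one possible truncation policy, not a description of how any particular model handles overflow.

```python
from typing import List

# Assumption: "32k tokens" is read as 32 * 1024; the glossary only says "32k".
CONTEXT_WINDOW = 32 * 1024


def fit_to_context(token_ids: List[int], limit: int = CONTEXT_WINDOW) -> List[int]:
    """Return at most `limit` tokens, keeping the most recent ones.

    Hypothetical helper for illustration only: it drops the oldest tokens
    when the input is longer than the window.
    """
    if limit <= 0:
        raise ValueError("limit must be positive")
    # A negative slice keeps the last `limit` items; shorter inputs pass through unchanged.
    return token_ids[-limit:]


if __name__ == "__main__":
    long_input = list(range(40_000))      # stand-in for 40,000 token IDs
    window = fit_to_context(long_input)
    print(len(window))                    # number of tokens that fit in the window
```

Anything outside the returned slice is invisible to the model for that call, which is why the window size bounds how much text, audio, or image content can be considered at once.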