SFT: Supervised Fine-Tuning—adapting a pre-trained model to a specific task using labeled examples
CE: Cross-Entropy—the standard loss function that maximizes the likelihood of the correct label
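A minimal sketch of the cross-entropy loss for a single example, assuming the model already outputs a normalized probability distribution over classes (real training code would operate on logits and batches):

```python
import math

def cross_entropy(probs, target_idx):
    # Negative log-likelihood of the correct label under the model's
    # predicted distribution; minimizing this maximizes p(target).
    return -math.log(probs[target_idx])

# Model assigns probability 0.7 to the correct class (index 1):
loss = cross_entropy([0.1, 0.7, 0.2], 1)  # -log(0.7) ≈ 0.357
```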
GEM: Game-theoretic Entropy Maximization—the proposed training algorithm that preserves diversity
reverse KL: Reverse Kullback-Leibler divergence (KL(model || data))—a distribution distance metric that tends to be mode-seeking (concentrating the model on a subset of the data's modes) rather than mean-seeking like forward KL, and is often harder to optimize
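The mode-seeking vs. mean-seeking distinction can be seen numerically on a toy discrete example: for a model collapsed onto one of two data modes, forward KL (data || model) is large because the missed mode is heavily penalized, while reverse KL (model || data) stays small. This is an illustrative sketch, not the paper's training objective:

```python
import math

def kl(p, q):
    # KL(p || q) = sum_i p_i * log(p_i / q_i); terms with p_i == 0 contribute 0.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

data  = [0.5, 0.5]    # bimodal "data" distribution
model = [0.99, 0.01]  # model collapsed onto the first mode

forward = kl(data, model)  # KL(data || model): punishes the missed second mode
reverse = kl(model, data)  # KL(model || data): mode collapse is cheap here
```

Here `forward` ≈ 1.61 while `reverse` ≈ 0.64, showing why minimizing reverse KL tolerates a model that covers only some modes.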
test-time scaling: Improving performance during inference by generating multiple samples and selecting the best one (often via a reward model or verifier)
alignment tax: The degradation of a model's general capabilities or pre-trained knowledge resulting from fine-tuning on a specific task
logit: The raw, unnormalized output scores of the neural network before applying the softmax function
Best-of-N: A sampling strategy where N different responses are generated, and the best one is selected based on a scoring mechanism
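A minimal Best-of-N sketch, where `generate` and `score` are hypothetical stand-ins for a sampler and a reward model or verifier:

```python
import itertools

def best_of_n(generate, score, n):
    # Draw n candidate responses and keep the highest-scoring one.
    candidates = [generate() for _ in range(n)]
    return max(candidates, key=score)

# Toy usage: "generate" cycles through canned responses, "score" is length.
pool = itertools.cycle(["ok", "better answer", "meh"])
best = best_of_n(lambda: next(pool), len, 3)  # -> "better answer"
```

Note that Best-of-N only helps when the sampler produces diverse candidates, which is exactly the property diversity-preserving fine-tuning aims to retain for test-time scaling.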
sparse update: Updating only a subset of parameters or token probabilities (specifically pivot tokens) rather than the entire vocabulary distribution
adaptive termination: Stopping the optimization for a specific sample once a condition is met (e.g., target token has highest probability) to prevent overfitting