LoRA: Low-Rank Adaptation—a technique to fine-tune large models by updating only a small set of low-rank matrices instead of all weights
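The low-rank update can be sketched in a few lines (an illustrative sketch with NumPy, not any library's actual implementation; the shapes and the alpha scaling are assumptions):

```python
import numpy as np

# Minimal LoRA sketch: the pretrained weight W is frozen, and only the
# low-rank factors A (down-projection) and B (up-projection) are trained.
# Effective weight: W + (alpha / r) * B @ A, with rank r << d.

rng = np.random.default_rng(0)
d, r = 8, 2                          # hidden size d, LoRA rank r
alpha = 4.0                          # scaling hyperparameter (assumed value)

W = rng.normal(size=(d, d))          # frozen pretrained weight
A = rng.normal(size=(r, d)) * 0.01   # trainable, small random init
B = np.zeros((d, r))                 # trainable, zero init: no drift at start

def lora_forward(x):
    # Base path plus low-rank correction; only A and B would get gradients.
    return x @ W.T + (alpha / r) * (x @ A.T @ B.T)

x = rng.normal(size=(1, d))
print(lora_forward(x).shape)  # (1, 8)
```

Because B starts at zero, the adapted model initially matches the base model exactly, which is why zero-initializing one factor is the conventional choice.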
CoT: Chain-of-Thought—a prompting technique where the model generates intermediate reasoning steps before the final answer
KV-cache: Key-Value cache—storage of pre-computed attention representations for previous tokens to speed up generation; a major memory consumer in long contexts
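A toy illustration of why the cache grows with context length (a sketch only, not a serving implementation; the single-head setup and shapes are assumptions):

```python
import numpy as np

# KV-cache sketch: keys/values for past tokens are stored once, so each new
# token computes only its own K and V, then attends over the cached history.

rng = np.random.default_rng(1)
d = 4
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

k_cache, v_cache = [], []

def step(x_t):
    # Append this token's K and V to the cache instead of recomputing history.
    k_cache.append(x_t @ Wk)
    v_cache.append(x_t @ Wv)
    q = x_t @ Wq
    K = np.stack(k_cache)            # (t, d): memory grows linearly with t
    V = np.stack(v_cache)
    attn = softmax(q @ K.T / np.sqrt(d))
    return attn @ V

for t in range(3):
    out = step(rng.normal(size=d))
print(len(k_cache))  # 3 cached key vectors after 3 decode steps
```

The linear growth of `k_cache`/`v_cache` with sequence length is what makes the KV-cache the dominant memory cost in long contexts.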
SFT: Supervised Fine-Tuning—training the model on labeled examples (reasoning traces) before RL alignment
GRPO: Group Relative Policy Optimization—an RL algorithm that optimizes policies based on the relative performance of a group of outputs for the same prompt
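The group-relative part can be shown with a small sketch (the reward values and epsilon are illustrative assumptions):

```python
import numpy as np

# GRPO advantage sketch: score a group of sampled outputs for the SAME prompt,
# then normalize rewards within the group. The group mean serves as the
# baseline, so no separate value/critic network is needed.

def group_relative_advantages(rewards, eps=1e-8):
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Four completions for one prompt, scored 1.0 if correct, 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
adv = group_relative_advantages(rewards)
print(adv)  # correct outputs get positive advantage, incorrect negative
```

Outputs that beat the group average are reinforced; those below it are discouraged, entirely from relative comparison within the group.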
Budget Forcing: A technique proposed in this work that uses RL rewards to penalize generations whose length exceeds an assigned token-count bucket
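One way such a length-bucket penalty could look (a hypothetical sketch: the bucket size, penalty weight, and linear overflow scaling are assumptions, not the proposed reward):

```python
# Hypothetical budget-forcing reward: full task reward when the generation
# fits its token bucket, with a penalty that grows with the overflow.

def budgeted_reward(correct, n_tokens, budget, penalty=0.5):
    base = 1.0 if correct else 0.0
    # Penalize only tokens beyond the budget, capped at `penalty`.
    overflow = max(0, n_tokens - budget)
    return base - penalty * min(1.0, overflow / budget)

print(budgeted_reward(True, 900, 1024))   # under budget: 1.0
print(budgeted_reward(True, 2048, 1024))  # 100% overflow: 0.5
```

Under such a reward, the policy is pushed to solve the task within its assigned bucket rather than to reason at unbounded length.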
Switcher: A lightweight classifier module that decides whether a prompt requires complex reasoning or simple generation
Test-Time Scaling: Improving performance by generating multiple solutions (streams) at inference time and verifying/voting to select the best one
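The voting half of this can be sketched with stubbed answers (the candidate strings are made up; a real system would sample the streams from the model and might use a learned verifier instead of plain voting):

```python
from collections import Counter

# Test-time scaling sketch via majority vote (self-consistency): generate
# several candidate final answers, then select the most common one.

def majority_vote(answers):
    (best, _), = Counter(answers).most_common(1)
    return best

# Final answers extracted from 5 independently sampled streams.
streams = ["42", "41", "42", "42", "7"]
print(majority_vote(streams))  # "42"
```

Accuracy typically improves with more streams because independent errors rarely agree, while correct solutions tend to converge on the same answer.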