LRM: Large Reasoning Model—a model specifically trained (often via RL) to generate long Chain-of-Thought reasoning paths (e.g., OpenAI o1, DeepSeek-R1)
CoT: Chain-of-Thought—a sequence of intermediate reasoning steps generated by a model before the final answer
Latent Reasoning: Performing reasoning in the model's hidden states (implicitly) rather than as explicit text tokens, which shortens the generated sequence
SFT: Supervised Fine-Tuning—training models on labeled data; here specifically used with variable-length CoT data to teach efficiency
TTS: Test-Time Scaling—enhancing performance during inference by generating more samples (horizontal) or longer chains (vertical), often at the cost of efficiency
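The horizontal form of test-time scaling can be sketched as best-of-N sampling: draw several candidate answers and keep the one a scorer prefers. This is a minimal illustration, not any specific system's implementation; `sample_answer` and `score` are hypothetical stand-ins for the model call and the verifier/reward model.

```python
def best_of_n(sample_answer, score, n=8):
    """Horizontal test-time scaling sketch: sample n candidates,
    return the one the scorer ranks highest. `sample_answer` and
    `score` are placeholder callables, not a real API."""
    candidates = [sample_answer() for _ in range(n)]
    return max(candidates, key=score)
```

Note the efficiency cost mentioned above: inference compute grows linearly with `n`, which is exactly the trade-off efficient-reasoning methods try to avoid.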
GRPO: Group Relative Policy Optimization—an RL algorithm used to train reasoning models (referenced here in the context of THINKPRUNE)
Token Budget: A constraint on the number of tokens a model is allowed to generate, used to enforce concise reasoning
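A token budget is typically enforced as a hard cap inside the decoding loop. The sketch below is illustrative only; `generate_next_token` is a hypothetical stand-in for the model's next-token call, and `eos` for the end-of-sequence token id.

```python
def generate_with_budget(prompt_tokens, generate_next_token, budget=256, eos=0):
    """Decode until the end-of-sequence token appears or the token
    budget is exhausted, whichever comes first (illustrative sketch)."""
    out = []
    while len(out) < budget:
        tok = generate_next_token(prompt_tokens + out)
        out.append(tok)
        if tok == eos:  # model finished before hitting the budget
            break
    return out
```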
Quantization: Reducing the precision of model parameters (e.g., from 16-bit to 8-bit or 4-bit) to reduce memory usage
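The precision reduction can be illustrated with symmetric 8-bit quantization: map the largest-magnitude weight to the int8 range and store integers plus one scale factor. This is a minimal sketch of the general idea, not any particular library's quantization scheme.

```python
def quantize_int8(weights):
    """Symmetric uniform quantization: map max |w| to 127 (sketch)."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]  # stored as 8-bit integers
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from integers + scale."""
    return [v * scale for v in q]

w = [0.5, -1.27, 0.02]
q, s = quantize_int8(w)        # q = [50, -127, 2]
w_hat = dequantize(q, s)       # close to w, up to quantization error
```

Memory drops because each weight is stored in 8 bits instead of 16, at the cost of the small rounding error visible in `w_hat`.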
Process Reward Model: A reward model that evaluates the correctness of intermediate reasoning steps rather than just the final answer