Entropy Minimization (EM): A training or inference objective that forces the model's probability distribution to become sharper (more confident), concentrating mass on fewer outputs
EM-FT: Unsupervised fine-tuning where the model is trained to minimize the token-level entropy of its own sampled outputs
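The EM-FT objective can be sketched as the mean token-level entropy over the model's own sampled outputs, which gradient descent then minimizes. This is a minimal illustration on fixed probability vectors, not the paper's actual training loop:

```python
import numpy as np

def em_ft_loss(step_probs):
    # EM-FT objective sketch: average token-level entropy across
    # the decoding steps of a sampled output. Training would
    # backpropagate through this; here we only evaluate it.
    ents = [-(p * np.log(p + 1e-12)).sum() for p in step_probs]
    return float(np.mean(ents))

# A sharper (more confident) distribution yields a lower loss.
sharp = em_ft_loss([np.array([0.9, 0.05, 0.05])])
flat = em_ft_loss([np.array([1/3, 1/3, 1/3])])
```

Minimizing this loss pushes probability mass onto fewer tokens, which is exactly the "sharper distribution" behavior described above.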
EM-RL: Reinforcement learning where the reward signal is solely the negative entropy (sequence or token level) of the generated sequence
EM-INF: Inference-time method that adjusts logits during decoding to reduce entropy without updating model parameters
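One simple way to realize an inference-time entropy reduction is temperature search over the logits at each decoding step; the scheme below is a hypothetical sketch (the actual EM-INF adjustment may differ, e.g. gradient steps on the logits), but it shows the key property: entropy drops with no parameter update.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    z = logits / temperature
    z = z - z.max()  # numerical stability
    p = np.exp(z)
    return p / p.sum()

def entropy(p):
    return float(-(p * np.log(p + 1e-12)).sum())

def sharpen_logits(logits, target_entropy=0.5):
    # Try progressively lower temperatures until the next-token
    # distribution's entropy falls below the target. The model's
    # weights are never touched (hypothetical scheme).
    for t in (1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1):
        if entropy(softmax(logits, t)) <= target_entropy:
            return logits / t
    return logits / 0.1

logits = np.array([2.0, 1.0, 0.5, 0.1])
h_before = entropy(softmax(logits))
h_after = entropy(softmax(sharpen_logits(logits)))
```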
GRPO: Group Relative Policy Optimization—an RL algorithm that normalizes rewards within a group of samples for the same prompt to reduce variance
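The group normalization at the heart of GRPO can be sketched in a few lines; one common formulation (implementations vary in details such as the std clamp) standardizes each reward against the group it was sampled in:

```python
import statistics

def grpo_advantages(rewards):
    # Normalize rewards within the group of samples drawn for one
    # prompt: advantage = (r - group mean) / group std. The fallback
    # to 1.0 avoids dividing by zero when all rewards are equal.
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0
    return [(r - mu) / sigma for r in rewards]

adv = grpo_advantages([1.0, 0.0, 0.0, 1.0])
# Two above-average samples get positive advantage, two get negative.
```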
RLOO: Reinforce Leave-One-Out—a REINFORCE variant that, for each sampled output, subtracts the mean reward of the other samples for the same prompt as a baseline to reduce variance
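The leave-one-out baseline is equally compact; for each sample the baseline is the average reward of all the *other* samples for that prompt:

```python
def rloo_advantages(rewards):
    # For sample i: advantage = r_i - mean(rewards excluding r_i).
    k = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]

adv = rloo_advantages([1.0, 0.0, 0.0, 1.0])
```

Unlike GRPO's group normalization, each sample's baseline here excludes its own reward, which keeps the gradient estimator unbiased.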
Self-consistency: An inference strategy that samples multiple reasoning paths and selects the most frequent final answer (majority voting)
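The majority-voting step of self-consistency reduces to counting final answers across sampled reasoning paths (the answer-extraction step from each path is assumed to have already happened):

```python
from collections import Counter

def self_consistency(final_answers):
    # Majority vote over the final answers extracted from multiple
    # sampled reasoning paths; ties go to the first-seen answer.
    return Counter(final_answers).most_common(1)[0][0]

winner = self_consistency(["42", "41", "42", "42", "7"])
```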
Token-level entropy: The entropy of the probability distribution over the vocabulary at a specific generation step
Trajectory-level entropy: The entropy of the distribution over entire sequences (estimated from the negative log-probabilities of sampled sequences)
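The two entropy notions above can be made concrete: token-level entropy is computed exactly from one step's next-token distribution, while trajectory-level entropy over all sequences is intractable to enumerate and is instead Monte Carlo estimated as the average negative sequence log-probability. A minimal sketch:

```python
import math

def token_entropy(probs):
    # Exact entropy of the next-token distribution at one step.
    return -sum(p * math.log(p) for p in probs if p > 0)

def trajectory_entropy_estimate(seq_logprobs):
    # Monte Carlo estimate: H ≈ -(1/N) * sum_i log p(sequence_i),
    # averaged over N sampled sequences.
    return -sum(seq_logprobs) / len(seq_logprobs)

h_tok = token_entropy([0.5, 0.5])                          # ln 2
h_traj = trajectory_entropy_estimate([-3.2, -2.8, -3.0])
```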
SciCode: A challenging benchmark of scientific coding tasks that require translating physics and math knowledge into working code
Minerva: A dataset of challenging math problems used for evaluating reasoning capabilities