RM: Reward Model—a model trained to score text based on how well it aligns with human preferences (e.g., helpfulness, safety)
Decoding-time alignment: Techniques to guide an LLM towards preferred outputs during the inference phase (generation) rather than during training
Best-of-N: A baseline method where the model generates N complete responses and the Reward Model selects the highest-scoring one
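A minimal sketch of Best-of-N, using toy stand-ins for both models (`generate` and `reward_model` here are hypothetical placeholders, not the actual LLM or RM):

```python
import random

def reward_model(text):
    # Hypothetical stand-in for a learned reward model:
    # scores by length, purely for illustration.
    return len(text)

def generate(prompt):
    # Hypothetical stand-in for sampling one complete LLM response.
    return prompt + " " + random.choice(["short", "a bit longer", "the longest answer"])

def best_of_n(prompt, n=4):
    # Sample N full responses, then let the reward model pick the winner.
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=reward_model)
```

Note that the N generations are independent, which is why BoN is simple but expensive: cost grows linearly in N, and all scoring happens only after full responses exist.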
Rejection Sampling: A statistical method to sample from a target distribution by generating candidates from a proposal distribution and accepting them with a specific probability
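A minimal sketch of classical rejection sampling in its textbook form (the densities and bound `M` below are illustrative choices, not taken from any particular paper):

```python
import random

def rejection_sample(target_pdf, proposal_sample, proposal_pdf, M):
    # Draw a candidate x from the proposal, accept it with probability
    # target_pdf(x) / (M * proposal_pdf(x)); M must bound the ratio
    # target_pdf / proposal_pdf everywhere.
    while True:
        x = proposal_sample()
        if random.random() < target_pdf(x) / (M * proposal_pdf(x)):
            return x

# Example: sample from the triangular density p(x) = 2x on [0, 1]
# using a uniform proposal q(x) = 1 and bound M = 2.
sample = rejection_sample(
    target_pdf=lambda x: 2 * x,
    proposal_sample=random.random,
    proposal_pdf=lambda x: 1.0,
    M=2.0,
)
```

Accepted samples are exact draws from the target; the price is wasted proposals, on average M per accepted sample.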
Predictive Uncertainty: A measure (often entropy) of how unsure the model is about the next token; high uncertainty often signals the start of a new semantic concept
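The entropy measure mentioned above can be computed directly from the model's next-token distribution; a minimal sketch with hand-picked toy distributions:

```python
import math

def token_entropy(probs):
    # Shannon entropy (in nats) of a next-token probability distribution.
    return -sum(p * math.log(p) for p in probs if p > 0)

# A peaked distribution (model is confident) has low entropy;
# a flat distribution (model is unsure) has high entropy.
confident = token_entropy([0.97, 0.01, 0.01, 0.01])
unsure = token_entropy([0.25, 0.25, 0.25, 0.25])  # = log(4), the maximum over 4 tokens
```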
Segment-level generation: Generating text in chunks (multiple tokens) rather than one token at a time or the whole sequence at once
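Combining the two entries above, one way to delimit segments is to cut wherever next-token entropy spikes past a threshold; a sketch under that assumption (the threshold value and input format here are illustrative, not from the source):

```python
import math

def token_entropy(probs):
    # Shannon entropy (in nats) of a next-token distribution.
    return -sum(p * math.log(p) for p in probs if p > 0)

def segment_boundaries(per_step_probs, threshold=1.0):
    # Mark a segment boundary at each step whose next-token entropy
    # exceeds the threshold (high uncertainty ~ start of a new concept).
    return [i for i, probs in enumerate(per_step_probs)
            if token_entropy(probs) > threshold]
```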
BoN: Best-of-N (see above)
RS: Rejection Sampling (see above)
RLHF: Reinforcement Learning from Human Feedback—the standard training pipeline for aligning LLMs
DPO: Direct Preference Optimization—an alignment method that optimizes the policy directly on preference data, without training a separate explicit reward model