R-LLMs: Reasoning Large Language Models—models that generate a long chain-of-thought 'thinking process' before the final answer (e.g., OpenAI o1, DeepSeek-R1)
GRPO: Group Relative Policy Optimization—an online RL algorithm that normalizes rewards within a sampled group of outputs for the same prompt to reduce variance
VeriScore: An automatic evaluation framework for long-form factuality that extracts atomic claims and verifies them using search engine results
DPO: Direct Preference Optimization—an offline method aligning models to preferences without an explicit reward model loop
RLHF: Reinforcement Learning from Human Feedback—aligning models using human preference data
LLM-as-a-judge: Using a strong LLM (e.g., GPT-4) to evaluate the quality of text generated by another model
SFT: Supervised Fine-Tuning—training the model on high-quality examples before applying RL
reward hacking: When an RL agent exploits flaws in the reward function to get a high score without actually solving the task (e.g., generating very short answers to maximize precision)
CoT: Chain-of-Thought—a prompting technique where the model generates intermediate reasoning steps
atomic claim: A single, verifiable fact extracted from a longer sentence or paragraph
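
The within-group normalization in the GRPO entry can be illustrated with a minimal sketch. It assumes the common formulation in which each output's advantage is its reward minus the group mean, divided by the group standard deviation. The function name `group_relative_advantages`, the `eps` stabilizer, and the choice of population standard deviation are illustrative assumptions, not details taken from the source, and the reward values are made up for the example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of outputs sampled for the same prompt.

    Assumption: advantage_i = (r_i - group mean) / (group std + eps).
    Real implementations may use the sample standard deviation instead.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    # eps (hypothetical stabilizer) avoids division by zero when all rewards are equal
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four outputs sampled from the same prompt
rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(rewards)
```

Outputs that score above the group mean receive positive advantages and those below receive negative ones, so the group itself serves as the baseline and no separately learned value function is required in this sketch.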