GRPO: Group Relative Policy Optimization, an RL algorithm that computes each sampled output's advantage by normalizing its reward against the other outputs sampled for the same prompt, which stabilizes training without a separate value model
Slow-thinking models: LLMs trained to generate extended 'thought' processes (reasoning traces) before producing a final answer, similar to OpenAI o1 or DeepSeek-R1
Atomic facts: Small, indivisible statements extracted from a longer text that can be independently verified as true or false
FactScore: A metric that evaluates the factuality of long-form text by breaking it into atomic facts and checking what percentage are supported by a knowledge source
DeBERTa: Decoding-enhanced BERT with disentangled attention—a transformer model often used for natural language understanding tasks like entailment and verification
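KL divergence: Kullback-Leibler divergence, a measure of how much one probability distribution differs from another (it is asymmetric, so not a true distance); used in RL as a penalty that keeps the new policy from drifting too far from the reference policy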
LoRA: Low-Rank Adaptation, a parameter-efficient fine-tuning technique that freezes the pretrained weights and trains only small low-rank matrices added to selected layers
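GPQA: Graduate-Level Google-Proof Q&A, a challenging benchmark of expert-written multiple-choice questions in biology, physics, and chemistry that require graduate-level domain knowledge and reasoning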
SimpleQA: A benchmark designed to measure the factual correctness of LLMs on short, fact-seeking questions that each have a single verifiable answer
Entropy bonus: A term added to the loss function to encourage the model to maintain diversity in its outputs and prevent collapsing to a single repetitive response
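
Several entries above (GRPO, FactScore, KL divergence, entropy bonus) describe small computations. The following is a minimal sketch of each in plain Python, not any particular library's implementation. The function names are hypothetical, the use of the population standard deviation and the `eps` stabilizer in the GRPO step are assumptions, and the KL helper assumes `q` is nonzero wherever `p` is nonzero.

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one prompt's group.

    Assumption: population std plus a small eps to avoid division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def fact_score(supported_flags):
    """Fraction of atomic facts that a verifier marked as supported."""
    if not supported_flags:
        raise ValueError("need at least one atomic fact")
    return sum(supported_flags) / len(supported_flags)


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same outcomes.

    Assumption: q[i] > 0 wherever p[i] > 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


# Four sampled answers to one prompt, with binary rewards.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])

# Three of four atomic facts were judged supported.
score = fact_score([True, True, False, True])

# KL is asymmetric: swapping the arguments changes the value.
policy, reference = [0.5, 0.5], [0.9, 0.1]
forward_kl = kl_divergence(policy, reference)
reverse_kl = kl_divergence(reference, policy)

# A uniform distribution has higher entropy than a collapsed one.
uniform_entropy = entropy([0.5, 0.5])
collapsed_entropy = entropy([1.0, 0.0])
```

In an RL objective of this kind, the KL term is typically subtracted as a penalty and the entropy term added as a bonus, each scaled by its own coefficient; those coefficients are not specified in this glossary.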