RLVR: Reinforcement Learning with Verifiable Rewards—using objective correctness checks (e.g., whether code compiles or a math answer matches the gold value) as reward signals
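A verifiable reward of this kind can be sketched with Python's built-in `compile()` serving as the objective check; the function name `compiles_reward` is illustrative, not from any particular codebase:

```python
def compiles_reward(code: str) -> float:
    """Verifiable reward: 1.0 if the candidate Python code parses, else 0.0.

    No model or human judgment is involved—the check is fully objective.
    """
    try:
        compile(code, "<candidate>", "exec")
        return 1.0
    except SyntaxError:
        return 0.0
```

In practice the check for Text2SQL would instead execute the predicted SQL and compare result sets against the gold query, but the shape of the reward function is the same.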
GRPO: Group Relative Policy Optimization—an RL algorithm that normalizes advantages within a group of sampled outputs for the same prompt to reduce variance
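The group-wise normalization at the heart of GRPO can be sketched in a few lines, assuming one scalar reward per sampled output for the same prompt (function name and epsilon value are illustrative):

```python
from statistics import mean, pstdev

def grpo_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Group-relative advantages: z-score each reward against its own group.

    Outputs that beat the group mean get positive advantage, the rest
    negative; dividing by the group std reduces gradient variance.
    """
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled completions for one prompt, rewarded 1 if correct else 0.
adv = grpo_advantages([1.0, 0.0, 0.0, 1.0])
```

Because the baseline is the group's own mean reward, no separate value network is needed, which is one of GRPO's practical attractions over PPO.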
Text2SQL: The task of converting natural language questions into executable SQL database queries
BIRD: A large-scale, cross-domain dataset for Text2SQL parsing known for containing real-world noise
Gold Annotations: The ground-truth answers or labels provided in a dataset
Symbolic Equivalence: Verifying whether two mathematical expressions represent the same value despite differing surface forms (e.g., 1/2 vs. 0.5)
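For purely numeric answers, this check can be sketched with exact `Fraction` arithmetic from the standard library; a full symbolic check over algebraic expressions would instead use a CAS such as SymPy. The function name is hypothetical:

```python
from fractions import Fraction

def answers_match(a: str, b: str) -> bool:
    """Exact value comparison for numeric answers in fraction or decimal form.

    Fraction parses both '1/2' and '0.5' without floating-point error,
    so equivalent values compare equal regardless of surface form.
    """
    return Fraction(a.strip()) == Fraction(b.strip())
```

The exactness matters: a naive `float` comparison would pass here too, but breaks down once answers involve long decimals or require a tolerance threshold.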
LLM-as-a-Judge: Using a strong Large Language Model (like GPT-5) to evaluate the correctness of outputs from a smaller model
Format Reward: A reward signal given solely for adhering to a specific output structure (e.g., putting the answer in \boxed{}) regardless of correctness
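A format reward of this kind can be sketched as a regex check for a `\boxed{}` answer; the pattern and function name are illustrative:

```python
import re

# Matches \boxed{...} with any non-brace content inside.
BOXED = re.compile(r"\\boxed\{([^{}]*)\}")

def format_reward(response: str) -> float:
    """1.0 if the response contains a \\boxed{...} answer, else 0.0.

    Note the reward ignores what is inside the braces—only the output
    structure is rewarded, never correctness.
    """
    return 1.0 if BOXED.search(response) else 0.0
```

Format rewards are typically combined with a correctness reward during early training, so the model first learns to emit a parseable answer that the verifier can then grade.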