RLVR: Reinforcement Learning with Verifiable Rewards—RL where rewards are determined by objective checks (e.g., code compilation, math answers)
GenRM: Generative Reward Model—a model that evaluates responses by generating a textual critique and score rather than just outputting a scalar value
BRPO: Bootstrapped Relative Policy Optimization—the proposed algorithm that uses a randomly selected response from the current batch as a temporary reference for advantage estimation
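The bootstrapped-reference idea can be sketched as follows. This is a minimal illustration assuming scalar rewards; in the actual method a generative reward model compares responses against the reference rather than subtracting scores, and `brpo_advantages` is a hypothetical helper name.

```python
import random

def brpo_advantages(rewards, rng=None):
    """BRPO-style advantage sketch: pick one response from the current
    batch at random as a temporary reference, then express every
    response's advantage relative to that reference's reward."""
    rng = rng or random.Random(0)
    ref_idx = rng.randrange(len(rewards))  # bootstrapped reference
    ref_reward = rewards[ref_idx]
    return [r - ref_reward for r in rewards]
```

Because the reference is drawn from the batch itself, no fixed external baseline or value function is needed; the reference response always has an advantage of exactly zero.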
GRPO: Group Relative Policy Optimization—an RL algorithm that normalizes rewards within a group of outputs to estimate advantages without a value function
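GRPO's group normalization can be shown in a few lines. This is a sketch of the advantage computation only (the policy-gradient update and clipping are omitted):

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages: normalize each sampled output's reward
    against the group's mean and standard deviation, so no learned
    value function is needed as a baseline."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mu) / sigma for r in rewards]

# Rewards for a group of outputs sampled for the same prompt.
advs = grpo_advantages([1.0, 0.0, 0.5, 1.0])
```

Outputs scoring above the group mean receive positive advantages and are reinforced; below-average outputs are pushed down.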
SFT: Supervised Fine-Tuning—training a model on high-quality instruction-response pairs
Reward Hacking: When an RL agent exploits flaws in the reward model (e.g., by writing longer text) to maximize score without improving actual quality
Bootstrapping: In this context, using the model's own current outputs as a reference point for evaluation, rather than external data
Writing-Zero: The specific model variant trained from a base model using BRPO without prior supervised fine-tuning
Voting@n: A test-time scaling technique where the reward model judges multiple permutations or samples of the same comparison and aggregates the votes into the final score
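A minimal sketch of the voting idea, assuming a pairwise judge; the `judge` callable and `vote_at_n` name are hypothetical stand-ins for the reward model interface:

```python
from collections import Counter

def vote_at_n(judge, candidate_a, candidate_b, n=5):
    """Voting@n sketch: query the judge n times (e.g. over permutations
    or independent samples) and take the majority verdict, reducing
    position bias and sampling noise in the final score."""
    votes = Counter(judge(candidate_a, candidate_b, trial=i) for i in range(n))
    return votes.most_common(1)[0][0]

# Toy judge that always prefers the longer response.
toy_judge = lambda a, b, trial: 'A' if len(a) >= len(b) else 'B'
result = vote_at_n(toy_judge, "longer answer", "short", n=3)  # 'A'
```

In practice the per-trial verdicts come from a generative reward model evaluated under different input orderings or sampling seeds, and the aggregation can be a majority vote (as here) or an average score.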