SFT: Supervised Fine-Tuning—training a model to imitate expert demonstrations (prompts + answers)
RLVR: Reinforcement Learning with Verifiable Rewards—RL where correctness is determined by a rule-based checker (e.g., math answers)
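A rule-based checker of this kind can be sketched in a few lines. The function name and the regex-based answer extraction below are illustrative stand-ins, not the paper's actual verifier; real RLVR pipelines normalize answers far more carefully:

```python
import re

def verifiable_reward(completion: str, gold_answer: str) -> float:
    """Rule-based reward: 1.0 if the completion's final number matches
    the reference answer, else 0.0. (Hypothetical minimal checker.)"""
    # Take the last number in the completion as the candidate answer.
    matches = re.findall(r"-?\d+(?:\.\d+)?", completion)
    if not matches:
        return 0.0
    return 1.0 if matches[-1] == gold_answer else 0.0
```

Because the reward is computed by a deterministic rule rather than a learned reward model, it cannot be gamed the way a preference model can.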
Bilevel Optimization: An optimization problem in which an upper-level problem takes the solution of a lower-level optimization problem as a constraint
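In generic form (the symbols here are illustrative, not the paper's notation): the upper-level objective $F$ is minimized over $x$ subject to $y$ solving the lower-level problem $G$,

```latex
\min_{x} \; F\bigl(x,\, y^{\star}(x)\bigr)
\quad \text{s.t.} \quad
y^{\star}(x) \in \arg\min_{y} G(x, y).
```

The difficulty is that $y^{\star}(x)$ is itself defined by an optimization, so gradients of $F$ with respect to $x$ must account for (or, via results like Danskin's theorem, safely ignore) the dependence of $y^{\star}$ on $x$.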
LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique that freezes the main model weights and trains small rank-decomposition matrices
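The LoRA decomposition is simple enough to show directly. This NumPy sketch follows the common convention ($W$ frozen, adapter $AB$ scaled by $\alpha/r$, $B$ initialized to zero so the adapter starts as a no-op); it is an illustration of the technique, not the paper's implementation:

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16.0):
    """Forward pass with a LoRA adapter: y = x @ (W + (alpha/r) * A @ B).
    W (d_in x d_out) is frozen; only A (d_in x r) and B (r x d_out) train."""
    r = A.shape[1]
    return x @ W + (alpha / r) * (x @ A) @ B

d_in, d_out, r = 8, 4, 2
W = np.random.randn(d_in, d_out)        # frozen pretrained weight
A = np.random.randn(d_in, r) * 0.01     # small random init
B = np.zeros((r, d_out))                # zero init: adapter starts inert
x = np.random.randn(3, d_in)

# With B = 0 the adapted layer exactly matches the frozen layer.
assert np.allclose(lora_forward(x, W, A, B), x @ W)
```

Because only $A$ and $B$ (rank $r \ll \min(d_{\text{in}}, d_{\text{out}})$) receive gradients, the trainable parameter count drops from $d_{\text{in}} \cdot d_{\text{out}}$ to $r(d_{\text{in}} + d_{\text{out}})$.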
Cooperative Gain: The performance advantage of joint SFT-RL training over RL training alone, explicitly maximized by the BRIDGE upper-level objective
Danskin's Theorem: A theorem for computing gradients of functions defined by an inner maximization problem; applied here to differentiate through the RL step
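The statement, in generic notation (symbols illustrative, not the paper's): for a value function defined by an inner maximization, the gradient can be taken at the maximizer as if it were fixed,

```latex
\phi(\theta) \;=\; \max_{w \in W} f(\theta, w),
\qquad
\nabla_{\theta}\, \phi(\theta) \;=\; \nabla_{\theta} f\bigl(\theta, w^{\star}\bigr)
\quad \text{where } w^{\star} \in \arg\max_{w \in W} f(\theta, w).
```

The practical payoff is that one never needs to differentiate $w^{\star}(\theta)$ itself, which is what makes differentiating through an inner RL step tractable.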
PPO: Proximal Policy Optimization—a standard policy-gradient RL algorithm that clips each policy update to stay close to the previous policy
GRPO: Group Relative Policy Optimization—a critic-free RL algorithm that computes advantages relative to a group of sampled completions per prompt, used in reasoning models such as DeepSeek-R1
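The core of GRPO's advantage estimate can be sketched compactly: each completion's reward is normalized by the mean and standard deviation of its group (all completions sampled for the same prompt), replacing a learned value function. This is a sketch of the central idea, not a full GRPO implementation (which also includes the clipped policy-ratio objective and a KL penalty):

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantages: standardize rewards within the group
    of completions sampled for one prompt."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# Four completions for one prompt, two verified correct (reward 1.0):
adv = grpo_advantages([1.0, 0.0, 1.0, 0.0])
# Correct completions receive positive advantage, incorrect negative.
```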
Cold Start: The standard practice of training a model with SFT first to provide a good initialization before starting RL
Catastrophic Forgetting: The tendency of a neural network to abruptly lose previously learned capabilities when trained on new information
SFT-RL Alternating: A simple baseline introduced in the paper that switches between SFT and RL updates without the cooperative bilevel objective
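The alternating baseline amounts to a simple interleaved loop. This sketch uses a hypothetical interface (`sft_step` and `rl_step` are stand-ins for one supervised update and one RL update); the point is that the two phases share no joint objective:

```python
def alternate_sft_rl(model, sft_step, rl_step, num_rounds, k_sft=1, k_rl=1):
    """SFT-RL alternating baseline: run k_sft supervised updates, then
    k_rl RL updates, per round, with no coupling between the objectives."""
    for _ in range(num_rounds):
        for _ in range(k_sft):
            model = sft_step(model)
        for _ in range(k_rl):
            model = rl_step(model)
    return model
```

Contrast with BRIDGE, where the upper level explicitly maximizes the cooperative gain rather than merely interleaving the two update types.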