SFT: Supervised Fine-Tuning—training the model to imitate expert solution trajectories
RL: Reinforcement Learning—training the model to explore and optimize a reward signal, such as a reward for producing correct answers
GRPO: Group Relative Policy Optimization—an RL algorithm that normalizes advantages within a group of outputs for the same prompt, removing the need for a critic model
Syn-SFT-RL: Synchronized SFT-RL—approaches that combine SFT imitation loss and RL reward loss into a single joint optimization loop
Plasticity: The capacity a model retains, after fine-tuning, to further improve its performance during the subsequent RL phase
DAPO: An enhanced RL algorithm based on GRPO that uses asymmetric clipping and dynamic difficulty sampling for better stability
SRFT: A synchronized method combining SFT loss, off-policy exploration, and on-policy rejection sampling
LUFFY: A synchronized method optimizing a mixture of off-policy expert data and on-policy generated data
UPT: A method that gates between SFT and RL objectives based on the model's current reward performance
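The group-relative normalization at the heart of GRPO can be sketched in a few lines: rewards for several sampled outputs to the same prompt are standardized against the group's own mean and standard deviation, so no learned critic is needed to supply a baseline. This is a minimal sketch, not a full implementation; the population-standard-deviation choice and the zero-guard for uniform groups are assumptions, as implementations vary.

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages for one prompt's sampled outputs.

    Each reward is normalized against the group's mean and standard
    deviation, replacing the baseline a critic model would provide.
    """
    mean = statistics.mean(rewards)
    # Guard against division by zero when all rewards in the group are equal
    std = statistics.pstdev(rewards) or 1.0
    return [(r - mean) / std for r in rewards]

# Four sampled outputs for one prompt, rewarded 1.0 if correct, 0.0 otherwise
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))  # → [1.0, -1.0, -1.0, 1.0]
```

Correct outputs receive positive advantages and incorrect ones negative advantages, relative only to their own group.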