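GRPO: Group Relative Policy Optimization, an RL algorithm that samples a group of outputs for the same input and normalizes each output's reward against the group's mean and standard deviation, which reduces variance without a separate learned value model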
Thinking Reward: A learned reward signal that evaluates the quality of the intermediate reasoning process (Chain-of-Thought), not just the final answer
Trustworthiness Weight: A dynamic coefficient that reduces the impact of the thinking reward when it aligns poorly with ground-truth outcomes (e.g., when it assigns high scores to reasoning that ends in a wrong answer)
Annealing: Gradually reducing a parameter (here, the thinking reward weight) over the course of training
SFT: Supervised Fine-Tuning—training a model on labeled examples before applying RL
CoT: Chain-of-Thought—a prompting technique where the model generates intermediate reasoning steps before the final answer
Reward Hacking: When an RL agent exploits flaws in the reward function to get high scores without actually achieving the intended goal
PRM: Process Reward Model—a model trained to evaluate the correctness of individual reasoning steps
VisualPRM: A baseline method extending process rewards to multimodal tasks using step-level supervision
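
To make the GRPO and Annealing entries concrete, here is a minimal sketch in Python. It is illustrative only: the function names, the use of the population standard deviation, the small eps term, and the linear schedule are all assumptions, and none of them is taken from the method these terms describe.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of outputs sampled for the same input.

    Each reward is centered on the group mean and scaled by the group
    standard deviation. eps (an assumed value) avoids division by zero
    when every output in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


def annealed_weight(step, total_steps, initial_weight=1.0):
    """Hypothetical linear annealing schedule for the thinking reward weight.

    The weight starts at initial_weight and decays to zero by total_steps.
    The actual schedule may differ; linear decay is assumed here.
    """
    frac = min(max(step / total_steps, 0.0), 1.0)
    return initial_weight * (1.0 - frac)


# Four outputs for the same input, two rewarded and two not.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])

# Weight on the thinking reward halfway through training.
halfway = annealed_weight(step=50, total_steps=100)
```

In this sketch the rewarded outputs receive positive advantages and the unrewarded ones negative advantages, so the policy update depends only on how an output compares with its own group rather than on the absolute reward scale.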