GRPO: Group Relative Policy Optimization—an RL algorithm that estimates advantages by comparing a group of outputs generated from the same input, removing the need for a critic model
TW-GRPO: Token-Weighted Group Relative Policy Optimization—the proposed method that adds token weighting and soft rewards to GRPO
KL divergence: Kullback-Leibler divergence, an asymmetric measure of how one probability distribution differs from another (not a true distance metric); used here to quantify how much a token distribution differs from the average, serving as a proxy for information density
QAI: Question-Answer Inverse—a data augmentation technique that negates questions (e.g., 'did' -> 'didn't') and inverts answers to create multi-answer samples from single-choice datasets
soft reward: A continuous reward signal in the range 0 to 1 that scales with how correct the answer is (e.g., computed as Intersection over Union), rather than a binary 0/1 signal
intra-group information entropy: A measure of uncertainty or variation among a group of generated responses at a specific token position; high variation implies the token is a critical decision point
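
To make the terms above concrete, here is a minimal Python sketch of three of them: group-relative advantages (GRPO), an IoU-based soft reward, and a per-position KL score against the group-average token distribution. It is an illustration under stated assumptions, not the paper's implementation. All function names are hypothetical. The mean/std normalization is the commonly used GRPO form. Comparing answer sets with IoU, and averaging each response's KL to the group mean, are assumed readings of the definitions above.

```python
import math
from typing import List, Sequence, Set


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward against the mean/std of its own group (no critic needed)."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    return [(r - mean) / (std + eps) for r in rewards]


def soft_reward_iou(predicted: Set[str], correct: Set[str]) -> float:
    """Soft reward in [0, 1]: Intersection over Union of chosen vs. correct options.

    Assumption: multi-answer outputs are compared as sets of option labels.
    """
    union = predicted | correct
    if not union:
        return 0.0
    return len(predicted & correct) / len(union)


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def position_information(dists: Sequence[Sequence[float]]) -> float:
    """Average KL of each response's token distribution to the group mean at one position.

    Assumption: a larger value marks a position where the group's responses
    disagree, i.e. a candidate for a higher token weight.
    """
    g = len(dists)
    vocab = len(dists[0])
    mean = [sum(d[v] for d in dists) / g for v in range(vocab)]
    return sum(kl_divergence(d, mean) for d in dists) / g


if __name__ == "__main__":
    # Four sampled responses to one question, scored with binary rewards.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
    # The model picked {A, B}; the correct set is {A, C}.
    print(soft_reward_iou({"A", "B"}, {"A", "C"}))
    # Two responses disagree strongly at this position.
    print(position_information([[0.9, 0.1], [0.1, 0.9]]))
```

A binary reward would score the {A, B} versus {A, C} case as 0, whereas the IoU form gives partial credit for the one correct option.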