GRPO: Group Relative Policy Optimization—an RL algorithm that estimates advantages by comparing a group of sampled outputs for the same input, eliminating the need for a separate value function network
Ternary Reward: A three-valued reward signal (+1 correct, 0 abstain, -1 incorrect) designed to make abstention a safe middle ground between success and failure
SFT: Supervised Fine-Tuning—training a model to mimic ground-truth answers
R-Tuning: A baseline method that fine-tunes models on datasets where unanswerable questions are explicitly labeled with 'I don't know'
Truthfulness Score: A composite metric defined as w1*Accuracy + w2*Uncertainty - w3*Hallucination, where w1, w2, and w3 are weighting coefficients
RAG: Retrieval-Augmented Generation—providing external documents to the model to aid in answering questions
Hallucination: Plausible but factually incorrect statements generated by the model
Abstention: An explicit refusal by the model to answer (e.g., 'I don't know') when it is uncertain
OOK: Out-of-Knowledge—questions where the model's internal parametric knowledge is insufficient to answer correctly
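
Illustrative sketch: the Ternary Reward, GRPO, and Truthfulness Score entries above describe small computations, and the Python below shows one way they might fit together. Everything beyond the definitions is an assumption made for illustration: the function names, the exact-match correctness check, the abstention string, the use of the population standard deviation to normalize advantages, the reading of Accuracy, Uncertainty, and Hallucination as the fractions of correct, abstained, and incorrect answers, and the default weights.

```python
from statistics import mean, pstdev

# Hypothetical abstention marker; a real system may detect refusals differently.
ABSTAIN = "i don't know"


def ternary_reward(answer: str, gold: str) -> int:
    """Return +1 for a correct answer, 0 for an abstention, -1 otherwise."""
    normalized = answer.strip().lower()
    if normalized == ABSTAIN:
        return 0
    # Assumption: correctness is judged by case-insensitive exact match.
    return 1 if normalized == gold.strip().lower() else -1


def group_relative_advantages(rewards: list[int], eps: float = 1e-8) -> list[float]:
    """Score each sampled output against its own group: (r - mean) / std.

    No value network is involved; the group mean serves as the baseline.
    eps guards against division by zero when all rewards are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def truthfulness_score(rewards: list[int], w1: float = 1.0,
                       w2: float = 0.5, w3: float = 1.0) -> float:
    """w1*Accuracy + w2*Uncertainty - w3*Hallucination.

    Assumption: each term is the fraction of answers in that category,
    and the default weights are placeholders, not values from the source.
    """
    n = len(rewards)
    accuracy = sum(r == 1 for r in rewards) / n
    uncertainty = sum(r == 0 for r in rewards) / n
    hallucination = sum(r == -1 for r in rewards) / n
    return w1 * accuracy + w2 * uncertainty - w3 * hallucination


# One group of four sampled outputs for the same question.
samples = ["Paris", "I don't know", "Lyon", "paris"]
rewards = [ternary_reward(s, "Paris") for s in samples]
advantages = group_relative_advantages(rewards)
score = truthfulness_score(rewards)
```

In this example the abstaining sample receives a reward of 0, which places it between the correct and incorrect samples once the group mean is subtracted. That ordering is how the ternary reward makes abstention a middle ground.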