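GRPO: Group Relative Policy Optimization, an RL algorithm that samples a group of outputs for the same prompt and normalizes each output's reward against the group's mean and standard deviation, which reduces variance without requiring a separate value model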
Parametric Knowledge: Facts stored within the model's neural network weights during pre-training, as opposed to knowledge retrieved from external documents
Knowledge Boundary: The dividing line between what a model actually 'knows' (has stored in weights) and what it fabricates; KLCF aims to keep generation within this boundary
Checklist Reward: A reward signal calculated by comparing the generated text against a pre-computed list of facts the base model is known to possess
Truthfulness Reward: A reward signal from a trained classifier estimating the probability that atomic claims in the output are true
Atomic Claim: A single, indivisible factual statement extracted from a longer sentence (e.g., 'Obama was born in Hawaii')
SFT: Supervised Fine-Tuning—training on labeled examples before RL
Recall: In this context, the percentage of pre-verified 'known' facts (from the checklist) that appear in the generated response
Precision: In this context, the percentage of generated claims that are factually correct
Hallucination Tax: The phenomenon where alignment techniques (like RLHF) degrade a model's factual accuracy by pressuring it to answer questions beyond its knowledge
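
To make three of the definitions above concrete (GRPO's group normalization, checklist Recall, and Precision), here is a minimal Python sketch. It is an illustration under stated assumptions, not the implementation behind these terms: the function names are hypothetical, claims are matched by exact string equality (a real system would need a semantic matcher), the population standard deviation is used for normalization, and `is_correct` stands in for whatever fact verifier is available.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of outputs for the same prompt.

    Assumption: population standard deviation; eps avoids division by zero
    when every output in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def checklist_recall(known_facts, response_claims):
    """Fraction of checklist facts that appear among the response's claims.

    Assumption: exact string matching between facts and atomic claims.
    """
    known = set(known_facts)
    if not known:
        return 0.0
    return len(known & set(response_claims)) / len(known)


def claim_precision(response_claims, is_correct):
    """Fraction of generated atomic claims judged correct by `is_correct`.

    `is_correct` is a hypothetical callable mapping a claim to True/False.
    """
    claims = list(response_claims)
    if not claims:
        return 0.0
    return sum(1 for c in claims if is_correct(c)) / len(claims)


# Usage: four outputs sampled for one prompt, scored 1.0 or 0.0.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])

# Usage: placeholder fact labels stand in for real atomic claims.
checklist = ["fact_a", "fact_b", "fact_c", "fact_d"]
claims = ["fact_a", "fact_b", "fact_x"]
recall = checklist_recall(checklist, claims)                 # 2 of 4 checklist facts
prec = claim_precision(claims, lambda c: c in checklist)     # 2 of 3 claims correct
```

Because the advantages are centered on the group mean, outputs that score above their siblings get a positive signal and the rest a negative one, regardless of the absolute reward scale. Recall and Precision pull in different directions here: adding more claims can raise Recall while lowering Precision if the extra claims fall outside the checklist.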