verifiable QA pairs: Question-Answer pairs where the answer is a short, objective fact (number, date, name) that can be automatically checked for correctness, enabling binary reward signals
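A minimal sketch of how such a binary reward could be computed; the `normalize` helper and the example answers are illustrative assumptions, not the paper's actual grader:

```python
def normalize(text: str) -> str:
    # Lowercase and keep only alphanumerics so "1,912" and "1912 " compare equal
    return "".join(ch for ch in text.lower() if ch.isalnum())

def binary_reward(model_answer: str, gold_answer: str) -> int:
    # 1 if the normalized answers match exactly, else 0 (binary reward signal)
    return int(normalize(model_answer) == normalize(gold_answer))

print(binary_reward("1,912", "1912"))   # exact match after normalization -> 1
print(binary_reward("Paris", "Lyon"))   # mismatch -> 0
```

Exact match after normalization is the simplest verifier; real pipelines may add aliases or numeric tolerance.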
personas: Specific roles (e.g., 'medical expert', 'patient') assigned to the generator to diversify the angles and styles of questions derived from a single document
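A hedged sketch of persona-conditioned question generation; the persona list and template wording are assumptions for illustration:

```python
personas = ["medical expert", "patient", "journalist"]

def qa_generation_prompt(document: str, persona: str) -> str:
    # Each persona yields a different angle on the same source document
    return (
        f"You are a {persona}. Read the document below and write one factual "
        f"question whose answer is a short, objectively checkable fact.\n\n"
        f"{document}"
    )

prompts = [qa_generation_prompt("<document text>", p) for p in personas]
```

One document thus fans out into several stylistically distinct questions.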
imitation learning: Training models to mimic a static dataset (as in standard pretraining), which ties the model's behavior to the training distribution (the 'teacher-forcing' regime)
continual pretraining: Continuing pretraining on new data to update a model's knowledge or adapt it to a new domain; used here as a baseline for measuring the sample efficiency of RL
boilerplate: Standardized, non-informative text sections like navigation bars, headers, or footers in web documents
SFT: Supervised Fine-Tuning—training on labeled examples
CoT: Chain-of-Thought—a prompting strategy where the model generates intermediate reasoning steps
PPO: Proximal Policy Optimization—an RL algorithm used to train the model
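For reference, a minimal sketch of PPO's clipped surrogate loss for a single action; the clipping coefficient `eps = 0.2` is a common default, and a real trainer averages this over batches of tokens:

```python
import math

def ppo_clip_loss(logp_new: float, logp_old: float,
                  advantage: float, eps: float = 0.2) -> float:
    # Probability ratio between the current and the behavior policy
    ratio = math.exp(logp_new - logp_old)
    # Clip the ratio to [1 - eps, 1 + eps] to limit the policy update size
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps)
    # Negated because optimizers minimize; PPO maximizes the surrogate
    return -min(ratio * advantage, clipped * advantage)
```

When the policies agree (`logp_new == logp_old`), the ratio is 1 and the loss reduces to `-advantage`; large ratios are capped by the clip term.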
leakage prevention: A filtering step to ensure the question does not trivially contain the answer, forcing the model to reason or retrieve rather than copy
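A simple leakage filter could be sketched as a substring check after normalization; the function names and example pairs below are assumptions, not the paper's implementation:

```python
def normalize(text: str) -> str:
    # Lowercase, keep alphanumerics and spaces, for robust substring matching
    return "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace())

def leaks_answer(question: str, answer: str) -> bool:
    # True if the answer string appears verbatim in the question
    return normalize(answer) in normalize(question)

pairs = [
    ("In what year did the treaty of 1648 end the war?", "1648"),          # leaked
    ("In what year did the Peace of Westphalia end the war?", "1648"),     # kept
]
kept = [(q, a) for q, a in pairs if not leaks_answer(q, a)]
```

Surviving pairs force the model to recall or reason rather than copy the answer from the question.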