Generative Replay: A continual learning technique in which a model generates samples representing its past knowledge and retrains on them, mitigating catastrophic forgetting
Safety Alignment: Training a model to adhere to safety criteria (e.g., helpfulness and harmlessness), typically by teaching it to refuse malicious queries
KL Divergence: Kullback-Leibler divergence—a statistical measure of how one probability distribution differs from a second, reference probability distribution
SFT: Supervised Fine-Tuning—training a pre-trained model on labeled examples to adapt it to a specific task
Harmful Score: The percentage of model responses to malicious queries that are classified as unsafe by a guardrail model
Perplexity: A measure of how well a probability model predicts a sample; used here to filter out low-quality synthetic text
Guardrail Model: A separate classifier used to evaluate whether an LLM's output is safe or harmful
Alignment Residual: The difference between the model's current behavior and the ideal safety behavior, which GR-SAP aims to minimize
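As a concrete illustration of the KL divergence entry above (a generic sketch, not code from this work), the divergence between two discrete distributions P and Q can be computed directly from its definition, D_KL(P‖Q) = Σᵢ pᵢ log(pᵢ/qᵢ):

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) in nats for two discrete distributions.

    p, q: sequences of probabilities over the same support (each sums to 1).
    Terms with p_i = 0 contribute nothing by convention.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Identical distributions diverge by zero; a point mass vs. a fair
# coin diverges by log 2 nats.
print(kl_divergence([0.5, 0.5], [0.5, 0.5]))  # → 0.0
print(kl_divergence([1.0, 0.0], [0.5, 0.5]))  # → log 2 ≈ 0.693
```

Note the asymmetry: D_KL(P‖Q) ≠ D_KL(Q‖P) in general, which is why the reference distribution must be stated.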
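The perplexity entry above can likewise be made concrete. For a sequence of per-token probabilities assigned by a model, perplexity is the exponential of the average negative log-likelihood; the threshold value below is a hypothetical illustration of the filtering use described in the glossary, not a number from this work:

```python
import math

def perplexity(token_probs):
    """exp of the mean negative log-likelihood over the tokens in a sample."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# A model assigning probability 0.25 to every token has perplexity 4:
# on average it is "as confused as" a uniform choice among 4 tokens.
print(round(perplexity([0.25, 0.25, 0.25]), 6))  # → 4.0

# Hypothetical filtering step: keep only synthetic samples the model
# finds sufficiently predictable (low perplexity).
THRESHOLD = 50.0  # assumed value for illustration
keep = perplexity([0.2, 0.1, 0.3]) < THRESHOLD
```

Lower perplexity means the model predicts the sample more confidently, which is why it serves as a proxy for the fluency of synthetic text.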