BSO: Bi-State Optimization—an iterative training method that alternates between optimizing on alignment data and on user fine-tuning data
Lisa: Lazy Safety Alignment—the proposed method that adds a proximal term to BSO to constrain weight updates
Proximal term: A regularization term in the loss function (usually L2 distance) that penalizes the model for moving too far from a reference point (the previous checkpoint)
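As a concrete illustration of the proximal term defined above, here is a minimal sketch in plain Python. The function name and the use of flat weight lists are illustrative assumptions, not the paper's implementation; the penalty itself is the standard (rho/2) * ||w - w_ref||^2 form.

```python
def proximal_penalty(weights, ref_weights, rho):
    """L2 proximal term: (rho / 2) * ||w - w_ref||^2.

    Added to the training loss, it penalizes the model for drifting
    too far from the reference point (the previous checkpoint).
    """
    return 0.5 * rho * sum((w - r) ** 2 for w, r in zip(weights, ref_weights))


# Example: weights [1.0, 2.0] vs. checkpoint [1.0, 0.0] with rho = 2.0
# gives 0.5 * 2.0 * (0^2 + 2^2) = 4.0.
penalty = proximal_penalty([1.0, 2.0], [1.0, 0.0], 2.0)
```

A larger rho constrains weight updates more tightly to the checkpoint, trading task adaptation for stability.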
Excess drift: The phenomenon where model weights move too far towards a local optimum (e.g., the user task) during alternating optimization, causing forgetting of the other task (safety)
Harmful score: A metric quantifying the percentage or frequency of harmful outputs generated by the model
SFT: Supervised Fine-Tuning—standard training on labeled data
Stationary point: A point in optimization where the gradient is zero (or close to it), indicating convergence
Asymmetric computing: Allocating different amounts of computational resources (steps) to different tasks; here, spending fewer steps on alignment than on user fine-tuning
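The terms above fit together as an alternating loop: BSO switches between the two data states, Lisa adds the proximal term to each state, and asymmetric computing allocates fewer steps to alignment. The toy sketch below shows this structure on scalar quadratic losses; the function names, targets, step counts, and learning rate are all illustrative assumptions, not the paper's actual training code.

```python
def quad_grad(w, target):
    # Gradient of the toy loss 0.5 * ||w - target||^2.
    return [wi - ti for wi, ti in zip(w, target)]


def sgd_step(w, grad, lr=0.1):
    return [wi - lr * gi for wi, gi in zip(w, grad)]


def lisa_round(w, align_target, user_target,
               align_steps=1, user_steps=9, rho=1.0):
    """One BSO round with Lisa's proximal term and asymmetric computing.

    State 1 (alignment) gets fewer steps than state 2 (user fine-tuning);
    each state is regularized toward the checkpoint it started from,
    limiting excess drift toward either local optimum.
    """
    ref = list(w)  # checkpoint entering the alignment state
    for _ in range(align_steps):
        g = quad_grad(w, align_target)
        g = [gi + rho * (wi - ri) for gi, wi, ri in zip(g, w, ref)]
        w = sgd_step(w, g)

    ref = list(w)  # checkpoint entering the user fine-tuning state
    for _ in range(user_steps):
        g = quad_grad(w, user_target)
        g = [gi + rho * (wi - ri) for gi, wi, ri in zip(g, w, ref)]
        w = sgd_step(w, g)
    return w


# Alternate for a few rounds between two conflicting targets.
w = [0.0]
for _ in range(5):
    w = lisa_round(w, align_target=[1.0], user_target=[-1.0])
```

Without the proximal term (rho = 0), each state's steps pull the weights fully toward its own target, which is the excess-drift failure mode the proximal term is meant to dampen.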