DPO: Direct Preference Optimization—an algorithm for aligning LLMs to human preferences without training a separate reward model, using a contrastive loss on the log-probability ratios of preferred and dispreferred responses relative to a frozen reference model
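For concreteness, this is the standard published DPO objective (with $y_w$ the preferred and $y_l$ the dispreferred response):

$$
\mathcal{L}_{\text{DPO}}(\theta) = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]
$$

where $\pi_{\text{ref}}$ is the frozen reference policy, $\sigma$ is the logistic sigmoid, and $\beta$ scales the implicit KL constraint.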
RBF: Radial Basis Function—a kernel function that measures similarity based on distance, effective for capturing local, non-linear relationships
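The most common instance is the Gaussian RBF kernel:

$$
k_{\text{RBF}}(x, x') = \exp\!\left(-\frac{\lVert x - x' \rVert^2}{2\sigma^2}\right)
$$

where the bandwidth $\sigma$ sets how quickly similarity decays with distance; a small $\sigma$ makes the kernel more local.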
HMK: Hierarchical Mixture of Kernels—a proposed method that learns to weight and combine different kernels (local and global) dynamically during training
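As a minimal sketch, assuming HMK follows the usual multiple-kernel-learning pattern of a learned convex combination (the actual parameterization may differ):

$$
k_{\text{HMK}}(x, x') = \sum_i \alpha_i \, k_i(x, x'), \qquad \alpha_i \ge 0, \quad \sum_i \alpha_i = 1
$$

where the base kernels $k_i$ would include local (e.g., RBF) and global (e.g., linear or polynomial) components, and the mixture weights $\alpha_i$ are updated during training.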
HT-SR: Heavy-Tailed Self-Regularization—a theoretical framework used to measure overfitting in neural networks by analyzing the eigenvalue distribution of weight matrices
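Concretely, HT-SR fits the tail of the empirical spectral density of a layer's correlation matrix $W^\top W$ to a power law,

$$
\rho(\lambda) \sim \lambda^{-\alpha},
$$

and uses the fitted exponent $\alpha$ as a diagnostic: heavier tails (smaller $\alpha$) indicate stronger implicit self-regularization.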
KL Divergence: Kullback-Leibler Divergence—a statistical measure of how one probability distribution differs from a second, reference distribution
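For discrete distributions $P$ and $Q$:

$$
D_{\text{KL}}(P \,\|\, Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}
$$

with the sum replaced by an integral in the continuous case; note that it is asymmetric, so $D_{\text{KL}}(P \,\|\, Q) \ne D_{\text{KL}}(Q \,\|\, P)$ in general.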
Null Space: In this context, the subspace of inputs that a weight matrix maps to zero or negligible activation; steering unsafe prompts into this subspace effectively neutralizes them
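In standard linear-algebra terms, the null space of a weight matrix $W$ is

$$
\mathcal{N}(W) = \{\, x : Wx = 0 \,\},
$$

and the usage above extends this idea to routing the representations of unsafe prompts into directions the network maps to (near-)zero.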
PND: Positive-Negative Divergence—a proposed metric to measure the separability of positive and negative preference pairs in the embedding space
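As an illustrative sketch only, assuming PND is instantiated as a between-class separation normalized by within-class spread (the metric's actual definition belongs to the proposal and may differ):

$$
\text{PND} = \frac{\lVert \mu_+ - \mu_- \rVert_2}{\sigma_+ + \sigma_-}
$$

where $\mu_\pm$ and $\sigma_\pm$ are the means and standard deviations of the positive and negative pair embeddings.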