MIA: Membership Inference Attack, an attack that determines whether a specific data point was used to train a machine learning model.
Shadow Model: A model trained by an attacker to mimic the target model's behavior, used to generate training data for an attack classifier.
Reference Model: In this paper, a student model distilled from the target model specifically to accentuate differences between training data (members) and unseen data (non-members).
Knowledge Distillation: Training a smaller student model to reproduce the output probabilities (soft labels) or performance of a larger teacher model.
Hard Label: The ground truth label of the data (e.g., the actual next token in the sequence).
Soft Label: The probability distribution output by the teacher model (target model) for a given input.
LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning method that freezes pre-trained weights and injects trainable rank decomposition matrices.
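PPL: Perplexity, a measure of how well a probability model predicts a sample, computed as the exponential of the average per-token negative log-likelihood; a lower PPL on a sample suggests the model may have seen it during training (i.e., the sample is a member).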
Fused Feature: A combination of multiple scalar features (confidence, entropy, loss) and vector features (hidden-layer representations) used to train the attack model.
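As a rough illustration of how several of the terms above relate, the following sketch computes loss, PPL, confidence, and entropy from a model's per-token soft labels and the corresponding hard labels, then concatenates them with a hidden-state vector into a fused feature. It is not the paper's implementation. The function names (`perplexity`, `entropy`, `scalar_features`, `fuse`), the definition of confidence as the mean maximum probability, and the feature ordering are all assumptions made for illustration.

```python
import math


def perplexity(token_log_probs):
    """PPL = exp(mean negative log-likelihood of the true tokens)."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def entropy(probs):
    """Shannon entropy (in nats) of one soft-label distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def scalar_features(soft_labels, hard_labels):
    """Compute scalar attack features for one sequence.

    soft_labels: list of per-token probability distributions (model output).
    hard_labels: list of ground-truth token ids, one per position.
    Assumes every true token has non-zero probability.
    """
    log_probs = [math.log(dist[t]) for dist, t in zip(soft_labels, hard_labels)]
    n = len(hard_labels)
    return {
        "loss": -sum(log_probs) / n,                         # mean NLL
        "ppl": perplexity(log_probs),
        # Assumption: confidence = mean of the max probability per position.
        "confidence": sum(max(d) for d in soft_labels) / n,
        "entropy": sum(entropy(d) for d in soft_labels) / n,
    }


def fuse(scalars, hidden_vector):
    """Concatenate scalar features with a hidden-state vector (hypothetical layout)."""
    order = ["confidence", "entropy", "loss"]
    return [scalars[k] for k in order] + list(hidden_vector)


if __name__ == "__main__":
    # Toy example: uniform predictions over a 4-token vocabulary.
    dists = [[0.25, 0.25, 0.25, 0.25], [0.25, 0.25, 0.25, 0.25]]
    feats = scalar_features(dists, [0, 3])
    print(feats)
    print(fuse(feats, [0.1, -0.2]))
```

For a uniform distribution over four tokens, the perplexity equals the vocabulary size (4), which is a handy sanity check; a sequence the model has memorized would instead push PPL toward 1.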