Membership Inference Attack against Large Language Model-based Recommendation Systems: A New Distillation-based Paradigm

📝 Paper Summary

Membership Inference Attacks (MIA) | LLM-based Recommendation Systems | Privacy in Large Language Models

The paper proposes a membership inference attack against LLM-based recommenders that uses knowledge distillation with distinct strategies for member and non-member data to create a highly discriminative reference model.

Core Problem

Traditional shadow model-based membership inference attacks (MIAs) are ineffective against LLMs due to the massive scale of training data and the difficulty of mimicking target model behavior.

Why it matters:

Privacy risks: Attackers can determine if specific user interaction records were used to fine-tune a recommendation model, potentially leaking sensitive user history.
Existing shadow models fail because they cannot match the complexity or performance of large target models (LLMs).
Reference-based attacks often require impractical access to the target model's training data distribution.

Concrete Example: In a recommendation scenario, two data samples might differ only by one item in a user's history. Single-feature attacks (like Perplexity thresholds) fail to distinguish them because their textual similarity is high, while shadow models fail because they simply try to mimic the target without emphasizing the member/non-member boundary.

Key Novelty

Distillation-based Reference Model with Feature Fusion

Instead of mimicking the target model (like shadow models), the reference model is distilled to maximize the behavioral difference between member and non-member data.
Uses 'hard labels' (ground truth) to train on members (maximizing performance) and 'soft labels' (teacher logits) on non-members (mimicking teacher behavior regardless of quality), creating a divergence gap.
Combines multiple features (confidence, entropy, loss, hidden states) via an MLP rather than relying on a single metric like perplexity.

Architecture

The two-stage pipeline of the proposed Membership Inference Attack paradigm.

Evaluation Highlights

Outperforms shadow model baselines on T5-base fine-tuned models: shadow models performed near random guessing (AUC ~0.50), while the proposed method reached AUCs of roughly 0.57 to 0.62 across the four datasets.
Fused features consistently outperform individual features (like Loss or Entropy alone) across multiple datasets (Last.FM, MovieLens, Book-Crossing, Delicious).
Demonstrates effectiveness across diverse LLM architectures (T5, GPT-2, LLaMA3) where traditional baselines often fail to beat random guessing.

Breakthrough Assessment

7/10

Offers a clever reframing of the reference model's purpose (maximizing distinction rather than imitation), which addresses a key bottleneck in LLM privacy attacks. Results are strong against baselines, though the gray-box threat model assumes the attacker already holds 5% of the member data plus public non-member data.

⚙️ Technical Details

Problem Definition

Setting: Membership Inference Attack (MIA) against fine-tuned LLM-based recommendation systems in a gray-box setting.

Inputs: A target fine-tuned LLM, a specific data sample (user-item interaction record), and a small set of background knowledge (5% of member data, plus public non-member data).

Outputs: Binary prediction: whether the data sample was in the target model's fine-tuning dataset (Member) or not (Non-member).

Pipeline Flow

Target Model Fine-tuning (Teacher)
Distillation Stage: Train Reference Model (Student)
Feature Extraction
Attack Model Training & Inference

System Modules

Reference Model Distillation (Non-Member) (Stage 1: Reference Model Construction)

Train student model on non-member data using primarily Soft Labels to mimic target model behavior on unseen data.

Model or implementation: Student LLM (e.g., T5-small, GPT-2, LLaMA3-1B)

Reference Model Distillation (Member) (Stage 1: Reference Model Construction)

Continue training student on member data using primarily Hard Labels to maximize performance on seen data.

Model or implementation: Student LLM (same as above)

Feature Extractor (Stage 2: Feature Fusion & Inference)

Extract discriminative statistics from the Reference Model for a given input sample.

Model or implementation: Reference Model (Inference Mode)
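
The scalar statistics named in this summary (confidence, entropy, loss) can all be derived from the reference model's output logits for one prediction step. The following is a minimal sketch under stated assumptions, not the paper's implementation: `scalar_features` is a hypothetical helper, confidence is assumed to be the maximum softmax probability, and the hidden-state vector (which would come from the reference model itself) is not reproduced here.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def scalar_features(logits, target_index):
    """Return confidence, entropy and loss for a single prediction step.

    Assumption: confidence is the maximum softmax probability; the paper
    may define it differently.
    """
    probs = softmax(logits)
    confidence = max(probs)
    # Shannon entropy of the predictive distribution (natural log).
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    # Cross-entropy against the ground-truth index; clamp to avoid log(0).
    loss = -math.log(max(probs[target_index], 1e-12))
    return {"confidence": confidence, "entropy": entropy, "loss": loss}


# Hypothetical logits over a three-token vocabulary.
features = scalar_features([2.0, 0.5, -1.0], target_index=0)
```

For a sequence, these per-step values would typically be averaged over tokens before being passed to the attack model; that aggregation choice is also an assumption.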

Attack Model (Stage 2: Feature Fusion & Inference)

Classify sample as Member or Non-Member based on fused features.

Model or implementation: Logistic Regression Classifier with MLP feature processing
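
To make the classifier step concrete, here is a small logistic regression trained by batch gradient descent. It is an illustrative sketch only: the default of 1000 iterations mirrors the attack_model_iterations value listed in this summary, while the learning rate, the two toy features, and the function names are assumptions.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic_regression(X, y, iterations=1000, lr=0.1):
    """Fit weights and bias by minimizing mean binary cross-entropy."""
    n, n_features = len(X), len(X[0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(iterations):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi  # gradient of the loss w.r.t. the logit
            for j in range(n_features):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_member_probability(w, b, x):
    """Probability that feature vector x belongs to a member sample."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


# Toy features [loss, confidence]; label 1 = member, 0 = non-member.
X = [[0.2, 0.9], [0.3, 0.8], [1.5, 0.4], [1.7, 0.3]]
y = [1, 1, 0, 0]
w, b = train_logistic_regression(X, y)
```

In the described pipeline the input would be the fused feature vector rather than two hand-picked scalars.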

Novel Architectural Elements

Dual-strategy distillation: Using Soft Labels for non-members (to mimic behavior) and Hard Labels for members (to maximize performance) to artificially widen the gap between member/non-member representations in the reference model.
Feature fusion architecture: An MLP-based projection that upscales scalar features (loss, entropy) to match the dimensionality of hidden layer vectors before concatenation.
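
The fusion step can be pictured as projecting the scalar features up to the hidden dimension and then concatenating the result with the hidden-state vector. The sketch below assumes a single linear layer with ReLU as the projection, untrained random weights, and a hidden size of 8; all of these, along with the function names, are illustrative choices rather than details taken from the paper.

```python
import random


def linear(x, weights, bias):
    """Apply y = W x + b, with W stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(v):
    return [max(0.0, a) for a in v]


def make_projection(in_dim, out_dim, rng):
    """Randomly initialized (untrained) projection parameters."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
               for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weights, bias


def fuse(scalars, hidden_state, projection):
    """Upscale scalars to the hidden size, then concatenate with hidden_state."""
    weights, bias = projection
    projected = relu(linear(scalars, weights, bias))
    return projected + hidden_state  # list concatenation


rng = random.Random(0)
hidden_dim = 8  # hypothetical; real hidden sizes are much larger
projection = make_projection(3, hidden_dim, rng)
hidden_state = [0.05 * i for i in range(hidden_dim)]  # placeholder vector
fused = fuse([0.91, 0.35, 0.12], hidden_state, projection)  # confidence, entropy, loss
```

Projecting before concatenation keeps three scalars from being swamped by a hidden vector with hundreds of dimensions, which appears to be the motivation for the upscaling described above.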

Modeling

Base Model: Evaluated on T5 (Encoder-Decoder), GPT-2 (Decoder-only), and LLaMA3 (Decoder-only)

Training Method: Knowledge Distillation (KD) and LoRA Fine-tuning

Objective Functions:

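Purpose: Distill non-member behavior.
Formally: Loss = (1-α)*HardLoss + α*(T² * SoftLoss), where HardLoss is Cross-Entropy against the ground-truth labels, SoftLoss is the KL divergence between teacher and student output distributions, T is the distillation temperature, and α weights the soft term (typically large for non-members).

Purpose: Distill member behavior.
Formally: Same formula with a different α (typically small for members, so the Hard Label term dominates).

Purpose: Optimize attack classifier.
Formally: Logistic Regression objective.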
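
The combined distillation objective above can be sketched for a single prediction step as follows. This is a hedged illustration, not the authors' code: it assumes the KL term is KL(teacher || student) computed on temperature-softened distributions, that the hard loss uses temperature 1, and the example α and T values are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled, numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, target_index,
                      alpha, temperature):
    """Loss = (1 - alpha) * HardLoss + alpha * (T^2 * SoftLoss)."""
    # Hard loss: cross-entropy of the student against the ground truth.
    hard_loss = -math.log(softmax(student_logits)[target_index])
    # Soft loss: KL(teacher || student) on softened distributions (assumed direction).
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    soft_loss = sum(pt * math.log(pt / ps)
                    for pt, ps in zip(p_teacher, p_student) if pt > 0.0)
    return (1.0 - alpha) * hard_loss + alpha * (temperature ** 2) * soft_loss


student = [1.0, 0.2, -0.5]
teacher = [2.0, 0.1, -1.0]
# Non-member batch: large alpha, so the soft (teacher-mimicking) term dominates.
loss_non_member = distillation_loss(student, teacher, 0, alpha=0.9, temperature=2.0)
# Member batch: small alpha, so the hard (ground-truth) term dominates.
loss_member = distillation_loss(student, teacher, 0, alpha=0.1, temperature=2.0)
```

Setting alpha to 0 recovers plain cross-entropy training, and setting it to 1 recovers pure teacher imitation; the dual strategy sits between those extremes with a different weighting per data group.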

Adaptation: LoRA (Low-Rank Adaptation) used for fine-tuning the target models

Training Data:

Datasets: Last.FM, MovieLens, Book-Crossing, Delicious
Split: 20% Non-member, 80% Member (for target training)
Attacker Knowledge: 5% of Member data, plus public Non-member data
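
The split described above can be reproduced on any list of interaction records. The sketch below is illustrative: it reads "5% of Member data" as 5% of the member set, and `split_for_attack` plus its parameter names are hypothetical. How the attacker's public non-member data relates to the 20% held-out portion is not specified in this summary, so the function returns the held-out set without further subsampling.

```python
import random


def split_for_attack(records, member_fraction=0.8,
                     known_member_fraction=0.05, seed=0):
    """Split records into members, non-members, and attacker-known members."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_members = round(len(shuffled) * member_fraction)
    members = shuffled[:n_members]          # used to fine-tune the target
    non_members = shuffled[n_members:]      # never seen by the target
    n_known = round(len(members) * known_member_fraction)
    known_members = members[:n_known]       # attacker background knowledge
    return members, non_members, known_members


members, non_members, known_members = split_for_attack(list(range(1000)))
```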

Key Hyperparameters:

distillation_epochs: 5
attack_model_iterations: 1000
background_knowledge_ratio: 5%
distillation_alpha_member: Varied (0 to 1), typically high weight on hard loss
distillation_alpha_nonmember: Varied, typically high weight on soft loss

Compute: Eight A3090 GPUs

Comparison to Prior Work

vs. Shadow Model: Shadow models try to mimic the target on ALL data; this method intentionally pushes member/non-member behavior apart in the reference model.
vs. Loss/min-k%: These rely on individual features (scalars); this method fuses multiple features (scalar + vector) extracted from a specially distilled reference model.
vs. Reference-based MIA (Mattern et al.) [not cited in paper]: This method does not require access to the target model's training data distribution, only a small subset of background knowledge.

Limitations

Requires a small subset (5%) of actual member data as background knowledge (Gray-box assumption).
Attack effectiveness depends on the student model architecture being similar to the teacher (target) model; heterogeneous student/teacher pairs perform worse.
Vocabulary truncation is required when student/teacher vocabularies differ, leading to potential information loss.

Reproducibility

Code: https://github.com/Cherie212/MIA4LLMRS.git

Datasets are standard recommendation benchmarks converted to natural language instructions. Specific hyperparameters such as LoRA rank and learning rate are not detailed in the main text.

📊 Experiments & Results

Evaluation Setup

Membership Inference on fine-tuned LLMs for recommendation tasks.

Benchmarks:

Last.FM (Recommendation (converted to text instruction)) [New]
MovieLens (Recommendation (converted to text instruction))
Book-Crossing (Recommendation (converted to text instruction))
Delicious (Recommendation (converted to text instruction)) [New]

Metrics:

AUC (Area Under ROC Curve)
Accuracy
Recall
F1 Score
Statistical methodology: Not explicitly reported in the paper
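
Since AUC is the headline metric, it helps to recall what it measures: the probability that a randomly chosen member receives a higher attack score than a randomly chosen non-member, with ties counted as half. A minimal pairwise sketch (the function name is illustrative, and the quadratic loop is only suitable for small samples):

```python
def roc_auc(scores, labels):
    """AUC via pairwise comparison of member (1) and non-member (0) scores."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one member and one non-member")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


# One member (score 0.2) is ranked below one non-member (score 0.6).
auc = roc_auc([0.9, 0.2, 0.6, 0.1], [1, 1, 0, 0])
```

An AUC of 0.5 corresponds to random guessing, which is the reference point for the shadow-model results discussed in this summary.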

Key Results

The proposed method significantly outperforms baselines across all datasets using T5-base as the target model.
Benchmark	Metric	Baseline	This Paper	Δ
Last.FM (T5-base)	AUC	0.5050	0.6214	+0.1164
MovieLens (T5-base)	AUC	0.5283	0.6092	+0.0809
Book-Crossing (T5-base)	AUC	0.5215	0.5694	+0.0479
Delicious (T5-base)	AUC	0.4996	0.5841	+0.0845

Ablation study on feature fusion shows that combining features yields better AUC than using single features like Loss or Entropy.
Benchmark	Metric	Baseline	This Paper	Δ
Last.FM (T5)	AUC	0.5562	0.6214	+0.0652

Experiment Figures

Kernel Density Estimation (KDE) plots of feature distributions (Confidence, Entropy, Loss) for Member vs Non-member data across Shadow Model, Raw Model, and Reference Model.

Impact of the distillation parameter alpha (weight between hard and soft loss) on attack AUC.

Main Takeaways

Shadow model-based attacks are ineffective against LLMs, often performing no better than random guessing (AUC ~0.5), likely due to the difficulty of mimicking large models.
The proposed distillation strategy successfully increases the distributional divergence between members and non-members compared to raw or shadow models.
Fused features (combining scalar metrics with hidden states) provide robustness against cases where members/non-members are textually similar.
Student models with architectures similar to teachers (e.g., T5-small distilling T5-base) achieve better attack performance than heterogeneous pairs (e.g., LLaMA distilling T5).

📚 Prerequisite Knowledge

Prerequisites

Membership Inference Attack (MIA) concepts
Knowledge Distillation (Teacher-Student training)
Fine-tuning of Large Language Models (LLMs) for recommendation
Metrics like Perplexity (PPL) and Entropy

Key Terms

MIA: Membership Inference Attack—an attack determining if a specific data point was used to train a machine learning model.

Shadow Model: A model trained by an attacker to mimic the target model's behavior, used to generate training data for an attack classifier.

Reference Model: In this paper, a student model distilled from the target model specifically to accentuate differences between training data (members) and unseen data (non-members).

Knowledge Distillation: Training a smaller student model to reproduce the output probabilities (soft labels) or performance of a larger teacher model.

Hard Label: The ground truth label of the data (e.g., the actual next token in the sequence).

Soft Label: The probability distribution output by the teacher model (target model) for a given input.

LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning method that freezes pre-trained weights and injects trainable rank decomposition matrices.

PPL: Perplexity, a measure of how well a probability model predicts a sample; in membership inference, a lower PPL on a sample is commonly treated as evidence that the model saw it during training (member).

Fused Feature: A combination of multiple scalar features (confidence, entropy, loss) and vector features (hidden layers) used to train the attack model.
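
As a worked illustration of the PPL term above, perplexity is the exponential of the average negative log-likelihood per token. The sketch below assumes natural-log token probabilities are already available from the model; the threshold is a hypothetical value, not one reported in the paper.

```python
import math


def perplexity(token_log_probs):
    """exp of the mean negative log-probability over the sequence tokens."""
    if not token_log_probs:
        raise ValueError("token_log_probs must not be empty")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# A sequence where every token was assigned probability 0.5.
ppl = perplexity([math.log(0.5)] * 4)

# Single-feature threshold attack (hypothetical threshold).
threshold = 3.0
predicted_member = ppl < threshold
```

This is the kind of single-feature attack that the summary describes as struggling when member and non-member samples differ by only one item.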