Enabling On-Device Large Language Model Personalization with Self-Supervised Data Selection and Synthesis

📝 Paper Summary

Linear memory (buffer management) · Conversational personalization

A framework for on-device LLM personalization selects representative data via unsupervised metrics and synthesizes semantic variations to enable efficient fine-tuning with limited storage and sparse annotation.

Core Problem

On-device personalization faces conflicting constraints: limited storage prevents keeping all user data, privacy prevents cloud offloading, and user annotations must remain sparse to avoid annoyance.

Why it matters:

Generic pre-trained models fail to adapt to individual user contexts, preferences, and unique interaction habits in real time
Standard fine-tuning assumes large storage and IID (Independent and Identically Distributed) data sampling, neither of which holds for streaming edge data
Existing continual learning methods struggle with temporally correlated streams where data value varies significantly over time

Concrete Example: A user interacts with a robot assistant. The stream contains repetitive, low-value 'uncontroversial dialogue' before switching to a useful, unique interaction. Standard buffers might fill up with the repetitive data due to temporal correlation, discarding the unique interaction and preventing personalization.
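The failure mode in this example can be reproduced with a toy simulation. The sketch below is illustrative only: the stream contents, the scalar "value" attached to each item, and the two buffer policies are assumptions made for the demonstration, not the paper's method.

```python
from collections import deque

# Hypothetical stream: a long run of repetitive small talk, one unique
# request, then more small talk. The second field is an assumed value score.
stream = (
    [("small talk", 0.1)] * 20
    + [("unique request", 0.9)]
    + [("small talk", 0.1)] * 5
)

def fifo_buffer(items, capacity):
    """Keep only the most recent `capacity` items, ignoring their value."""
    buf = deque(maxlen=capacity)
    for item in items:
        buf.append(item)
    return list(buf)

def value_buffer(items, capacity):
    """Keep items by value: replace the lowest-valued entry when beaten."""
    buf = []
    for item in items:
        if len(buf) < capacity:
            buf.append(item)
            continue
        worst = min(range(len(buf)), key=lambda i: buf[i][1])
        if item[1] > buf[worst][1]:
            buf[worst] = item
    return buf

fifo = fifo_buffer(stream, capacity=4)
selected = value_buffer(stream, capacity=4)
fifo_keeps_unique = any(text == "unique request" for text, _ in fifo)
selected_keeps_unique = any(text == "unique request" for text, _ in selected)
print("FIFO keeps unique item:", fifo_keeps_unique)
print("Value-based buffer keeps unique item:", selected_keeps_unique)
```

With a recency-only policy the unique interaction is pushed out by the small talk that follows it, while a policy that compares item value retains it.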

Key Novelty

Self-Supervised Data Selection and Synthesis (SDSS)

Selects data for a small memory buffer using three unsupervised metrics: entropy (information content), domain score (relevance), and dissimilarity (uniqueness vs. buffer)
Augments the small selected dataset by prompting the LLM to synthesize multiple semantically similar question-answer pairs, acting as a data multiplier without user effort

Architecture

The three-stage framework: (1) Data Selection using quality metrics, (2) Data Synthesis using the LLM, and (3) Fine-tuning.

Evaluation Highlights

Achieves up to 38% higher ROUGE-1 than vanilla baselines on datasets such as ALPACA and MedDialog
Demonstrates improved learning speed and content-generating accuracy by fine-tuning only on high-value, representative data rather than random samples

Breakthrough Assessment

7/10

First framework specifically targeting on-device LLM personalization with a complete pipeline for selection and synthesis, though primarily an engineering integration of known concepts.

⚙️ Technical Details

Problem Definition

Setting: Online fine-tuning of an LLM on a sequence of user-generated dialogue pairs under storage and annotation constraints

Inputs: A stream of unlabeled dialogue sets T, each a (question, answer) pair

Outputs: Personalized LLM parameters fine-tuned on a small buffer of selected data

Pipeline Flow

Input Stream Processing -> Quality Metric Calculation
Buffer Management -> User Annotation (Sparse)
Data Synthesis -> Fine-tuning

System Modules

Metric Calculator (Data Selection)

Computes EOE, DSS, and IDD scores for each incoming dialogue set to evaluate its potential value

Model or implementation: Pretrained transformer model (for embedding generation)
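A minimal sketch of how the three scores might be computed follows. The exact formulas are assumptions based on the descriptions in this summary: `eoe` uses the normalized entropy of the empirical token distribution as a stand-in, `dss` averages the token overlap with each lexicon, and `idd` averages cosine dissimilarity against buffer entries in the same domain. The toy vectors stand in for embeddings from the pretrained transformer, and all function names are hypothetical.

```python
import math
from collections import Counter

def eoe(tokens):
    """Normalized entropy of the token distribution (assumed stand-in for EOE)."""
    n = len(tokens)
    if n < 2:
        return 0.0
    counts = Counter(tokens)
    h = -sum((c / n) * math.log(c / n) for c in counts.values())
    return h / math.log(n)  # 1.0 when every token is distinct

def dss(tokens, lexicons):
    """Return (mean lexicon overlap, dominant domain) for a token list."""
    n = len(tokens)
    if n == 0 or not lexicons:
        return 0.0, None
    per_domain = {
        name: sum(t in lex for t in tokens) / n for name, lex in lexicons.items()
    }
    dominant = max(per_domain, key=per_domain.get)
    return sum(per_domain.values()) / len(lexicons), dominant

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def idd(embedding, same_domain_embeddings):
    """Mean cosine dissimilarity to buffer entries sharing the dominant domain."""
    if not same_domain_embeddings:
        return 1.0  # assumption: nothing to compare against counts as fully novel
    total = sum(1.0 - cosine(embedding, e) for e in same_domain_embeddings)
    return total / len(same_domain_embeddings)

# Toy usage with hypothetical lexicons and 2-d "embeddings".
lexicons = {"medical": {"fever", "cough"}, "emotion": {"sad", "happy"}}
tokens = "i have a fever and a cough".split()
print(eoe(tokens), dss(tokens, lexicons), idd([1.0, 0.0], [[0.6, 0.8]]))
```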

Buffer Manager (Data Selection)

Maintains the data buffer by replacing the lowest-quality existing data with new high-quality data

Model or implementation: Heuristic Policy (Greedy replacement)
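The greedy replacement step could look roughly like the sketch below. This summary does not spell out how the three scores are combined, so the rule used here is an assumption: a new entry replaces a stored one only if it scores strictly higher on all three metrics, and among such candidates the one with the lowest score sum is evicted. `Entry` and `Buffer` are hypothetical names.

```python
from dataclasses import dataclass

@dataclass
class Entry:
    text: str
    scores: tuple  # (EOE, DSS, IDD), as computed by the metric calculator

class Buffer:
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = []

    def offer(self, new):
        """Try to admit `new`; return True if it was stored."""
        if len(self.entries) < self.capacity:
            self.entries.append(new)
            return True
        # Stored entries that `new` beats on every metric (assumed rule).
        dominated = [
            i for i, old in enumerate(self.entries)
            if all(n > o for n, o in zip(new.scores, old.scores))
        ]
        if not dominated:
            return False
        # Evict the weakest dominated entry, judged by its score sum.
        weakest = min(dominated, key=lambda i: sum(self.entries[i].scores))
        self.entries[weakest] = new
        return True

buf = Buffer(capacity=2)
buf.offer(Entry("hi", (0.2, 0.0, 0.1)))
buf.offer(Entry("how are you", (0.3, 0.1, 0.2)))
rejected = buf.offer(Entry("ok", (0.1, 0.0, 0.0)))
accepted = buf.offer(Entry("my knee hurts after running", (0.8, 0.5, 0.9)))
print(rejected, accepted, [e.text for e in buf.entries])
```

A weighted sum of the three scores would be an equally plausible way to rank entries; the all-metrics rule is simply stricter about what gets admitted.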

Data Synthesizer

Generates additional training samples semantically similar to the selected buffer data to improve fine-tuning stability

Model or implementation: The LLM itself (Self-generated instruction)
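One way the synthesis step might be wired up is sketched below. The prompt wording, the `Q: ... | A: ...` output format, the `min_score` threshold, and the `generate` and `score` callables are all assumptions for illustration. In a real system `generate` would call the on-device LLM and `score` would be the similarity check applied to synthesized pairs; here a canned response and a crude word-overlap function stand in for both.

```python
from typing import Callable, List, Tuple

# Hypothetical prompt template; the paper's actual prompt is not given here.
PROMPT = (
    "Rewrite the following question-answer pair {k} times. Keep the meaning, "
    "change the wording. Put each on its own line as 'Q: ... | A: ...'.\n"
    "Q: {q} | A: {a}"
)

def parse_pairs(raw: str) -> List[Tuple[str, str]]:
    """Extract (question, answer) pairs from lines of the assumed format."""
    pairs = []
    for line in raw.splitlines():
        if line.startswith("Q:") and "| A:" in line:
            q, a = line.split("| A:", 1)
            pairs.append((q[2:].strip(), a.strip()))
    return pairs

def synthesize(
    q: str,
    a: str,
    generate: Callable[[str], str],
    score: Callable[[str, str], float],
    k: int = 3,
    min_score: float = 0.2,
) -> List[Tuple[str, str]]:
    """Ask the model for k variations and drop those that drift too far."""
    raw = generate(PROMPT.format(k=k, q=q, a=a))
    original = f"{q} {a}"
    return [
        (sq, sa) for sq, sa in parse_pairs(raw)
        if score(original, f"{sq} {sa}") >= min_score
    ]

def fake_llm(prompt: str) -> str:
    """Canned output standing in for the on-device LLM."""
    return (
        "Q: What eases a sore knee? | A: Rest and ice it.\n"
        "Q: How do I help a sore knee? | A: Rest it and apply ice.\n"
        "Q: What is the capital of France? | A: Paris."
    )

def overlap(ref: str, cand: str) -> float:
    """Crude word-overlap ratio, used here only as a placeholder score."""
    r, c = set(ref.lower().split()), set(cand.lower().split())
    return len(r & c) / max(len(r), 1)

pairs = synthesize(
    "How do I treat a sore knee?", "Rest it and apply ice.", fake_llm, overlap
)
print(pairs)
```

The off-topic third line is filtered out because it shares no words with the original pair, which is the kind of drift a sanity check is meant to catch.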

Novel Architectural Elements

Integration of three specific unsupervised metrics (Entropy, Domain, Dissimilarity) for buffer replacement in streaming text
Coupling buffer selection with immediate self-supervised synthesis to artificially expand the effective training set size on edge devices

Modeling

Base Model: Llama-3B (example cited for edge deployment)

Training Method: Supervised Fine-tuning (SFT) on selected buffer data

Objective Functions:

Purpose: Maximize data quality in buffer.

Formally: Maximize EOE(T), DSS(T), and IDD(T) via replacement policy.

Training Data:

Datasets: ALPACA, DOLLY, MedDialog, Prosocial-Dialog, OPENORCA, Empathetic-Dialog
Simulated streaming data with temporal correlations
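A temporally correlated stream can be simulated by emitting samples in contiguous same-domain runs rather than shuffling them uniformly. The sketch below is one assumed way to do this; the function name, the `run_length` parameter, and the toy data are hypothetical and do not reflect the paper's actual protocol.

```python
import random

def correlated_stream(samples_by_domain, run_length, seed=0):
    """Return (domain, sample) pairs arranged in contiguous same-domain runs."""
    rng = random.Random(seed)
    pools = {d: list(s) for d, s in samples_by_domain.items()}
    for pool in pools.values():
        rng.shuffle(pool)
    stream = []
    while any(pools.values()):
        # Pick a domain that still has samples, then emit one run from it.
        domain = rng.choice([d for d, p in pools.items() if p])
        for _ in range(min(run_length, len(pools[domain]))):
            stream.append((domain, pools[domain].pop()))
    return stream

data = {
    "medical": [f"m{i}" for i in range(6)],
    "general": [f"g{i}" for i in range(6)],
}
stream = correlated_stream(data, run_length=3)
print([domain for domain, _ in stream])
```

Larger values of `run_length` give stronger temporal correlation; a value of 1 approaches an interleaved, nearly IID ordering.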

Compute: Not reported in the paper

Comparison to Prior Work

vs. Random Sampling: Selects data based on semantic content and information density rather than uniform probability
vs. Standard Continual Learning: Incorporates data synthesis to augment the small buffer, addressing data scarcity on edge devices
vs. Cloud-based Fine-tuning [not cited in paper]: Performs all selection and training locally to preserve privacy

Limitations

Requires a pre-stored dictionary for Domain Specific Score (DSS) calculation
Data synthesis relies on the LLM's own capability, which might be limited on smaller edge models
Sanity check for synthesis uses ROUGE-1, which may not capture semantic errors effectively

Reproducibility

No code is provided. The paper describes the metrics (EOE, DSS, IDD) mathematically but specifies neither the pre-trained transformer used for embedding nor the fine-tuning hyperparameters (learning rate, batch size).

📊 Experiments & Results

Evaluation Setup

Simulated on-device learning using streaming datasets with varying temporal correlation

Benchmarks:

ALPACA (Instruction Following)
MedDialog (Medical Dialogue)
DOLLY (Instruction Following)
Prosocial-Dialog (Safety/Social Dialogue)

Metrics:

ROUGE-1
Statistical methodology: Not explicitly reported in the paper
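For reference, ROUGE-1 can be computed from clipped unigram counts. The sketch below is a simplified version that assumes lowercased whitespace tokenization; standard implementations typically add their own tokenization and optional stemming, so absolute values may differ. The summary does not state whether the paper reports recall, precision, or F1.

```python
from collections import Counter

def rouge1(reference, candidate):
    """Unigram-overlap precision, recall, and F1 between two strings."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    # Counter intersection clips each word to its minimum count in both texts.
    overlap = sum((ref & cand).values())
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(cand.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}

result = rouge1("rest the knee and apply ice", "apply ice and rest")
print(result)
```

In this example all four candidate words appear in the six-word reference, so precision is 1.0 and recall is 4/6.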

Main Takeaways

Proposed framework consistently outperforms vanilla baselines (up to 38% higher ROUGE-1) across multiple datasets including ALPACA and MedDialog
The combination of Entropy, Domain Score, and Dissimilarity effectively selects high-value data from temporally correlated streams
Self-supervised data synthesis successfully augments the small buffer, allowing for effective fine-tuning without requiring large-scale storage
The method is robust across dialogue domains, from general instruction following (ALPACA) to specialized domains (MedDialog)

📚 Prerequisite Knowledge

Prerequisites

Understanding of Large Language Model fine-tuning
Basics of Continual Learning (replay buffers)
Text embedding and cosine similarity

Key Terms

EOE: Entropy of Embedding, a metric measuring the information content of a text vector based on token probability distribution

DSS: Domain Specific Score, which measures the overlap between text tokens and pre-defined lexicons for specific domains (e.g., medical, emotion)

IDD: In-Domain Dissimilarity, which measures how distinct a new data point is compared to existing buffer data that shares the same dominant domain

IID: Independent and Identically Distributed, a statistical assumption often violated by streaming user data, which is temporally correlated

ROUGE-1: Recall-Oriented Understudy for Gisting Evaluation, a metric measuring the overlap of unigrams (single words) between generated text and reference text

Self-supervised: Learning or selecting data without explicit human labels, often using intrinsic properties of the data itself

Dialogue Set: The atomic unit of data selection, consisting of a question and an answer pair