Ruiyang Qin, Jun Xia, Zhenge Jia, Meng Jiang, Ahmed Abbasi, Peipei Zhou, Jingtong Hu, Yiyu Shi
University of Notre Dame, University of Pittsburgh
Design Automation Conference (2023)
Memory
P13N
📝 Paper Summary
Linear memory (buffer management)
Conversational personalization
A framework for on-device LLM personalization selects representative data via unsupervised metrics and synthesizes semantic variations to enable efficient fine-tuning with limited storage and sparse annotation.
Core Problem
On-device personalization faces conflicting constraints: limited storage prevents keeping all user data, privacy prevents cloud offloading, and user annotations must remain sparse to avoid annoyance.
Why it matters:
Generic pre-trained models fail to adapt to individual user contexts, preferences, and unique interaction habits in real time
Standard fine-tuning assumes large storage and IID (Independent and Identically Distributed) data sampling, neither of which holds for streaming edge data
Existing continual learning methods struggle with temporally correlated streams where data value varies significantly over time
Concrete Example: A user interacts with a robot assistant. The stream contains repetitive, low-value 'uncontroversial dialogue' before switching to a useful, unique interaction. Standard buffers might fill up with the repetitive data due to temporal correlation, discarding the unique interaction and preventing personalization.
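The failure mode in the concrete example can be reproduced with a toy stream. The sketch below is illustrative only: the dialogue strings, the buffer size, and both helper functions are hypothetical, and the exact-match check in `dedup_buffer` is a crude stand-in for a dissimilarity test, not the paper's selection rule.

```python
from collections import deque

def fifo_buffer(stream, capacity):
    """Naive buffer: always keep the most recent `capacity` items."""
    buf = deque(maxlen=capacity)
    for item in stream:
        buf.append(item)
    return list(buf)

def dedup_buffer(stream, capacity):
    """Toy alternative: admit an item only if it is not already stored."""
    buf = []
    for item in stream:
        if item not in buf and len(buf) < capacity:
            buf.append(item)
    return buf

# Temporally correlated stream: a run of small talk, one unique
# interaction, then more small talk.
small_talk = ("How are you?", "I am fine.")
unique = ("Remind me to take my pills at 8.", "Okay, reminder set for 8.")
stream = [small_talk] * 5 + [unique] + [small_talk] * 5

print(unique in fifo_buffer(stream, 3))   # False: the unique item was evicted
print(unique in dedup_buffer(stream, 3))  # True: the unique item is retained
```

With a three-slot buffer, the recency-based policy ends up holding three copies of the same small talk, while even a trivial content-aware policy keeps the one interaction worth learning from.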
Key Novelty
Self-Supervised Data Selection and Synthesis (SDSS)
Selects data for a small memory buffer using three unsupervised metrics: entropy (information content), domain score (relevance), and dissimilarity (uniqueness vs. buffer)
Augments the small selected dataset by prompting the LLM to synthesize multiple semantically similar question-answer pairs, acting as a data multiplier without user effort
Architecture
The three-stage framework: (1) Data Selection using quality metrics, (2) Data Synthesis using the LLM, and (3) Fine-tuning.
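The three stages can be sketched as a small pipeline. This is a minimal sketch under stated assumptions, not the authors' implementation: the summary does not give the buffer replacement rule, so "replace the lowest-scoring item" is an assumption, and `score`, `paraphrase`, and `fine_tune` are hypothetical callables that the caller supplies (for example, a combined quality metric, an LLM prompt wrapper, and a training routine).

```python
from typing import Callable, Iterable, List, Tuple

Dialogue = Tuple[str, str]  # (question, answer)

def select(stream: Iterable[Dialogue],
           score: Callable[[Dialogue, List[Dialogue]], float],
           capacity: int) -> List[Dialogue]:
    """Stage 1: keep at most `capacity` dialogue sets, evicting the lowest score.

    The eviction rule is an assumption made for illustration.
    """
    buffer: List[Tuple[float, Dialogue]] = []
    for d in stream:
        s = score(d, [kept for _, kept in buffer])
        if len(buffer) < capacity:
            buffer.append((s, d))
            continue
        worst = min(range(len(buffer)), key=lambda i: buffer[i][0])
        if s > buffer[worst][0]:
            buffer[worst] = (s, d)
    return [d for _, d in buffer]

def synthesize(buffer: List[Dialogue],
               paraphrase: Callable[[Dialogue, int], List[Dialogue]],
               n_variants: int) -> List[Dialogue]:
    """Stage 2: add `n_variants` generated variations per selected dialogue set."""
    augmented = list(buffer)
    for d in buffer:
        augmented.extend(paraphrase(d, n_variants))
    return augmented

def personalize(stream, score, paraphrase, fine_tune, capacity=2, n_variants=2):
    """Stage 3: fine-tune on the selected and synthesized data."""
    selected = select(stream, score, capacity)
    return fine_tune(synthesize(selected, paraphrase, n_variants))

# Toy stand-ins so the sketch runs without a model.
toy_score = lambda d, buf: len(set(d[1].split()))            # distinct answer words
toy_paraphrase = lambda d, n: [(f"{d[0]} (v{i + 1})", d[1]) for i in range(n)]
toy_fine_tune = len                                          # "trains" by counting

toy_stream = [("q1", "a a a"), ("q2", "a b c"), ("q3", "a b"), ("q4", "a b c d")]
print(select(toy_stream, toy_score, capacity=2))
print(personalize(toy_stream, toy_score, toy_paraphrase, toy_fine_tune))
```

Passing the three components in as callables keeps the sketch independent of any particular model or training library.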
Evaluation Highlights
Achieves up to 38% higher ROUGE-1 score compared to vanilla baselines on datasets like ALPACA and MedDialog
Demonstrates improved learning speed and content-generating accuracy by fine-tuning only on high-value, representative data rather than random samples
Breakthrough Assessment
7/10
First framework specifically targeting on-device LLM personalization with a complete pipeline for selection and synthesis, though primarily an engineering integration of known concepts.
⚙️ Technical Details
Problem Definition
Setting: Online fine-tuning of an LLM on a sequence of user-generated dialogue pairs under storage and annotation constraints
Inputs: Streaming unlabeled dialogue sets T (Question, Answer)
Outputs: Personalized LLM parameters fine-tuned on a small buffer of selected data
Data: Simulated streaming data with temporal correlations
Compute: Not reported in the paper
Comparison to Prior Work
vs. Random Sampling: Selects data based on semantic content and information density rather than uniform probability
vs. Standard Continual Learning: Incorporates data synthesis to augment the small buffer, addressing data scarcity on edge devices
vs. Cloud-based Fine-tuning [not cited in paper]: Performs all selection and training locally to preserve privacy
Limitations
Requires a pre-stored dictionary for Domain Specific Score (DSS) calculation
Data synthesis relies on the LLM's own capability, which might be limited on smaller edge models
Sanity check for synthesis uses ROUGE-1, which may not capture semantic errors effectively
Reproducibility
No code release is reported. The paper describes the metrics (EOE, DSS, IDD) mathematically but does not specify the exact pre-trained transformer used for embedding or the specific hyperparameters for the fine-tuning process (learning rate, batch size).
📊 Experiments & Results
Evaluation Setup
Simulated on-device learning using streaming datasets with varying temporal correlation
Benchmarks:
ALPACA (Instruction Following)
MedDialog (Medical Dialogue)
DOLLY (Instruction Following)
Prosocial-Dialog (Safety/Social Dialogue)
Metrics:
ROUGE-1
Statistical methodology: Not explicitly reported in the paper
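For readers unfamiliar with the metric, ROUGE-1 counts clipped unigram overlap between a generated text and a reference. The function below is a minimal sketch that assumes lowercased whitespace tokenization with no stemming. The summary does not say which ROUGE-1 variant (precision, recall, or F1) or which implementation the paper uses, so all three values are returned.

```python
from collections import Counter

def rouge1(candidate: str, reference: str) -> dict:
    """Unigram-overlap precision, recall, and F1 (whitespace tokens, lowercased)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection clips each word's count to the smaller of the two.
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}

scores = rouge1("the cat sat", "the cat sat on the mat")
print(scores["precision"], scores["recall"])  # 1.0 0.5
```

Here all three candidate words appear in the reference (precision 1.0), but they cover only three of its six words (recall 0.5).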
Main Takeaways
Proposed framework consistently outperforms vanilla baselines (up to 38% higher ROUGE-1) across multiple datasets including ALPACA and MedDialog
The combination of Entropy, Domain Score, and Dissimilarity effectively selects high-value data from temporally correlated streams
Self-supervised data synthesis successfully augments the small buffer, allowing for effective fine-tuning without requiring large-scale storage
The method is robust across dialogue domains, from general instruction following (ALPACA) to specialized domains (MedDialog)
📚 Prerequisite Knowledge
Prerequisites
Understanding of Large Language Model fine-tuning
Basics of Continual Learning (replay buffers)
Text embedding and cosine similarity
Key Terms
EOE: Entropy of Embedding, a metric measuring the information content of a text vector based on its token probability distribution
DSS: Domain Specific Score, which measures the overlap between text tokens and pre-defined lexicons for specific domains (e.g., medical, emotion)
IDD: In-Domain Dissimilarity, which measures how distinct a new data point is compared to existing buffer data that shares the same dominant domain
IID: Independent and Identically Distributed, a statistical assumption often violated by streaming user data, which is temporally correlated
ROUGE-1: Recall-Oriented Understudy for Gisting Evaluation, a metric measuring the overlap of unigrams (single words) between generated text and reference text
Self-supervised: Learning or selecting data without explicit human labels, often using intrinsic properties of the data itself
Dialogue Set: The atomic unit of data selection, consisting of a question and an answer pair
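The three selection metrics defined above can be made concrete with toy versions. These are illustrative approximations, not the paper's formulas, which this summary does not reproduce: `eoe` is assumed to be the normalized Shannon entropy of empirical token frequencies, `dss` the fraction of tokens found in each lexicon, and `idd` one minus the mean cosine similarity to buffered items of the same dominant domain. The embeddings are plain float lists supplied by the caller, since the embedding model is unspecified, and the lexicons are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence, Set

def eoe(tokens: Sequence[str]) -> float:
    """Assumed form: Shannon entropy of token frequencies, scaled to [0, 1]."""
    n = len(tokens)
    if n < 2:
        return 0.0
    counts = Counter(tokens)
    entropy = sum((c / n) * math.log(n / c) for c in counts.values())
    return entropy / math.log(n)

def dss(tokens: Sequence[str], lexicons: Dict[str, Set[str]]) -> Dict[str, float]:
    """Assumed form: per-domain fraction of tokens that appear in the lexicon."""
    n = max(len(tokens), 1)
    return {name: sum(t in lex for t in tokens) / n for name, lex in lexicons.items()}

def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 0.0 if norm_u == 0 or norm_v == 0 else dot / (norm_u * norm_v)

def idd(embedding: Sequence[float], same_domain: List[Sequence[float]]) -> float:
    """Assumed form: 1 - mean cosine similarity to same-domain buffer items."""
    if not same_domain:
        return 1.0  # nothing to compare against, so treat as fully novel
    sims = [cosine(embedding, e) for e in same_domain]
    return 1.0 - sum(sims) / len(sims)

# Hypothetical lexicons and a toy dialogue set.
lexicons = {"medical": {"fever", "cough"}, "emotion": {"sad", "happy"}}
tokens = ["fever", "and", "cough"]
domain_scores = dss(tokens, lexicons)
dominant = max(domain_scores, key=domain_scores.get)
print(dominant, eoe(tokens), idd([1.0, 0.0], [[0.0, 1.0]]))
```

The dominant domain from `dss` determines which buffered embeddings are passed to `idd`, which mirrors the "same dominant domain" condition in the IDD definition.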