Hyungjune Bu, Chanjoo Jung, Minjae Kang, Jaehyung Kim
Yonsei University,
Opt-AI Inc.
arXiv
(2025)
P13NRL
📝 Paper Summary
Personalized Text Generation, Decoding Strategies
CoPe personalizes LLM outputs by contrasting the logits of a user-tuned model against a base model during decoding, effectively maximizing an implicit user reward without external reward models.
Core Problem
Existing personalization methods are either prompt-based (limited memory, no learning) or training-based (costly, prone to forgetting), while decoding-time strategies for personalization remain unexplored.
Why it matters:
Generic LLMs fail to align with individual writing styles and preferences required for assistants and recommendation systems.
Frequent retraining of full models to reflect evolving user preferences is computationally prohibitive and risks catastrophic forgetting.
Prompting methods struggle with context length limitations and cannot deeply internalize user behavior like parametric methods can.
Concrete Example: A generic model might answer a news query with a neutral summary, whereas a specific user might prefer a witty, editorial-style headline. Standard fine-tuning might lose general knowledge, while CoPe dynamically steers the generic model's output toward the user's style during generation.
Key Novelty
Implicit Reward-Guided Decoding (CoPe)
Leverages the insight that the log-likelihood ratio between a user-tuned model and a base model serves as an 'implicit reward' signal for user preference.
Implements this signal via contrastive decoding: the model boosts tokens favored by the personalized adapter while penalizing generic tokens favored by the base model.
Introduces a training scheme that synthesizes 'negative' user examples (generic outputs) to optimize the adapter via Direct Preference Optimization (DPO) before decoding.
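In symbols, the implicit reward and the contrastive score described above can be sketched as follows. The notation is the standard one from DPO and contrastive decoding. The scale factor \beta and the exact way \alpha enters the token score are assumptions for illustration, not formulas taken from the paper.

```latex
% Implicit reward of a response y to query x: the log-likelihood ratio of the
% user-tuned model to the base model (\beta is an assumed scale factor).
r_{\text{user}}(x, y)
  = \beta \log \frac{\pi_{\text{user}}(y \mid x)}{\pi_{\text{base}}(y \mid x)}
  = \beta \sum_{t=1}^{|y|} \Big[ \log \pi_{\text{user}}(y_t \mid x, y_{<t})
                               - \log \pi_{\text{base}}(y_t \mid x, y_{<t}) \Big]

% One common token-level contrastive score (assumed form): the user model's
% log-probability plus an \alpha-weighted per-token implicit reward.
s_t(v) = \log \pi_{\text{user}}(v \mid x, y_{<t})
       + \alpha \Big[ \log \pi_{\text{user}}(v \mid x, y_{<t})
                    - \log \pi_{\text{base}}(v \mid x, y_{<t}) \Big]
```

The second equality in the first equation is the chain rule over tokens, which is why a sequence-level reward can be applied greedily one token at a time during decoding.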
Architecture
Overview of the CoPe pipeline including the training phase (constructing preference datasets via implicit reward) and the decoding phase (contrasting logits).
Evaluation Highlights
Achieves an average relative improvement of 10.57% in ROUGE-L across five open-ended generation tasks compared to standard task-finetuned models.
Outperforms a simply personalized model (without contrastive decoding) by an average of 5.67% in ROUGE-L, demonstrating the specific value of the decoding strategy.
Demonstrates generalization across different model scales and architectures (Llama-2, Mistral, Solar) without requiring external reward models.
Breakthrough Assessment
7/10
Offers a clever link between contrastive decoding and implicit rewards for personalization. While mathematically grounded in existing concepts (DPO/Contrastive Decoding), applying it to personalization to avoid external reward models is a practical and effective innovation.
⚙️ Technical Details
Problem Definition
Setting: Open-ended personalized text generation
Inputs: Input query x and historical interaction data H_user
Outputs: Personalized response y aligned with user preferences
Pipeline Flow
Group: Inference Pipeline -> Base Model & Personalized Model -> Logit Contrast -> Token Selection
System Modules
Base Model (Inference Pipeline)
Provides the generic/reference probability distribution for the next token
Model or implementation: Pre-trained LLM (e.g., Llama-2, Mistral)
Personalized Model (Inference Pipeline)
Provides the user-adapted probability distribution
Model or implementation: Base Model + User-specific LoRA Adapter
CoPe Decoder (Inference Pipeline)
Calculates implicit reward and selects next token
Model or implementation: Mathematical operation (Subtraction & Argmax)
Novel Architectural Elements
Use of the log-likelihood ratio between a PEFT adapter and its base model as a direct proxy for 'user preference reward' during decoding.
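The three modules above can be illustrated with a minimal single-step sketch in NumPy. It assumes the two models have already produced next-token logits over the same vocabulary. The function name `cope_step`, the scoring form `log p_user + alpha * (log p_user - log p_base)`, and the use of `tau` as a fraction of the user model's top probability are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over a 1-D logit vector."""
    z = logits - np.max(logits)
    return z - np.log(np.sum(np.exp(z)))

def cope_step(user_logits, base_logits, alpha=1.0, tau=0.1):
    """Pick the next token id by contrasting user-model and base-model logits.

    Assumed (hypothetical) details:
      * candidate pruning keeps tokens with p_user >= tau * max(p_user)
      * score = log p_user + alpha * (log p_user - log p_base)
    """
    lp_user = log_softmax(np.asarray(user_logits, dtype=float))
    lp_base = log_softmax(np.asarray(base_logits, dtype=float))

    # Candidate set: drop tokens the personalized model itself finds implausible.
    keep = lp_user >= np.log(tau) + lp_user.max()

    # Per-token implicit reward: log-likelihood ratio of user model to base model.
    implicit_reward = lp_user - lp_base

    score = lp_user + alpha * implicit_reward
    score = np.where(keep, score, -np.inf)
    return int(np.argmax(score))

# Toy 3-token vocabulary: token 0 is generic, token 1 is user-flavored,
# token 2 is implausible under both models.
user = [2.0, 1.9, -5.0]
base = [3.0, 0.0, -9.0]
print(cope_step(user, base, alpha=1.0, tau=0.1))   # contrast favors token 1
print(cope_step(user, base, alpha=0.0, tau=0.1))   # alpha=0 reduces to greedy on the user model
```

Without the pruning step, a large `alpha` can promote tokens that are unlikely under both models but relatively less unlikely under the user model, which is the usual motivation for a threshold such as tau.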
Modeling
Base Model: Evaluated on Llama-2-7b, Mistral-7B-v0.1, Solar-10.7B
Training Method: PEFT (LoRA) followed by Direct Preference Optimization (DPO)
Purpose: Train the adapter to distinguish user style from generic style.
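Formally: minimize the DPO loss -log(sigmoid(r_user(y_pos) - r_user(y_neg))), where r_user is the implicit reward, y_pos is a real user response, and y_neg is a synthetic negative.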
Adaptation: LoRA (Low-Rank Adaptation)
Trainable Parameters: User-specific LoRA modules only
Training Data:
Positive examples: Actual user historical responses
Negative examples: Synthetic responses generated by the Base Model that have the *lowest* implicit reward (least likely to resemble the user's own writing)
Key Hyperparameters:
alpha (decoding): Contrastive weight hyperparameter (value not given in the summarized text)
alpha (negative mining): 1 (explicitly stated for Eq 5)
tau: Adaptive threshold hyperparameter for candidate set pruning
Compute: Not reported in the paper
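The training recipe above (pick the lowest-reward synthetic response as the negative, then apply the DPO objective) can be sketched in plain Python. It assumes sequence-level log-probabilities under the user and base models are already available. The helper names, the `beta` scale, and the toy numbers are hypothetical and only illustrate the arithmetic.

```python
import math

def implicit_reward(logp_user, logp_base, beta=1.0):
    """Implicit reward: scaled log-likelihood ratio (beta is an assumed scale)."""
    return beta * (logp_user - logp_base)

def pick_negative(candidates):
    """Return the base-model sample with the LOWEST implicit reward.

    candidates: list of (text, logp_user, logp_base) tuples.
    """
    return min(candidates, key=lambda c: implicit_reward(c[1], c[2]))

def dpo_loss(r_pos, r_neg):
    """-log(sigmoid(r_pos - r_neg)), computed stably as softplus(-(r_pos - r_neg))."""
    margin = r_pos - r_neg
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Hypothetical log-probabilities for three base-model samples (not real data).
samples = [
    ("generic summary A", -42.0, -30.0),   # reward -12.0
    ("generic summary B", -35.0, -31.0),   # reward  -4.0
    ("generic summary C", -33.0, -32.5),   # reward  -0.5
]
text, lp_u, lp_b = pick_negative(samples)
r_neg = implicit_reward(lp_u, lp_b)
r_pos = implicit_reward(-28.0, -34.0)      # a real user response, reward +6.0
print(text, dpo_loss(r_pos, r_neg))
```

In an actual training loop the same loss would be computed on differentiable log-probabilities so that gradients flow only into the user-specific LoRA parameters.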
Comparison to Prior Work
vs. SFT/One PEFT per User: CoPe adds a decoding-time guidance term (contrastive) rather than relying solely on the tuned weights.
vs. RAG: CoPe updates model parameters efficiently (PEFT) and doesn't rely on context window retrieval.
vs. Standard Contrastive Decoding: CoPe contrasts *User vs. Base* to isolate personal preference, whereas standard CD contrasts *Expert vs. Amateur* to isolate quality/truthfulness.
Code is publicly available at https://github.com/cleverscent/CoPe. The paper introduces the hyperparameters alpha and tau, but the summarized text does not give the values used in the experiments; they are likely listed in the appendix.
📊 Experiments & Results
Evaluation Setup
Personalized open-ended text generation using user history
Statistical methodology: Not explicitly reported in the paper
Key Results
Benchmark | Metric | Baseline | This Paper | Δ
LaMP & LongLaMP (Average) | ROUGE-L (Relative Improvement) | 0 (reference baseline: task-finetuned model) | 10.57 | +10.57%
LaMP & LongLaMP (Average) | ROUGE-L (Relative Improvement) | 0 (reference baseline: personalized model without contrastive decoding) | 5.67 | +5.67%
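For readers unfamiliar with the "relative improvement" metric in the rows above, the arithmetic can be sketched as below. The summary does not say whether the average is taken over per-task relative gains or over raw scores; this sketch assumes the former, and the score pairs are hypothetical placeholders, not results from the paper.

```python
def relative_improvement(baseline, ours):
    """Relative gain of `ours` over `baseline`, in percent."""
    return (ours - baseline) / baseline * 100.0

def average_relative_improvement(pairs):
    """Mean of per-task relative gains (assumed aggregation)."""
    gains = [relative_improvement(b, o) for b, o in pairs]
    return sum(gains) / len(gains)

# Hypothetical (baseline, ours) ROUGE-L pairs, NOT numbers from the paper.
hypothetical_pairs = [(0.20, 0.22), (0.40, 0.42)]
print(round(average_relative_improvement(hypothetical_pairs), 2))
```

Because the gain is normalized by the baseline score, the same absolute ROUGE-L gain counts for more on tasks where the baseline is low.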
Main Takeaways
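Contrastive decoding improves personalization performance (ROUGE-L) over fine-tuning alone: on average +10.57% relative to task-finetuned models and +5.67% relative to the personalized model without contrastive decoding.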
Synthesizing negative examples using the base model (finding outputs with low implicit user reward) effectively enables DPO training without needing real negative user data.
The approach generalizes across different LLM architectures (Llama-2, Mistral, Solar), suggesting the implicit reward signal is a robust proxy for user preference.
📚 Prerequisite Knowledge
Prerequisites
Parameter-Efficient Fine-Tuning (PEFT)
Log-likelihood and Logits
Contrastive Decoding
Direct Preference Optimization (DPO)
Key Terms
PEFT: Parameter-Efficient Fine-Tuning—methods like LoRA that adapt large models by updating only a small subset of parameters.
LoRA: Low-Rank Adaptation—a specific PEFT technique that injects trainable low-rank matrices into frozen model layers.
Contrastive Decoding: A decoding strategy that selects tokens that are probable in a 'strong' model but improbable in a 'weak' (or generic) model to improve quality.
Implicit Reward: A reward signal derived mathematically from the ratio of probabilities between two models (e.g., policy vs. reference) rather than from a separate trained reward model.
DPO: Direct Preference Optimization—a training method that aligns models to preferences by optimizing the likelihood of preferred responses over dispreferred ones directly.
ROUGE-L: A metric measuring text overlap between generated output and a reference, focusing on the longest common subsequence.
Best-of-N sampling: A method of generating N candidate responses and selecting the best one according to a reward function.
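As a concrete illustration of the ROUGE-L entry above, here is a minimal sketch of the longest-common-subsequence F1 score on whitespace tokens. Standard ROUGE tooling applies its own tokenization and optional stemming, so its scores can differ; the function names here are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences (O(len(a)*len(b)))."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            if x == y:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def rouge_l_f1(candidate, reference):
    """Simplified ROUGE-L: F1 of LCS-based precision and recall on whitespace tokens."""
    c, r = candidate.split(), reference.split()
    if not c or not r:
        return 0.0
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision = lcs / len(c)
    recall = lcs / len(r)
    return 2 * precision * recall / (precision + recall)

# The LCS of the two token lists below is ["a", "c"], so precision = 2/4 and recall = 2/3.
print(rouge_l_f1("a b c d", "a c e"))
```

Because the LCS respects word order but allows gaps, ROUGE-L rewards outputs that follow the reference's phrasing without requiring contiguous n-gram matches.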