TIS-DPO: Token-level Importance Sampling for Direct Preference Optimization With Estimated Weights

📝 Paper Summary

LLM Alignment, Direct Preference Optimization (DPO), Token-level reward modeling

TIS-DPO improves language model alignment by assigning unique importance weights to each token during training, rather than treating entire responses as uniformly good or bad.

Core Problem

Standard DPO (Direct Preference Optimization) treats a whole response as a single unit, ignoring that 'winning' responses often contain poor tokens and 'losing' responses contain good tokens.

Why it matters:

Uniformly increasing probabilities for all tokens in a winning response reinforces mistakes or low-quality segments within that response.
The assumption that all tokens in a preferred response are equally good introduces significant noise, reducing optimization efficiency and final model performance.

Concrete Example: In a summarization task, a 'winning' summary might be preferred overall but still contain a hallucinated fact or a grammatical error. Standard DPO boosts the probability of that error just as much as the correct parts. TIS-DPO would assign a low weight to the error token, preventing the model from learning it.

Key Novelty

Token-level Importance Sampling DPO (TIS-DPO)

Hypothesizes an 'optimal' dataset where every token in a winning response is equally good. Since real data isn't like this, it uses importance sampling to adjust the training loss.
Estimates token-level quality (weights) by comparing probabilities from two 'contrastive' models—one biased towards good responses and one towards bad responses.
Reweights the standard DPO loss per token: high-quality tokens in winners get higher weights; low-quality tokens get lower weights, effectively denoising the signal.
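The importance-sampling idea behind these points can be illustrated on a toy discrete distribution. This is a minimal sketch rather than the paper's derivation: the token-quality categories, the probabilities in `observed` and `ideal`, and the `score` values are all made-up numbers chosen for illustration.

```python
# Toy illustration of importance sampling (all numbers are hypothetical).
# We want an expectation under an "ideal" distribution but only have
# data that follows the "observed" distribution.

observed = {"good": 0.6, "neutral": 0.3, "bad": 0.1}  # distribution the data follows
ideal = {"good": 0.9, "neutral": 0.1, "bad": 0.0}     # distribution we wish we had

# Arbitrary per-category score whose expectation we want.
score = {"good": 1.0, "neutral": 0.0, "bad": -1.0}


def expectation(dist, f):
    """Exact expectation of f under a discrete distribution."""
    return sum(p * f[x] for x, p in dist.items())


def importance_weighted_expectation(proposal, target, f):
    """Expectation of f under `target`, computed over `proposal`
    with importance weights target(x) / proposal(x)."""
    total = 0.0
    for x, q in proposal.items():
        weight = target[x] / q  # requires q > 0 wherever target > 0
        total += q * weight * f[x]
    return total


direct = expectation(ideal, score)
reweighted = importance_weighted_expectation(observed, ideal, score)
unweighted = expectation(observed, score)
```

Here `direct` and `reweighted` agree, while `unweighted` does not. TIS-DPO applies the same kind of reweighting at the token level, with the difference that the weights have to be estimated rather than read off known distributions.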

Architecture

The workflow of TIS-DPO, illustrating the two-step process: (1) Token Weight Estimation using contrastive LLMs, and (2) Weighted DPO Optimization.

Evaluation Highlights

+3.45 percentage-point win rate improvement over standard DPO (59.26 to 62.71) on the Anthropic-HH harmlessness benchmark using Llama-2-7B.
+2.37 percentage-point win rate improvement over DPO (61.35 to 63.72) on the PKU-RLHF helpfulness benchmark.
Achieves higher reward scores while maintaining lower KL divergence (staying closer to the reference model) compared to baselines like IPO and KTO.

Breakthrough Assessment

7/10

Offers a theoretically grounded improvement to DPO with practical estimation methods. The gains are consistent across tasks, addressing a known granularity issue in preference learning.

⚙️ Technical Details

Problem Definition

Setting: Offline preference optimization for LLMs using paired preference data (winning/losing responses)

Inputs: Prompt x and a pair of responses (y_w, y_l) where y_w is preferred over y_l

Outputs: Optimized policy π_θ that maximizes implicit reward consistent with preferences

Pipeline Flow

Token-level Weight Estimation (Offline Step)
Weighted DPO Training (Main Step)

System Modules

Contrastive Model Estimator

Estimate token-level rewards by comparing probabilities from a 'positive' model and a 'negative' model

Model or implementation: Various (Prompt-based, SFT-based, or DPO-based pairs)
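A minimal sketch of how per-token weights might be derived from the two contrastive models. It assumes the per-token log-probabilities under each model are already computed, and it uses an exponential of the clamped log-probability gap as the weighting function. That functional form, the sign flip for losing responses, and the parameters `scale`, `low`, and `high` are illustrative assumptions, not details confirmed by this summary.

```python
import math


def estimate_token_weights(pos_logps, neg_logps, is_winner,
                           scale=1.0, low=-1.0, high=1.0):
    """Turn contrastive log-probabilities into per-token weights.

    pos_logps / neg_logps: log-probability of each token under the
    'positive' and the 'negative' model. `scale`, `low`, and `high`
    are hypothetical knobs for this sketch.
    """
    if len(pos_logps) != len(neg_logps):
        raise ValueError("log-probability lists must have equal length")
    # Assumption: the gap counts in the opposite direction for losers.
    sign = 1.0 if is_winner else -1.0
    weights = []
    for lp_pos, lp_neg in zip(pos_logps, neg_logps):
        gap = lp_pos - lp_neg            # > 0 means the token looks 'good'
        gap = max(low, min(high, gap))   # clamp to limit the effect of outliers
        weights.append(math.exp(sign * scale * gap))
    return weights


# The second token is favoured by the negative model,
# so it is down-weighted inside a winning response.
winner_weights = estimate_token_weights(
    pos_logps=[-1.0, -2.0, -0.5],
    neg_logps=[-1.5, -0.5, -0.5],
    is_winner=True,
)
```

Under these assumptions, calling the function with `is_winner=False` inverts each weight, so tokens that the negative model favours count more in a losing response.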

TIS-DPO Trainer

Optimize the policy model using the token-weighted DPO objective

Model or implementation: Target LLM (e.g., Llama-2-7B, Mistral-7B)

Novel Architectural Elements

Integration of token-level importance weights directly into the analytic solution of the DPO loss function
Use of 'Backward DPO' (training on swapped labels) to generate a strong negative contrastive model for weight estimation
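The label swap behind Backward DPO amounts to a small data transformation. The sketch below assumes preference pairs are stored as dictionaries with `prompt`, `chosen`, and `rejected` keys; those field names and the example strings are hypothetical, and the actual repository may organize its data differently.

```python
def swap_preferences(pairs):
    """Return a copy of the preference data with chosen and rejected swapped.

    Training DPO on the result would push the model toward the originally
    rejected responses, giving a 'negative' contrastive model.
    """
    return [
        {"prompt": p["prompt"], "chosen": p["rejected"], "rejected": p["chosen"]}
        for p in pairs
    ]


forward_data = [
    {
        "prompt": "Summarize the article.",
        "chosen": "Accurate summary.",
        "rejected": "Off-topic reply.",
    },
]
backward_data = swap_preferences(forward_data)
```

The original list is left untouched, so the same source data can feed both the forward and the backward run.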

Modeling

Base Model: Llama-2-7B, Mistral-7B-v0.1

Training Method: TIS-DPO (Token-level Importance Sampling DPO)

Objective Functions:

Purpose: Maximize the margin between preferred and dispreferred responses, weighted by token importance.

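Formally: L_TIS-DPO = -E[ log σ( β * ( Σ_t w_t^w * log(π_θ(y_w^t | x, y_w^<t) / π_ref(y_w^t | x, y_w^<t)) - Σ_t w_t^l * log(π_θ(y_l^t | x, y_l^<t) / π_ref(y_l^t | x, y_l^<t)) ) ) ], where w_t^w and w_t^l are the estimated weights of token t in the winning and losing response.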
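The objective above can be sketched for a single preference pair in plain Python. This follows the formula as stated in this summary; the full objective in the paper may contain additional terms. The function names and the list-based interface are assumptions made for illustration, and a real trainer would operate on batched tensors.

```python
import math


def weighted_log_ratio(policy_logps, ref_logps, weights):
    """Sum over tokens of w_t * (log pi_theta - log pi_ref)."""
    if not (len(policy_logps) == len(ref_logps) == len(weights)):
        raise ValueError("per-token lists must have equal length")
    return sum(w * (lp - lr) for w, lp, lr in zip(weights, policy_logps, ref_logps))


def tis_dpo_loss(policy_w, ref_w, weights_w,
                 policy_l, ref_l, weights_l, beta=0.1):
    """Loss for one pair: -log sigmoid(beta * weighted margin)."""
    margin = beta * (
        weighted_log_ratio(policy_w, ref_w, weights_w)
        - weighted_log_ratio(policy_l, ref_l, weights_l)
    )
    # -log(sigmoid(z)) equals softplus(-z); this form avoids overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Policy equals reference everywhere: margin is 0 and the loss is log(2).
baseline = tis_dpo_loss([-1.0, -2.0], [-1.0, -2.0], [1.0, 1.0],
                        [-1.0, -1.0], [-1.0, -1.0], [1.0, 1.0])

# Policy raises the first winning token's probability: the loss drops.
improved = tis_dpo_loss([-0.5, -2.0], [-1.0, -2.0], [1.0, 1.0],
                        [-1.0, -1.0], [-1.0, -1.0], [1.0, 1.0])

# Same change, but that token has weight 0: it no longer affects the loss.
ignored = tis_dpo_loss([-0.5, -2.0], [-1.0, -2.0], [0.0, 1.0],
                       [-1.0, -1.0], [-1.0, -1.0], [1.0, 1.0])
```

Setting every weight to 1 recovers the usual sequence-level DPO margin, which is one way to see that token weighting only rescales each token's contribution.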

Adaptation: Full fine-tuning

Training Data:

PKU-RLHF (safety/helpfulness)
Anthropic-HH (helpful & harmless)
TL;DR (summarization)

Key Hyperparameters:

beta: 0.1
learning_rate: 5e-7 (Llama-2), 1e-6 (Mistral)
batch_size: 64 (accumulated)
epochs: 1 (TL;DR), 2 (Anthropic-HH), 3 (PKU-RLHF)
max_length: 1024 (Llama-2), 2048 (Mistral)

Compute: Not explicitly reported in the paper

Comparison to Prior Work

vs. DPO: DPO weights all tokens equally; TIS-DPO weights them by estimated importance.
vs. TDPO: TDPO adds token-level KL regularization but doesn't explicitly model token importance weights based on reward estimation; TIS-DPO explicitly derives weights.
vs. PPO [not cited in paper]: PPO uses a learned value function for token-level updates; TIS-DPO estimates weights statically or via contrastive models without a separate value network during the main optimization loop.
vs. RSO [not cited in paper]: RSO samples from the optimal policy; TIS-DPO uses importance sampling to approximate the optimal distribution using the original data.

Limitations

Computational overhead of creating contrastive models (requires training two extra models for the SFT/DPO-based estimation methods).
Sensitivity to the quality of the estimated weights; poor estimates could degrade performance.
The theoretical derivation assumes an 'optimal' dataset distribution that is approximated, which may not hold perfectly in practice.
Main experiments focus on 7B models; scaling to larger models is not explicitly tested.

Reproducibility

Code: https://github.com/exlaw/TIS-DPO

Code is publicly available. The paper describes three specific methods for weight estimation (Prompt-based, SFT-based, DPO-based) with hyperparameters provided in Appendix B. Datasets used are public benchmarks.

📊 Experiments & Results

Evaluation Setup

Evaluated on harmlessness, helpfulness, and summarization tasks, using win rates against the dataset's chosen responses, GPT-4 as a judge, ROUGE, and a proxy reward model.

Benchmarks:

PKU-RLHF (Safety and Helpfulness alignment)
Anthropic-HH (Helpful and Harmless dialogue)
TL;DR (Summarization)

Metrics:

Win Rate (vs. chosen response)
GPT-4 Win Rate (vs. Baseline)
ROUGE-1/2/L
Reward Score (using proxy reward model)
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Comparative results on the Anthropic-HH dataset (Harmless subset) using Llama-2-7B show TIS-DPO outperforming DPO and other baselines in win rates against the ground truth.
Anthropic-HH (Harmless)	Win Rate (vs. Target)	59.26	62.71	+3.45
Anthropic-HH (Helpful)	Win Rate (vs. Target)	52.33	54.02	+1.69
Results on the PKU-RLHF dataset (Helpful subset) using Llama-2-7B.
PKU-RLHF (Helpful)	Win Rate (vs. Target)	61.35	63.72	+2.37
Summarization performance on TL;DR using Mistral-7B, evaluating generation quality via ROUGE scores.
TL;DR	ROUGE-L	32.48	33.62	+1.14
GPT-4 Evaluation win rates comparing TIS-DPO directly against standard DPO on Anthropic-HH.
Anthropic-HH (Harmless)	GPT-4 Win Rate (TIS-DPO vs DPO)	31.00	45.00	+14.00

Experiment Figures

Comparison of Reward vs. KL Divergence curves for DPO, IPO, and TIS-DPO on the Anthropic-HH dataset.
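The KL axis in a plot of this kind is commonly estimated from samples. A minimal sketch, assuming responses are sampled from the policy and that the summed token log-probability of each sampled response is available under both the policy and the reference model. The function name and the numbers are hypothetical, and this summary does not specify the paper's exact evaluation protocol.

```python
def estimate_sequence_kl(policy_seq_logps, ref_seq_logps):
    """Monte Carlo estimate of KL(pi_theta || pi_ref).

    Each entry is the total log-probability of one response sampled from
    the policy, scored under the policy and under the reference model.
    """
    if len(policy_seq_logps) != len(ref_seq_logps) or not policy_seq_logps:
        raise ValueError("need two non-empty lists of equal length")
    diffs = [lp - lr for lp, lr in zip(policy_seq_logps, ref_seq_logps)]
    return sum(diffs) / len(diffs)


# Three sampled responses (made-up log-probabilities).
kl = estimate_sequence_kl([-10.0, -12.0, -8.0], [-11.0, -12.5, -9.5])
```

A smaller estimate means the trained policy assigns its own samples probabilities close to those of the reference model.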

Visualization of estimated token weights on a sample response.

Main Takeaways

TIS-DPO consistently outperforms standard DPO across safety, helpfulness, and summarization tasks, indicating that token-level weighting is beneficial.
Among the three weight estimation methods (Prompt, SFT, DPO-based), the DPO-based method (Forward + Backward DPO) generally yields the best performance.
The method achieves higher rewards at lower KL divergences compared to DPO, suggesting a better Pareto frontier between alignment and preserving the base model's capabilities.
Visualizations of weights show the method successfully assigns lower weights to common prefixes/suffixes and higher weights to content-rich, distinguishing tokens.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning from Human Feedback (RLHF)
Direct Preference Optimization (DPO)
Importance Sampling
Bradley-Terry Model

Key Terms

DPO: Direct Preference Optimization—an algorithm that optimizes language models to align with human preferences by solving for the optimal policy directly without an explicit reward model loop

RLHF: Reinforcement Learning from Human Feedback—a technique to align models using human preference data, typically involving a reward model and PPO

PPO: Proximal Policy Optimization—a standard reinforcement learning algorithm used in RLHF to update the model policy

Importance Sampling: A statistical technique used to estimate properties of a target distribution using samples from a different distribution by reweighting them

Contrastive LLMs: A pair of language models where one is biased towards generating preferred (winning) responses and the other towards non-preferred (losing) responses

Forward/Backward DPO: A method for creating contrastive models. Forward DPO trains on the correct preferences (y_w > y_l) to obtain the 'good' model; Backward DPO trains on swapped preferences (y_l > y_w) to obtain the 'bad' model

KL divergence: Kullback-Leibler divergence—a measure of how one probability distribution differs from a second, reference probability distribution

Bradley-Terry model: A statistical model that predicts the probability of one item being preferred over another based on their latent reward scores

Partition function: A normalization factor in probability distributions, often denoted as Z(x)

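IPO: Identity Preference Optimization, a DPO variant that replaces the log-sigmoid loss with a squared-error objective pulling the log-ratio margin toward a fixed target, which regularizes against overfitting to the preference labels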

KTO: Kahneman-Tversky Optimization—a preference optimization method based on prospect theory that defines utility directly on outputs

ROUGE: Recall-Oriented Understudy for Gisting Evaluation—a set of metrics used to evaluate automatic summarization and machine translation