Decoding-Time Language Model Alignment with Multiple Objectives

📝 Paper Summary

Language Model Alignment, Multi-Objective Optimization

MOD combines output distributions from multiple single-objective aligned models at decoding time to optimally satisfy arbitrary preference weightings without retraining.

Core Problem

Existing alignment methods (like PPO/DPO) optimize for a single reward function, but real-world applications require trading off multiple conflicting objectives (e.g., helpfulness vs. safety) based on varying user preferences.

Why it matters:

Retraining models for every possible combination of user preferences is computationally prohibitive
Prompt engineering fails to provide precise, granular control over the weighting of output characteristics
Parameter merging (weight interpolation) is theoretically sub-optimal for non-linear objectives like KL-divergence regularization

Concrete Example: A dialogue agent needs to balance 'helpfulness' and 'harmlessness'. A user might want 70% helpfulness and 30% safety, while another wants 50/50. Current methods require either retraining a new model for each ratio or hoping a single model can generalize, often failing to hit the specific desired trade-off.

Key Novelty

Multi-Objective Decoding (MOD) via Legendre Transform

Uses the Legendre transform of f-divergence-regularized objectives (the kind optimized by PPO and DPO) to derive a closed-form solution for optimal multi-objective policies
Shows that the optimal policy for a weighted sum of rewards is a weighted geometric mean of the base policies (for KL-divergence), not a linear interpolation of model parameters
Implements this solution as a simple token-level decoding strategy that linearly combines the logits of base models according to preference weights
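The link between the geometric mean and the logit combination can be checked numerically. The sketch below is illustrative only: the distributions `p1` and `p2`, the weights, and the helper names are made up for this example, and it assumes the KL case with weights that sum to 1. It shows that a weighted sum of log-probabilities followed by a softmax gives the same distribution as the normalized weighted geometric mean, and that both differ from an arithmetic mixture of the probabilities.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def geometric_mean_policy(probs, weights):
    """Normalized weighted geometric mean of next-token distributions."""
    unnorm = np.prod([p ** w for p, w in zip(probs, weights)], axis=0)
    return unnorm / unnorm.sum()

def log_linear_policy(probs, weights):
    """Weighted sum of log-probabilities, renormalized with a softmax."""
    mixed = sum(w * np.log(p) for p, w in zip(probs, weights))
    return softmax(mixed)

# Made-up next-token distributions over a 3-token vocabulary.
p1 = np.array([0.7, 0.2, 0.1])
p2 = np.array([0.1, 0.3, 0.6])
weights = [0.7, 0.3]  # assumed to sum to 1

geo = geometric_mean_policy([p1, p2], weights)
lin = log_linear_policy([p1, p2], weights)
arithmetic = weights[0] * p1 + weights[1] * p2  # plain probability mixture
```

Because a softmax ignores constant shifts, the same result holds when raw logits are combined instead of log-probabilities, as long as the weights sum to 1.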

Architecture

Illustration of the MOD pipeline compared to training-based methods. It shows multiple base models processing the same input, their output distributions being aggregated via a weighted geometric mean (linear in log-space), resulting in a final token distribution.

Evaluation Highlights

+12.8% overall reward improvement compared to parameter merging (Rewarded Soups) when equally optimizing three objectives on the Helpful Assistant task
Reduces toxicity to nearly 0% while achieving 7.9% to 33.3% improvement across three other metrics (Codex@1, GSM-COT, BBH-COT) when combining three Tülu models
Successfully combines heterogeneous models (two 13B DPO models and one 7B SFT model) for open instruction following, demonstrating flexibility across scales and training methods

Breakthrough Assessment

8/10

Provides a strong theoretical foundation (convex optimization) for a simple, effective practical method (logit mixing). Solves the multi-objective problem without retraining, a significant efficiency gain.

⚙️ Technical Details

Problem Definition

Setting: Decoding a response y given input x to maximize a weighted sum of M reward functions regularized by a reference policy

Inputs: Prompt x, set of base policies {π_i} trained for single rewards, preference weights w

Outputs: Token sequence y sampled from the aggregated distribution

Pipeline Flow

Input Processing (Prompt x)
Base Model Parallel Inference (Compute logits for π_ref and all π_i)
Logit Aggregation (Combine logits using weights w)
Sampling (Select next token y_t)
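The four steps above can be sketched as a short decoding loop. This is a minimal sketch under stated assumptions, not the authors' implementation: the language models are replaced by hypothetical stand-in functions (`make_toy_model`) that return pseudo-random logits, the vocabulary size and `EOS_ID` are invented, and the aggregation rule is the KL-case form `ref + sum_i w_i * (base_i - ref)`, which reduces to a plain weighted sum when the weights sum to 1.

```python
import numpy as np

VOCAB_SIZE = 5   # toy vocabulary size (assumption)
EOS_ID = 0       # toy end-of-sequence id (assumption)

def make_toy_model(seed):
    """Hypothetical stand-in for a language model: token prefix -> logits."""
    def logits_fn(prefix):
        # Deterministic pseudo-logits that depend on the model seed and prefix.
        rng = np.random.default_rng(seed + 31 * len(prefix) + sum(prefix))
        return rng.normal(size=VOCAB_SIZE)
    return logits_fn

def log_softmax(z):
    z = z - np.max(z)
    return z - np.log(np.exp(z).sum())

def mod_decode(prompt_ids, ref_model, base_models, weights,
               max_new_tokens=8, seed=0):
    """Sample tokens from the aggregated distribution, one step at a time."""
    rng = np.random.default_rng(seed)
    tokens = list(prompt_ids)
    for _ in range(max_new_tokens):
        # Step 2: run the reference model and every base model on the prefix.
        ref_logp = log_softmax(ref_model(tokens))
        # Step 3: combine in log-space using the preference weights.
        mixed = ref_logp.copy()
        for w, model in zip(weights, base_models):
            mixed += w * (log_softmax(model(tokens)) - ref_logp)
        probs = np.exp(log_softmax(mixed))
        # Step 4: sample the next token from the aggregated distribution.
        next_token = int(rng.choice(VOCAB_SIZE, p=probs))
        tokens.append(next_token)
        if next_token == EOS_ID:
            break
    return tokens

ref = make_toy_model(seed=0)
bases = [make_toy_model(seed=1), make_toy_model(seed=2)]
output = mod_decode([1, 2], ref, bases, weights=[0.7, 0.3])
```

In a real deployment each `logits_fn` would be a forward pass of an actual model, and all models would need to share one tokenizer so that the logit vectors line up index by index.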

System Modules

Base Policies

Generate next-token probability distributions for specific single objectives

Model or implementation: Various (Llama-2-7B, Llama-2-13B, Tülu 2)

Aggregator

Combine outputs of base policies according to preference weights w

Model or implementation: Closed-form algebraic equation (Eq. 6/7)

Novel Architectural Elements

Inference-time logit mixing derived specifically from Legendre transform of f-divergence objectives (the specific aggregation formula is the novelty, not just the concept of mixing)

Modeling

Base Model: Llama-2 (7B and 13B) and Tülu 2 (7B, 13B, 70B)

Training Method: PPO (Proximal Policy Optimization) and DPO (Direct Preference Optimization) for base models

Objective Functions:

Purpose: Base models optimize single rewards with KL penalty.

Formally: max_π E[R(y|x)] - β D_KL(π || π_ref)
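For the KL case, the geometric-mean form follows from the well-known closed-form optimum of this objective. The derivation below is a sketch of that special case only; it uses the symbols above plus per-objective rewards R_i and weights w_i from the problem definition, and it does not reproduce the paper's general f-divergence statement (Eq. 6/7). The proportionality constants depend only on x, so they are absorbed by normalization.

```latex
\begin{align*}
\pi_i(y \mid x) &\propto \pi_{\mathrm{ref}}(y \mid x)\,
  \exp\!\big(R_i(y \mid x)/\beta\big)
  && \text{(optimum for reward } R_i\text{)} \\
\pi_w(y \mid x) &\propto \pi_{\mathrm{ref}}(y \mid x)\,
  \exp\!\Big(\tfrac{1}{\beta}\textstyle\sum_{i=1}^{M} w_i R_i(y \mid x)\Big)
  && \text{(optimum for } \textstyle\sum_i w_i R_i\text{)} \\
&\propto \pi_{\mathrm{ref}}(y \mid x)\,
  \prod_{i=1}^{M}\Big(\frac{\pi_i(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}\Big)^{w_i}
  && \text{(substitute } R_i = \beta \log(\pi_i/\pi_{\mathrm{ref}}) + c_i(x)\text{)} \\
&= \prod_{i=1}^{M} \pi_i(y \mid x)^{w_i}
  && \text{(up to normalization, if } \textstyle\sum_i w_i = 1\text{)}
\end{align*}
```

When the weights sum to 1 the reference policy cancels, which is why the combination can be written purely in terms of the base policies.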

Training Data:

Reddit Summary (Summarize-from-Feedback)
Helpful Assistant (Anthropic-HH)
Safety Alignment (BeaverTails)
Open Instruction-Following (UltraFeedback, UltraSafety, CodeAlpaca)

Key Hyperparameters:

kl_coefficient_beta: 0.05 or 0.1 (depending on task)
learning_rate: 1e-6 (PPO), 5e-7 (DPO)
batch_size: 128 or 64

Compute: Requires loading multiple models simultaneously during inference (mitigated by LoRA or distributed serving). Training cost is zero for the aggregation step itself.

Comparison to Prior Work

vs. Rewarded Soups: MOD operates in output (probability) space rather than parameter space, which is theoretically optimal for KL-regularized objectives
vs. MORLHF/MODPO: MOD is training-free at the combination stage; it combines existing single-objective models dynamically, whereas others require retraining for new weights
vs. Proxy Tuning [not cited in paper]: Similar logit arithmetic, but MOD focuses on multi-objective interpolation rather than steering a base model with a small tuned expert

Limitations

Inference memory cost scales linearly with the number of objectives (need to load M models), though adapters mitigate this.
Requires base models to share the same vocabulary (tokenizer).
Does not produce a single static model checkpoint; requires the decoding algorithm for deployment.

Reproducibility

Code: https://github.com/ruizhesh/MOD

Base models are standard open weights (Llama-2, Tülu).

📊 Experiments & Results

Evaluation Setup

Decoding with different weight vectors w to map the Pareto frontier of trade-offs between objectives
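One simple way to choose the weight vectors is an evenly spaced grid over the probability simplex. The helper below is a hypothetical sketch: the function name, the grid resolution, and the restriction to non-negative weights that sum to 1 are assumptions for illustration, and the paper may use a different set of weights.

```python
from itertools import product

def simplex_grid(num_objectives, steps):
    """All weight vectors with entries k/steps that are >= 0 and sum to 1."""
    grid = []
    for combo in product(range(steps + 1), repeat=num_objectives - 1):
        remainder = steps - sum(combo)
        if remainder >= 0:  # the last weight takes whatever mass is left
            grid.append(tuple(k / steps for k in combo + (remainder,)))
    return grid

two_objective_weights = simplex_grid(2, steps=4)
three_objective_weights = simplex_grid(3, steps=2)
```

Each weight vector is then used for one decoding run, and the resulting reward scores give one point on the trade-off curve.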

Benchmarks:

Reddit Summary (Summarization)
Helpful Assistant (Dialogue: Helpfulness vs. Harmlessness)
Safety Alignment (Safety vs. Helpfulness)
Open Instruction-Following (General capabilities, Tülu models)

Metrics:

Reward Scores (from off-the-shelf Reward Models)
Pareto Frontier Area (Hypervolume)
Win-rate vs Reference
Toxicity (Toxigen)
Accuracy (GSM-COT, BBH-COT, Codex@1)
Statistical methodology: Not explicitly reported in the paper
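For two objectives, the Pareto-frontier area can be computed directly from the reward pairs. The sketch below is an illustrative two-dimensional version under stated assumptions: both rewards are maximized, the reference point is chosen by the user, and the function names and sample points are invented. The paper's exact evaluation protocol is not given in this summary.

```python
def pareto_front(points):
    """Keep points not dominated by any other point (both rewards maximized)."""
    front = []
    for p in points:
        dominated = any(
            q[0] >= p[0] and q[1] >= p[1] and q != p for q in points
        )
        if not dominated:
            front.append(p)
    return sorted(set(front))

def hypervolume_2d(points, reference):
    """Area dominated by the Pareto front and bounded below by `reference`."""
    front = [
        p for p in pareto_front(points)
        if p[0] > reference[0] and p[1] > reference[1]
    ]
    area = 0.0
    prev_x = reference[0]
    # The front is sorted by increasing x, so y decreases along it.
    for x, y in front:
        area += (x - prev_x) * (y - reference[1])
        prev_x = x
    return area

# Made-up (reward_1, reward_2) pairs, one per weight vector.
sample_points = [(1, 3), (2, 2), (3, 1), (1, 1), (2, 1)]
front = pareto_front(sample_points)
area = hypervolume_2d(sample_points, reference=(0, 0))
```

A larger area means the method reaches trade-offs that dominate more of the reward space, which is how two frontiers can be compared with a single number.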

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Helpful Assistant	Overall Reward Improvement	Not reported in the paper	Not reported in the paper	+12.8%
Open Instruction-Following	Toxicity (Toxigen)	Not reported in the paper	0.0	Reduced to ~0
Open Instruction-Following	Codex@1	Not reported in the paper	Not reported in the paper	+33.3%
Reddit Summary	Pareto Frontier Dominance	See Figure 3	See Figure 3	Positive gap

Note on the Open Instruction-Following rows: Combination of three Tülu models (Safety, Coding, General) shows MOD can simultaneously improve multiple distinct capabilities.

Experiment Figures

Pareto frontiers for Reddit Summary and Helpful Assistant tasks comparing MOD against Rewarded Soups and MORLHF.

3D scatter plot of rewards for 3 objectives (Helpfulness, Harmlessness, Humour) on the Helpful Assistant task.

Main Takeaways

MOD consistently outperforms parameter merging (Rewarded Soups) across the evaluated tasks, suggesting that output-space mixing is better suited than parameter-space mixing for combining aligned models.
The method allows for 'plug-and-play' alignment: models of different sizes (7B + 13B) and training methods (SFT + DPO) can be combined effectively.
Steerability is verified: changing the weight vector w smoothly transitions the model's behavior along the Pareto frontier.
Negative weights are handled effectively, allowing the model to steer *away* from certain behaviors (unlike Rewarded Soups, which struggles with negative coefficients).
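The effect of a negative weight can be seen on a toy example. The numbers below are invented, and the choice of weights (1.5, -0.5), which still sum to 1, is an assumption made for illustration rather than a setting taken from the paper. The point is only that, in a log-linear combination, a negative weight lowers the probability of tokens the down-weighted model favours.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def mix(logits_list, weights):
    """Weighted sum of logits followed by a softmax."""
    return softmax(sum(w * l for w, l in zip(weights, logits_list)))

# Made-up logits over a 3-token vocabulary.
helpful_logits = np.array([2.0, 0.5, 0.0])    # favours token 0
unwanted_logits = np.array([0.0, 0.5, 2.0])   # favours token 2

balanced = mix([helpful_logits, unwanted_logits], [0.5, 0.5])
steered_away = mix([helpful_logits, unwanted_logits], [1.5, -0.5])
helpful_only = softmax(helpful_logits)
```

Here `steered_away` assigns token 2 a lower probability than either `balanced` or `helpful_only`, so the negative weight pushes further from the unwanted behaviour than simply ignoring that model would.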

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning from Human Feedback (RLHF)
f-divergences (specifically KL-divergence)
Convex Optimization (Legendre transform, convex conjugates)

Key Terms

MOD: Multi-Objective Decoding—the proposed algorithm that combines predictions of base models at inference time

Legendre transform: A mathematical operation that relates a function to its convex conjugate; used here to map between reward space and policy space

PPO: Proximal Policy Optimization—an RL algorithm used for alignment

DPO: Direct Preference Optimization—an algorithm optimizing a policy to satisfy preferences without an explicit reward model

strong-barrier function: A regularizing function (like x log x for KL) that is continuously differentiable and strongly convex, allowing a bijective mapping between rewards and policies

Pareto frontier: The set of optimal solutions where no objective can be improved without degrading another

SFT: Supervised Fine-Tuning—the initial training phase of a model on instruction data

logit: The raw, unnormalized output scores of a neural network before the softmax layer

f-divergence: A family of measures quantifying the difference between two probability distributions (includes KL, Reverse KL, etc.)