Evaluation Setup
Evaluation of translation quality on standard WMT benchmarks, plus an analysis of reference quality on FLORES-200.
Benchmarks:
- WMT'21, WMT'22, WMT'23 (Machine Translation Test Sets)
- FLORES-200 (Machine Translation; used for analysis and training data)
Metrics:
- KIWI-XXL (Reference-free)
- XCOMET (Reference-free)
- Win Ratio (vs Gold Reference)
- Statistical methodology: Not explicitly reported in the paper
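The Win Ratio above can be read as the percentage of segments where a reference-free metric (KIWI-XXL or XCOMET) scores the system output above the gold reference. A minimal sketch, with an illustrative function name and tie-handling not taken from the paper:

```python
def win_ratio(system_scores, reference_scores):
    """Percentage of segments where the system translation outscores the
    gold reference under a reference-free metric (e.g. KIWI-XXL).

    Hypothetical helper: the paper's exact tie-handling is not specified here,
    so ties count as losses in this sketch.
    """
    assert len(system_scores) == len(reference_scores)
    wins = sum(s > r for s, r in zip(system_scores, reference_scores))
    return 100.0 * wins / len(system_scores)

# Toy example: the system beats the reference on 3 of 4 segments.
print(win_ratio([0.9, 0.8, 0.7, 0.5], [0.6, 0.9, 0.6, 0.4]))  # -> 75.0
```

A baseline of 0.00 in the table then corresponds to comparing the gold reference against itself, which by this definition can never "win."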
Key Results
Note: analysis of training data quality reveals that 'Gold' human references are frequently inferior to model-generated translations, motivating the need for preference optimization over simple imitation.

| Benchmark | Metric | Baseline | This Paper | Δ |
| --- | --- | --- | --- | --- |
| FLORES-200 (xx->en average) | Win Ratio (KIWI-XXL) | 0.00 | 73.24 | +73.24 |
| FLORES-200 (xx->en average) | Win Ratio (XCOMET) | 0.00 | 60.17 | +60.17 |
| FLORES-200 (en->xx average) | Win Ratio (KIWI-XXL) | 0.00 | 41.87 | +41.87 |
Main Takeaways
- Human references in standard datasets (FLORES-200) are often 'gilded' rather than gold, with models like ALMA and GPT-4 frequently producing superior translations.
- The proposed ALMA-R model (trained with CPO) matches or exceeds GPT-4 and WMT competition winners on the WMT'21, '22, and '23 test sets (quantitative deltas for ALMA-R are not extracted here; the results are visualized in Figure 1 of the paper).
- CPO effectively utilizes 'dis-preferred' translations—which may still be high quality but imperfect—to teach the model to avoid minor errors, a signal SFT ignores.
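The contrast between preferred and dis-preferred translations described above can be sketched as a loss on one preference pair: a DPO-style preference term (no reference model) plus an NLL term on the preferred translation. The sketch assumes per-sequence log-probabilities are already computed; the function name, `beta` value, and reduction are illustrative, not taken verbatim from the paper.

```python
import math

def cpo_loss(logp_preferred, logp_dispreferred, beta=0.1):
    """Sketch of a CPO-style objective on one preference pair.

    logp_preferred / logp_dispreferred: model log-probabilities of the
    preferred and dis-preferred translations given the source sentence.
    The preference term pushes the preferred translation above the
    dis-preferred one; the NLL term keeps probability mass on the
    preferred translation. beta is an illustrative assumption.
    """
    margin = beta * (logp_preferred - logp_dispreferred)
    prefer_loss = -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)
    nll_loss = -logp_preferred
    return prefer_loss + nll_loss

# Widening the margin in favour of the preferred translation lowers the loss,
# even when the dis-preferred translation is itself fairly probable.
print(cpo_loss(-5.0, -20.0) < cpo_loss(-5.0, -6.0))  # -> True
```

This is the signal SFT ignores: plain imitation only sees the preferred output, whereas the margin term above also learns from the near-miss dis-preferred translation.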