Evaluation Setup
Evaluation of translation quality on standard WMT benchmarks, plus an analysis of reference quality on FLORES-200.
Benchmarks:
- WMT'21, WMT'22, WMT'23 (Machine Translation Test Sets)
- FLORES-200 (Machine Translation; used for analysis and training data)
Metrics:
- KIWI-XXL (Reference-free)
- XCOMET (Reference-free)
- Win Ratio (vs Gold Reference)
- Statistical methodology: Not explicitly reported in the paper
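The Win Ratio above can be read as the percentage of segments where a reference-free metric (KIWI-XXL or XCOMET) scores the system output above the gold reference. A minimal sketch, with an illustrative function name and tie-handling not taken from the paper:

```python
def win_ratio(system_scores, reference_scores):
    """Percentage of segments where the system translation outscores the
    gold reference under a reference-free metric (e.g. KIWI-XXL).

    Hypothetical helper: the paper's exact tie-handling is not specified here,
    so ties count as losses in this sketch.
    """
    assert len(system_scores) == len(reference_scores)
    wins = sum(s > r for s, r in zip(system_scores, reference_scores))
    return 100.0 * wins / len(system_scores)

# Toy example: the system beats the reference on 3 of 4 segments.
print(win_ratio([0.9, 0.8, 0.7, 0.5], [0.6, 0.9, 0.6, 0.4]))  # -> 75.0
```

A baseline of 0.00 in the table then corresponds to comparing the gold reference against itself, which by this definition can never "win."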
Key Results
Note: analysis of training data quality reveals that 'Gold' human references are frequently inferior to model-generated translations, motivating the need for preference optimization over simple imitation.

| Benchmark | Metric | Baseline | This Paper | Δ |
| --- | --- | --- | --- | --- |
| FLORES-200 (xx->en average) | Win Ratio (KIWI-XXL) | 0.00 | 73.24 | +73.24 |
| FLORES-200 (xx->en average) | Win Ratio (XCOMET) | 0.00 | 60.17 | +60.17 |
| FLORES-200 (en->xx average) | Win Ratio (KIWI-XXL) | 0.00 | 41.87 | +41.87 |
Main Takeaways
- Human references in standard datasets (FLORES-200) are often 'gilded' rather than gold, with models like ALMA and GPT-4 frequently producing superior translations.
- The proposed ALMA-R model (trained with CPO) matches or exceeds GPT-4 and WMT competition winners on the WMT'21, '22, and '23 test sets (quantitative deltas for ALMA-R are not extracted here; the results are visualized in Figure 1 of the paper).
- CPO effectively utilizes 'dis-preferred' translations—which may still be high quality but imperfect—to teach the model to avoid minor errors, a signal SFT ignores.
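The contrast between preferred and dis-preferred translations described above can be sketched as a loss on one preference pair: a DPO-style preference term (no reference model) plus an NLL term on the preferred translation. The sketch assumes per-sequence log-probabilities are already computed; the function name, `beta` value, and reduction are illustrative, not taken verbatim from the paper.

```python
import math

def cpo_loss(logp_preferred, logp_dispreferred, beta=0.1):
    """Sketch of a CPO-style objective on one preference pair.

    logp_preferred / logp_dispreferred: model log-probabilities of the
    preferred and dis-preferred translations given the source sentence.
    The preference term pushes the preferred translation above the
    dis-preferred one; the NLL term keeps probability mass on the
    preferred translation. beta is an illustrative assumption.
    """
    margin = beta * (logp_preferred - logp_dispreferred)
    prefer_loss = -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)
    nll_loss = -logp_preferred
    return prefer_loss + nll_loss

# Widening the margin in favour of the preferred translation lowers the loss,
# even when the dis-preferred translation is itself fairly probable.
print(cpo_loss(-5.0, -20.0) < cpo_loss(-5.0, -6.0))  # -> True
```

This is the signal SFT ignores: plain imitation only sees the preferred output, whereas the margin term above also learns from the near-miss dis-preferred translation.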