Evaluation Setup
Translation quality is evaluated on standard benchmarks covering 50 languages.
Benchmarks:
- Flores-200 (Many-to-Many Translation)
- WMT23 (News Translation)
Metrics:
- COMET-22
- SacreBLEU
- Statistical methodology: Not explicitly reported in the paper
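As a reference point for the metrics above, the following is a simplified, tokenizer-free sketch of the corpus-level BLEU computation that SacreBLEU standardizes (geometric mean of modified n-gram precisions times a brevity penalty). The real tool additionally handles tokenization, smoothing, and reproducible score signatures; this sketch assumes whitespace-tokenized input and no smoothing.

```python
from collections import Counter
import math

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses, references, max_n=4):
    """Simplified corpus BLEU (0-100): modified n-gram precision
    pooled over the corpus, times a brevity penalty. No smoothing."""
    matches = [0] * max_n   # clipped n-gram matches per order
    totals = [0] * max_n    # hypothesis n-gram counts per order
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_ngrams, r_ngrams = ngrams(h, n), ngrams(r, n)
            # Clip each hypothesis n-gram count by its reference count.
            matches[n - 1] += sum(min(c, r_ngrams[g]) for g, c in h_ngrams.items())
            totals[n - 1] += max(len(h) - n + 1, 0)
    if min(matches) == 0:
        return 0.0  # any empty precision zeroes the geometric mean
    log_prec = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * bp * math.exp(log_prec)
```

COMET-22, by contrast, is a learned neural metric and cannot be sketched this way; it requires the Unbabel COMET package and a downloaded model checkpoint.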
Key Results
Comparison with state-of-the-art baselines on Flores-200 shows the proposed method (DAT/DATM) approaches the performance of the resource-heavy X-ALMA (SFT) baseline while using significantly less compute.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Flores-200 | COMET-22 | 88.2 | 87.6 | -0.6 |
| Flores-200 | COMET-22 | 83.2 | 82.8 | -0.4 |
| WMT23 | COMET-22 | 88.9 | 87.8 | -1.1 |
| WMT23 | COMET-22 | 85.6 | 84.8 | -0.8 |
| Training Cost | Pre-training Tokens (B) | 110 | 20 | -90 |
| Flores-200 | COMET-22 | 79.7 | 82.8 | +3.1 |
Main Takeaways
- Linguistic conflicts are asymmetric: XX→En translation suffers heavily from interference in multilingual training, while En→XX benefits from synergy.
- The bottleneck for LLM-based MMT lies in post-training; a simple multilingual pre-training stage (20B tokens) is sufficient if post-training is handled correctly.
- Model merging degrades performance asymmetrically: it hurts En→XX (synergy-heavy) directions significantly more than XX→En directions, justifying the selective merging strategy.
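The selective merging idea above can be illustrated with generic weight interpolation: merge expert and base parameters only where merging is known to help, and keep the base weights elsewhere. This is a minimal sketch, not the paper's DATM procedure; the function name, the flat list-of-floats parameter representation, and the `alpha` coefficient are illustrative assumptions.

```python
def merge_models(base, expert, alpha=0.5, merge_keys=None):
    """Linearly interpolate two state dicts (name -> list of floats).

    If merge_keys is given, only those parameters are interpolated
    (selective merging); all other parameters keep the base weights.
    """
    merged = {}
    for name, weights in base.items():
        if merge_keys is None or name in merge_keys:
            # (1 - alpha) * base + alpha * expert, elementwise
            merged[name] = [(1 - alpha) * b + alpha * e
                            for b, e in zip(weights, expert[name])]
        else:
            merged[name] = list(weights)  # untouched: copy base weights
    return merged
```

With `merge_keys` restricted to the interference-prone (e.g. XX→En-relevant) parameters, the synergy-heavy directions are shielded from the averaging that would otherwise degrade them.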