MAPO: Advancing Multilingual Reasoning through Multilingual Alignment-as-Preference Optimization

Shuaijie She, Shujian Huang, Wei Zou, Wenhao Zhu, Xiang Liu, Xiang Geng, Jiajun Chen
National Key Laboratory for Novel Software Technology, Nanjing University
Annual Meeting of the Association for Computational Linguistics (2024)
Reasoning RL Benchmark

📝 Paper Summary

Multilingual Reasoning LLM Alignment
MAPO improves multilingual reasoning in LLMs by using translation probabilities to align reasoning processes in non-dominant languages with those in the dominant language (English) via preference optimization.
Core Problem
LLMs exhibit inconsistent reasoning abilities across languages, performing significantly better in English than in other languages due to training data imbalance.
Why it matters:
  • Current supervised fine-tuning (SFT) relies on translated data, which is scarce, expensive, and prone to translation errors
  • SFT only fills missing data gaps but fails to narrow the inherent capability gap between dominant and non-dominant languages
  • Without alignment, models struggle to generalize reasoning skills to out-of-domain tasks in lower-resource languages
Concrete Example: A model might solve a complex math problem correctly in English but fail in Thai because the internal reasoning path in Thai diverges from the successful English pattern. SFT on translated data forces the model to mimic the final answer without necessarily aligning the underlying reasoning logic.
Key Novelty
Multilingual-Alignment-as-Preference Optimization (MAPO)
  • Uses an off-the-shelf translation model to score how well a reasoning chain in a non-dominant language aligns with the dominant language (English) version
  • Treats this translation probability as a reward signal: if the non-dominant-language reasoning translates well into English, it is considered better and is preferred
  • Optimizes the model using PPO or DPO to favor these highly-aligned reasoning paths without needing expensive human annotation
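The pipeline above can be sketched in a few lines: score candidate reasoning chains by how well they align with the English version, pick the best and worst as a (chosen, rejected) pair, and train with a DPO-style loss. This is an illustrative sketch, not the paper's implementation: `alignment_score` here is a toy token-overlap stand-in for the off-the-shelf translation model's probability, and the function names are hypothetical.

```python
import math

def alignment_score(non_en_reasoning: str, en_reasoning: str) -> float:
    """Toy stand-in for the translation-based reward. MAPO would use an
    off-the-shelf translation model's probability of producing the English
    reasoning from the non-English one; normalized token overlap is used
    here purely so the sketch is self-contained."""
    src = set(non_en_reasoning.lower().split())
    ref = set(en_reasoning.lower().split())
    if not ref:
        return 0.0
    return len(src & ref) / len(ref)

def build_preference_pair(candidates, en_reference):
    """Rank candidate reasoning chains by alignment with the English
    reference and return (chosen, rejected) for preference optimization."""
    ranked = sorted(candidates,
                    key=lambda c: alignment_score(c, en_reference),
                    reverse=True)
    return ranked[0], ranked[-1]

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss, given sequence log-probabilities under the policy
    (pi_*) and the frozen reference model (ref_*)."""
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid(margin))

if __name__ == "__main__":
    en = "add 3 and 4 to get 7 then multiply by 2 to get 14"
    candidates = [
        "add 3 and 4 to get 7 then multiply by 2 to get 14",  # well aligned
        "the answer is 14",                                   # poorly aligned
    ]
    chosen, rejected = build_preference_pair(candidates, en)
    print(chosen == candidates[0])  # the aligned chain is preferred
```

The key design point is that the preference signal is fully automatic: no human ever labels which reasoning chain is better; the translation model acts as the oracle.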
Evaluation Highlights
  • +16.2% accuracy improvement on the out-of-domain MSVAMP benchmark for MathOctopus-7B
  • +13.3% accuracy improvement on MNumGLUESub for MathOctopus-7B
  • +6.1% accuracy improvement on MGSM for MathOctopus-7B, achieving state-of-the-art results among 7B models
Breakthrough Assessment
7/10
Novel application of translation models as preference oracles for reasoning alignment. Shows strong empirical gains, especially on out-of-domain tasks, effectively addressing the multilingual gap without new human labels.