DPO: Direct Preference Optimization—an algorithm that optimizes language models to match preferences directly without training a separate reward model
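To make the definition concrete, here is a minimal sketch of the per-pair DPO loss. The function names and the scalar-input framing are illustrative assumptions; inputs are sequence log-probabilities under the policy being trained and a frozen reference model.

```python
import math

def dpo_loss(pi_chosen: float, pi_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for a single preference pair (sketch).

    Each argument is a sequence log-probability: pi_* under the
    policy, ref_* under the frozen reference model. beta scales
    the implicit reward margin.
    """
    # Implicit reward margin between chosen and rejected completions
    logits = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # -log(sigmoid(logits)): low loss when the policy prefers the chosen answer
    return -math.log(1.0 / (1.0 + math.exp(-logits)))
```

When the policy and reference agree exactly, the margin is zero and the loss is log 2; shifting probability mass toward the chosen completion drives the loss down.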
PPO: Proximal Policy Optimization—a reinforcement learning algorithm that updates policies iteratively while preventing drastic changes that could destabilize training
SFT: Supervised Fine-Tuning—training a model on labeled input-output pairs to teach it specific behaviors or tasks
ACR: Answer Consistency Ratio—a metric measuring the overlap between the sets of questions answered correctly in a non-dominant language and in English
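The ACR definition above can be sketched as a set overlap. The exact normalization (here, dividing by the number of English-correct questions) is an assumption; the source only specifies that the metric measures overlap.

```python
def answer_consistency_ratio(correct_lang: set, correct_en: set) -> float:
    """Sketch of ACR: fraction of questions answered correctly in
    English that are also answered correctly in the non-dominant
    language. Normalization by the English-correct set is assumed.
    """
    if not correct_en:
        return 0.0  # no English-correct questions to compare against
    return len(correct_lang & correct_en) / len(correct_en)
```

For example, if a model solves questions {1, 2, 3, 4} in English but only {1, 2} in Swahili, the ratio is 0.5.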
PPL-based Alignment Score: A score derived from the perplexity of a translation model, indicating how well a non-English reasoning chain aligns with an English one
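A minimal sketch of turning translation-model perplexity into an alignment score. The inverse-perplexity mapping and the per-token log-probability input format are assumptions; the source only states that the score is derived from the translation model's perplexity, with lower perplexity indicating better alignment.

```python
import math

def ppl_alignment_score(token_logprobs: list) -> float:
    """Sketch: map translation-model perplexity to an alignment score.

    token_logprobs: per-token log-probabilities the translation model
    assigns when scoring one reasoning chain against the other.
    Lower perplexity -> higher score (inverse mapping is assumed).
    """
    n = len(token_logprobs)
    ppl = math.exp(-sum(token_logprobs) / n)  # standard perplexity
    return 1.0 / ppl  # score in (0, 1], 1.0 means perfect alignment
```

A chain the translation model finds perfectly predictable (all log-probs 0) scores 1.0; less predictable chains score lower.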
MGSM: Multilingual Grade School Math—a benchmark dataset for evaluating mathematical reasoning across multiple languages
Iterative DPO: Running multiple rounds of DPO where the model generates new samples to update the preference dataset for the next round
dominant language: The language in which the model performs best (usually English, owing to the abundance of English training data)
NLLB: No Language Left Behind—a state-of-the-art open-source multilingual translation model used here as the reward model