DPO: Direct Preference Optimization—an algorithm that optimizes language models to match preferences directly without training a separate reward model
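To make the definition concrete, here is a minimal sketch of the per-pair DPO loss. The function names and the scalar-input framing are illustrative assumptions; inputs are sequence log-probabilities under the policy being trained and a frozen reference model.

```python
import math

def dpo_loss(pi_chosen: float, pi_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for a single preference pair (sketch).

    Each argument is a sequence log-probability: pi_* under the
    policy, ref_* under the frozen reference model. beta scales
    the implicit reward margin.
    """
    # Implicit reward margin between chosen and rejected completions
    logits = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # -log(sigmoid(logits)): low loss when the policy prefers the chosen answer
    return -math.log(1.0 / (1.0 + math.exp(-logits)))
```

When the policy and reference agree exactly, the margin is zero and the loss is log 2; shifting probability mass toward the chosen completion drives the loss down.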
PPO: Proximal Policy Optimization—a reinforcement learning algorithm that updates policies iteratively while preventing drastic changes that could destabilize training
SFT: Supervised Fine-Tuning—training a model on labeled input-output pairs to teach it specific behaviors or tasks
ACR: Answer Consistency Ratio—a metric measuring the overlap between the sets of questions answered correctly in a non-dominant language and in English
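The ACR definition above can be sketched as a set overlap. The exact normalization (here, dividing by the number of English-correct questions) is an assumption; the source only specifies that the metric measures overlap.

```python
def answer_consistency_ratio(correct_lang: set, correct_en: set) -> float:
    """Sketch of ACR: fraction of questions answered correctly in
    English that are also answered correctly in the non-dominant
    language. Normalization by the English-correct set is assumed.
    """
    if not correct_en:
        return 0.0  # no English-correct questions to compare against
    return len(correct_lang & correct_en) / len(correct_en)
```

For example, if a model solves questions {1, 2, 3, 4} in English but only {1, 2} in Swahili, the ratio is 0.5.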
PPL-based Alignment Score: A score derived from the perplexity of a translation model, indicating how well a non-English reasoning chain aligns with an English one
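A minimal sketch of turning translation-model perplexity into an alignment score. The inverse-perplexity mapping and the per-token log-probability input format are assumptions; the source only states that the score is derived from the translation model's perplexity, with lower perplexity indicating better alignment.

```python
import math

def ppl_alignment_score(token_logprobs: list) -> float:
    """Sketch: map translation-model perplexity to an alignment score.

    token_logprobs: per-token log-probabilities the translation model
    assigns when scoring one reasoning chain against the other.
    Lower perplexity -> higher score (inverse mapping is assumed).
    """
    n = len(token_logprobs)
    ppl = math.exp(-sum(token_logprobs) / n)  # standard perplexity
    return 1.0 / ppl  # score in (0, 1], 1.0 means perfect alignment
```

A chain the translation model finds perfectly predictable (all log-probs 0) scores 1.0; less predictable chains score lower.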
MGSM: Multilingual Grade School Math—a benchmark dataset for evaluating mathematical reasoning across multiple languages
Iterative DPO: Running multiple rounds of DPO where the model generates new samples to update the preference dataset for the next round
dominant language: The language in which the model performs best (usually English, owing to the abundance of English training data)
NLLB: No Language Left Behind—a state-of-the-art open-source multilingual translation model used here as the reward model