Evaluation Setup
Four domains: Multi-armed bandits, Simulated Control (MuJoCo), Machine Translation, Instruction Following
Benchmarks:
- Multi-armed bandit (synthetic numeric optimization) [New]
- MO-Gymnasium mo-reacher-v5 (simulated robotic control)
- WMT-24 En-Ja and En-Zh (machine translation)
- AlpacaFarm (instruction following)
Metrics:
- Average Return / Reward
- BLEURT (Translation Accuracy)
- jReadability / TRank (Readability)
- GPT-Eval Win Rate
- Language Detection Rate (Non-Chinese %)
- Statistical methodology: bandit and control experiments run with 5 random seeds.
Key Results
Machine Translation (En-Zh) results showing reward hacking mitigation:

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| WMT-24 (En-Zh) with Llama | Non-Chinese output rate (w/o penalty) | 68.7% | 5.6% | -63.1% |
| WMT-24 (En-Zh) with Llama | GPT-Eval Win Rate | 71.5% | 74.0% | +2.5% |

Simulated Control (Mo-Reacher) results showing balanced optimization:

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Mo-Reacher-v5 | Average Total Reward per step | -15.71 | -6.10 | +9.61 |

Instruction Following (AlpacaFarm) results:

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| AlpacaFarm | Length Reward (Llama) | 0.37 | 0.42 | +0.05 |
Main Takeaways
- MO-GRPO consistently prevents reward hacking across diverse domains (bandits, robotics, NLP) by ensuring high-variance rewards do not dominate.
- In translation tasks, GRPO tends to output the source language (English) to hack readability metrics, a failure mode largely eliminated by MO-GRPO (non-Chinese output drops from 68.7% to 5.6%).
- The method is robust to affine transformations of reward scales, eliminating the need for manual reward weight tuning.
- Even in adversarial settings (AlpacaFarm with conflicting length/quality rewards), MO-GRPO finds a better balance than GRPO.
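The core mechanism behind these takeaways can be illustrated with a short sketch. This is not the authors' implementation; it assumes (as the takeaways describe) that MO-GRPO standardizes each reward component independently within the sampled group before summing, whereas vanilla GRPO normalizes a single summed reward, letting a high-variance component dominate. The function names and the `1e-8` guard are illustrative choices.

```python
import numpy as np

def mo_grpo_advantages(rewards):
    """Sketch of per-objective group normalization.

    rewards: array of shape (group_size, n_objectives).
    Each reward component is standardized across the group
    separately, so no single high-variance reward dominates,
    and positive affine rescaling of any component cancels out.
    """
    rewards = np.asarray(rewards, dtype=float)
    mean = rewards.mean(axis=0)
    std = rewards.std(axis=0) + 1e-8  # guard against zero variance
    return ((rewards - mean) / std).sum(axis=1)

def grpo_advantages(rewards):
    """Vanilla GRPO baseline for comparison: sum the rewards first,
    then normalize the single scalar across the group. Rescaling one
    component changes which samples look best."""
    total = np.asarray(rewards, dtype=float).sum(axis=1)
    return (total - total.mean()) / (total.std() + 1e-8)
```

Rescaling one objective (e.g. multiplying a length reward by 100 and adding an offset) leaves `mo_grpo_advantages` unchanged but reshuffles the vanilla GRPO advantages, which is why no manual reward-weight tuning is needed.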