Evaluation Setup
Evaluation is run on standard benchmarks for math, code generation, instruction following, and alignment
Benchmarks:
- GSM8K (Mathematical reasoning)
- HumanEval (Code generation)
- MBPP (Code generation)
- IFEval (Instruction following)
- Arena-Hard (Chatbot-Arena-style alignment)
Metrics:
- Accuracy (GSM8K, IFEval)
- Pass@1 (HumanEval, MBPP; see the estimator sketch after this list)
- Win Rate (Arena-Hard)
- Statistical methodology: Not explicitly reported in the paper
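For reference, Pass@1 is typically computed with the standard unbiased pass@k estimator (Chen et al., 2021). The paper does not state its exact sampling protocol, so the sample counts below are purely illustrative; this is a minimal sketch, not the paper's evaluation code.

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n - c, k) / C(n, k),
    where n completions were sampled and c of them passed all unit tests."""
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# With k = 1 this reduces to the passing fraction c / n:
print(pass_at_k(n=20, c=5, k=1))  # 0.25
```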
Main Takeaways
- VRPO consistently improves over the SFT baseline across all 5 benchmarks (math, code, instruction following, and alignment)
- LLaDA 1.5 achieves the highest math scores among strong MDM baselines and is competitive with autoregressive models such as Llama 3 on math tasks
- The variance-reduction techniques (increasing the Monte Carlo sampling budget, allocating that budget optimally, and antithetic sampling) are both empirically effective and theoretically grounded; a toy illustration of antithetic sampling follows below
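To make the last point concrete, here is a minimal, self-contained illustration of antithetic sampling as a generic variance-reduction device. It uses a toy Gaussian integrand and is not the paper's ELBO estimator; it only shows why pairing each sample with its "mirror" lowers estimator variance at a fixed sampling budget.

```python
import numpy as np

rng = np.random.default_rng(0)
f = np.exp  # toy integrand; E[f(Z)] = exp(0.5) for Z ~ N(0, 1)

def naive_estimate(n: int) -> float:
    # Plain Monte Carlo: average f over n independent draws.
    z = rng.standard_normal(n)
    return f(z).mean()

def antithetic_estimate(n: int) -> float:
    # Each draw z is paired with its negation -z; for a monotone integrand
    # the pair (f(z), f(-z)) is negatively correlated, so their average has
    # lower variance than two independent draws at the same total budget n.
    z = rng.standard_normal(n // 2)
    return (0.5 * (f(z) + f(-z))).mean()

naive = [naive_estimate(1_000) for _ in range(2_000)]
anti = [antithetic_estimate(1_000) for _ in range(2_000)]
print(np.var(naive), np.var(anti))  # antithetic variance is noticeably smaller
```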