
Rethinking RL Evaluation: Can Benchmarks Truly Reveal Failures of RL Methods?

Zihan Chen, Yiming Zhang, Hengguang Zhou, Zenghui Ding, Yining Sun, Cho-Jui Hsieh
Hefei Institutes of Physical Science, Chinese Academy of Sciences, University of Science and Technology of China, University of California, Los Angeles
arXiv (2025)
RL Reasoning Benchmark

📝 Paper Summary

Reinforcement Learning for LLMs LLM Evaluation Generalization in RL
Current benchmarks fail to distinguish true generalization from overfitting in RL-tuned LLMs, as evidenced by a vanishing performance gap between models trained on the training set and models trained directly on the test set.
Core Problem
Standard benchmarks assume that performance on a held-out test set implies generalization, but RL-tuned LLMs achieve nearly identical scores when trained directly on the test set, invalidating this assumption.
Why it matters:
  • High benchmark scores currently reported for RL methods (like PPO/GRPO) may be illusory, rewarding memorization or narrow pattern matching rather than robust reasoning capabilities
  • Existing evaluation protocols mask critical brittleness: models fail catastrophically when problem difficulty increases or when semantic rules are slightly altered (counterfactuals)
  • The community lacks diagnostic metrics to determine if an RL agent has actually learned transferable reasoning skills or just exploited the benchmark distribution
Concrete Example: When a Qwen2.5-7B model trained on standard math problems is given a 'counterfactual' problem where the order of operations is redefined to 'PESAMD' (Parentheses, Exponents, Subtraction...), it ignores the new rule and defaults to the memorized PEMDAS convention, showing that it recites learned patterns rather than deducing from the stated premises.
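To make the counterfactual setup concrete, here is a small illustrative sketch (not from the paper) of why redefined precedence rules change ground-truth answers: the same expression evaluates differently under standard PEMDAS than under a hypothetical rule where subtraction binds tighter than multiplication. A model that truly follows the stated premises must track the precedence table, not a memorized convention.

```python
def evaluate(tokens, precedence):
    """Shunting-yard evaluation of a flat token list (no parentheses),
    with operator precedence supplied as a parameter."""
    ops = {"+": lambda a, b: a + b, "-": lambda a, b: a - b,
           "*": lambda a, b: a * b, "/": lambda a, b: a / b}
    values, stack = [], []

    def apply_op(op):
        b, a = values.pop(), values.pop()
        values.append(ops[op](a, b))

    for tok in tokens:
        if tok in ops:
            # Pop operators of equal or higher precedence (left-associative).
            while stack and precedence[stack[-1]] >= precedence[tok]:
                apply_op(stack.pop())
            stack.append(tok)
        else:
            values.append(float(tok))
    while stack:
        apply_op(stack.pop())
    return values[0]

expr = "8 - 2 * 3".split()
pemdas = {"*": 2, "/": 2, "+": 1, "-": 1}          # standard rules
counterfactual = {"-": 3, "*": 2, "/": 2, "+": 1}  # subtraction binds first

evaluate(expr, pemdas)          # 8 - (2 * 3) = 2.0
evaluate(expr, counterfactual)  # (8 - 2) * 3 = 18.0
```

A model that answers 2.0 when the prompt explicitly specifies the counterfactual precedence table is reciting the memorized convention rather than applying the stated rule.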
Key Novelty
Oracle Performance Gap (OPG) and Diagnostic Stress Tests
  • Introduces OPG to quantify the 'vanished generalization gap': compares an RL model trained on the training set to an 'Oracle' trained directly on the test set; a near-zero gap implies the benchmark fails to test generalization
  • Proposes a suite of stress tests (Difficulty, Distributional, Counterfactual) to break the 'average score' illusion and reveal where models fail to generalize
  • Establishes three principles for future benchmarks: sufficient difficulty stratification, distributional robustness checks, and balanced evaluation to prevent masking failures
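The OPG idea described above can be sketched in a few lines. The helper names and the exact formula here are assumptions based on the summary: OPG is taken as the test-set accuracy of an "oracle" model (tuned directly on the test set) minus that of the normally trained RL model; a near-zero gap suggests the benchmark cannot distinguish generalization from memorization.

```python
def accuracy(predictions, answers):
    """Fraction of exact-match correct predictions."""
    assert len(predictions) == len(answers)
    return sum(p == a for p, a in zip(predictions, answers)) / len(answers)

def oracle_performance_gap(oracle_preds, rl_preds, gold):
    """OPG = oracle test accuracy - RL-model test accuracy (assumed form).
    Near-zero OPG implies the held-out test set is not actually
    testing generalization."""
    return accuracy(oracle_preds, gold) - accuracy(rl_preds, gold)

# Toy test-set answers and model outputs (illustrative only):
gold   = ["4", "7", "12", "9"]
oracle = ["4", "7", "12", "9"]   # tuned directly on the test set
rl     = ["4", "7", "12", "8"]   # tuned on the training set

opg = oracle_performance_gap(oracle, rl, gold)  # 1.00 - 0.75 = 0.25
```

In this toy case the RL model trails the oracle by 25 points, i.e. the benchmark still measures generalization; the paper's finding is that for RL-tuned LLMs on MATH, GSM8K, and HeadQA this gap collapses to roughly zero.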
Evaluation Highlights
  • RL models trained on the train set achieve nearly identical performance to 'Oracle' models trained on the test set (OPG ≈ 0%) across MATH, GSM8K, and HeadQA, unlike SFT models which maintain a healthy gap
  • In counterfactual stress tests, Qwen2.5-7B accuracy drops from 74.8% (standard) to 41.2% (counterfactual), confirming reliance on memorized patterns over deductive reasoning
  • On Out-of-Distribution (OOD) math problems, specialized RL models perform worse than the un-tuned base model (falling below baseline accuracy) as semantic distance from training data increases
Breakthrough Assessment
9/10
A critical wake-up call for the RLHF/reasoning community. Systematically debunks the 'unseen test set' assumption for RL and provides rigorous diagnostic tools (OPG) to measure true generalization.