R-Bench: Graduate-level Multi-disciplinary Benchmarks for LLM & MLLM Complex Reasoning Evaluation

📝 Paper Summary

LLM Evaluation Multimodal Reasoning Benchmarks

R-Bench is a rigorous, graduate-level benchmark for text and multimodal reasoning that reveals significant gaps in current SOTA models, particularly in multimodal contexts where even OpenAI o1 achieves only ~53% accuracy.

Core Problem

Existing reasoning benchmarks (like MMLU and MMMU) are becoming saturated by advanced models and fail to distinguish complex reasoning capabilities (System-II) from simple knowledge retrieval (System-I), often lacking rigorous difficulty calibration or multilingual balance.

Why it matters:

Current benchmarks like MMLU are nearing saturation (o1 achieves 92.3%), limiting their utility for guiding future model improvements.
Evaluating 'slow' deliberate reasoning requires different data than 'quick' intuitive thinking; most benchmarks conflate the two.
Multimodal and multilingual reasoning are often tested separately, failing to assess if models have truly internalized reasoning skills across modalities and languages.

Concrete Example: While OpenAI o1 achieves 92.3% on the undergraduate-level MMLU benchmark, it only scores 53.2% on the multimodal section of R-Bench, highlighting a massive gap between text-based knowledge retrieval and visual complex reasoning.

Key Novelty

ReasoningBench (R-Bench)

Constructed from graduate-level exams and homework across 108 subjects at Tsinghua University, ensuring high difficulty and rigorous expert verification.
Uses a novel 'Model-Screening' filter where questions are only included if the reasoning-specialized o1 model requires >2,000 reasoning tokens to solve them.
Provides strict one-to-one English-Chinese translation for every question to test cross-lingual reasoning consistency rather than just language proficiency.

Architecture

The 6-step construction pipeline of R-Bench, detailing how raw data is converted into the final benchmark.

Evaluation Highlights

OpenAI o1 achieves 69.0% on text reasoning (R-Bench-T) but drops to 53.2% on multimodal reasoning (R-Bench-M); on both subsets it still significantly outperforms GPT-4o.
GPT-4o shows a massive performance gap between modalities, scoring 53.6% on text questions but only 33.7% on multimodal questions.
Models demonstrate high cross-lingual consistency (>70% for most models), indicating they are learning underlying reasoning patterns rather than overfitting to specific languages.

Breakthrough Assessment

9/10

Establishes a new, much-needed standard for 'System-II' reasoning evaluation where current SOTA fails significantly. The rigorous construction pipeline (expert + o1-token filtering) sets a high bar for future benchmarks.

⚙️ Technical Details

Problem Definition

Setting: Evaluation of Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) on complex reasoning tasks.

Inputs: Graduate-level questions $q$ containing text and optional images, with 6 candidate choices.

Outputs: Predicted option from set of 6 choices (A-F).
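The input/output contract above can be made concrete with a small record type. This is a minimal sketch, not the benchmark's actual schema: the class name `RBenchQuestion`, its field names, and the `render` helper are hypothetical. Only the six-option, single-answer (A-F) format comes from the problem definition.

```python
from dataclasses import dataclass
from typing import Optional

OPTION_LABELS = "ABCDEF"  # six candidate choices, per the problem definition


@dataclass(frozen=True)
class RBenchQuestion:
    """Hypothetical record for one question; field names are illustrative."""
    text: str
    choices: tuple                     # exactly six option strings
    answer: str                        # gold label, one of A-F
    image_path: Optional[str] = None   # None for text-only questions

    def __post_init__(self):
        if len(self.choices) != len(OPTION_LABELS):
            raise ValueError("expected exactly 6 choices")
        if len(self.answer) != 1 or self.answer not in OPTION_LABELS:
            raise ValueError("answer must be a single letter A-F")

    @property
    def is_multimodal(self) -> bool:
        return self.image_path is not None

    def render(self) -> str:
        """Format the question and its lettered options as plain text."""
        lines = [self.text]
        lines += [f"{label}. {choice}"
                  for label, choice in zip(OPTION_LABELS, self.choices)]
        return "\n".join(lines)
```

Validating the option count and the answer letter at construction time keeps malformed items out of an evaluation run early, instead of surfacing as scoring errors later.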

Pipeline Flow

Group: Data Collection (Step 1-2)
Group: Digitization (Step 3)
Group: Filtering (Step 4-5)
Group: Standardization (Step 6)

System Modules

Expert Collection

Source questions from 100+ graduate courses across 19 departments

Model or implementation: Human Experts (51 PhD/Master students)

Data Digitization

Convert raw files (PDF, Word, screenshots) into structured Excel data

Model or implementation: GPT-4o + Mathpix + Human Annotators

Model-Based Filtering (Filtering)

Filter out easy questions by measuring reasoning effort

Model or implementation: OpenAI o1 (API)

Manual Review & Standardization (Filtering)

Check for ambiguity, completeness, and balance; format as single-choice

Model or implementation: Human Experts + GPT-4o (for option generation)

Novel Architectural Elements

Model-Screening via Reasoning Tokens: Using the number of internal 'reasoning tokens' generated by OpenAI o1 (>2000) as a hard filter to ensure questions require System-II thinking.
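A minimal sketch of this screening rule, assuming the reasoning-token count for each question has already been recorded elsewhere (how the count is obtained from the o1 API is deliberately left out). The function name `screen_by_reasoning_tokens`, the dictionary layout, and the question IDs are hypothetical; only the 2,000-token threshold comes from the text.

```python
REASONING_TOKEN_THRESHOLD = 2000  # threshold stated in the summary


def screen_by_reasoning_tokens(token_counts, threshold=REASONING_TOKEN_THRESHOLD):
    """Return IDs of questions whose reasoning-token count exceeds the threshold.

    token_counts: dict mapping a question ID to the number of reasoning tokens
    the screening model used on that question (assumed to be measured elsewhere).
    """
    return [qid for qid, n in token_counts.items() if n > threshold]


counts = {"q1": 512, "q2": 2000, "q3": 4096}  # made-up counts for illustration
kept = screen_by_reasoning_tokens(counts)
print(kept)  # ['q3'], since only q3 strictly exceeds 2000
```

The comparison is strict (`>`), matching the ">2000" wording; a question that used exactly 2,000 tokens would be dropped under this reading.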

Comparison to Prior Work

vs. MMLU: R-Bench is significantly harder (graduate vs undergraduate/general), includes multimodal data, and o1 scores ~69% on R-Bench-T vs ~92% on MMLU
vs. MMMU: R-Bench includes aligned text-only subset for LLMs and strictly aligned bilingual data
vs. FrontierMath/AIME: R-Bench covers 108 subjects across 19 departments, not just mathematics
Novelty: First benchmark to use 'reasoning token count' from advanced reasoning models as a difficulty filter for dataset construction [not cited in paper as prior work]

Limitations

Multimodal subset (665 questions) is smaller than the text subset (1,094 questions).
Reliance on proprietary models (GPT-4o, o1) for OCR and difficulty filtering introduces dependency on closed-source systems.
Proof-based questions were excluded due to automated evaluation difficulties, potentially omitting a class of rigorous reasoning problems.

Reproducibility

Data and code are stated to be publicly available, though the URL is not printed in the text body (it is likely a hyperlink in the original PDF). The benchmark includes English and Chinese versions. Evaluation uses OpenCompass and VLMEvalKit. The CoT prompts are mentioned in the appendix.

📊 Experiments & Results

Evaluation Setup

Zero-shot evaluation with Chain-of-Thought (CoT) prompting on complex reasoning questions.
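To illustrate what zero-shot CoT prompting on a single-choice question can look like, here is a hedged sketch. The instruction wording, the `Answer: <letter>` convention, and the names `build_prompt` and `extract_choice` are all assumptions made for this example; the paper's actual prompts are in its appendix and are not reproduced here.

```python
import re

# Hypothetical instruction wording, not the paper's prompts.
COT_SUFFIX = "Think step by step, then finish with 'Answer: <letter>'."
DIRECT_SUFFIX = "Reply with only 'Answer: <letter>'."


def build_prompt(question_block, use_cot=True):
    """Append a zero-shot instruction (no worked examples) to a rendered question."""
    suffix = COT_SUFFIX if use_cot else DIRECT_SUFFIX
    return f"{question_block}\n\n{suffix}"


def extract_choice(response):
    """Return the last 'Answer: X' letter (A-F) in a response, or None if absent."""
    matches = re.findall(r"Answer:\s*([A-F])\b", response)
    return matches[-1] if matches else None
```

Taking the last match rather than the first is a design choice for CoT outputs, where a model may state a tentative answer mid-reasoning and revise it at the end.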

Benchmarks:

R-Bench-T: text-only complex reasoning, English [New]
R-Bench-M: multimodal complex reasoning, English [New]
R-Bench-T (zh) / R-Bench-M (zh): Chinese versions of the above [New]

Metrics:

Accuracy (Pass@1)
Statistical methodology: Not explicitly reported in the paper
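Accuracy here is the fraction of questions answered correctly on a single attempt. The sketch below is illustrative; `accuracy` and `CHANCE_LEVEL` are names introduced for this example and are not part of OpenCompass or VLMEvalKit. Because every question has six options, uniform random guessing gives an expected accuracy of 1/6 (about 16.7%), a useful floor when reading the scores.

```python
def accuracy(predictions, gold):
    """Fraction of positions where the predicted option letter equals the gold letter."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)


CHANCE_LEVEL = 1 / 6  # expected accuracy of uniform guessing over options A-F

preds = ["A", "C", "F", "B"]  # made-up predictions
gold = ["A", "D", "F", "E"]   # made-up gold labels
print(f"{accuracy(preds, gold):.2%}")  # 50.00%
print(f"{CHANCE_LEVEL:.1%}")           # 16.7%
```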

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Performance on R-Bench-T (Text), comparing GPT-4o (Baseline column) with o1 (This Paper column): reasoning-specialized models outperform general chat models, and open-source models trail further behind.
R-Bench-T	Accuracy	53.6	69.0	+15.4
Performance on R-Bench-M (Multimodal), again GPT-4o versus o1, highlighting the extreme difficulty of the benchmark and the degradation of performance when visual modalities are introduced.
R-Bench-M	Accuracy	33.7	53.2	+19.5
Cross-modal gap for o1 alone, comparing its multimodal accuracy (Baseline column) with its text accuracy (This Paper column) to show how much harder multimodal reasoning is for the same model.
R-Bench (Text vs Multimodal)	Accuracy	53.2	69.0	+15.8
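The Δ column is a plain difference in percentage points. The short check below recomputes each delta from the o1 and GPT-4o scores quoted in this summary; the dictionary layout and variable names are illustrative.

```python
# Accuracy (%) as quoted in this summary.
scores = {
    "o1":     {"text": 69.0, "multimodal": 53.2},
    "GPT-4o": {"text": 53.6, "multimodal": 33.7},
}

# o1 minus GPT-4o on each subset (first two table rows).
model_gap = {
    subset: round(scores["o1"][subset] - scores["GPT-4o"][subset], 1)
    for subset in ("text", "multimodal")
}

# Text minus multimodal for each model (the third row reports this for o1).
modality_gap = {
    model: round(s["text"] - s["multimodal"], 1) for model, s in scores.items()
}

print(model_gap)     # {'text': 15.4, 'multimodal': 19.5}
print(modality_gap)  # {'o1': 15.8, 'GPT-4o': 19.9}
```

Read this way, the text-to-multimodal drop is larger for GPT-4o (19.9 points) than for o1 (15.8 points).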

Experiment Figures

Radar chart or bar chart of GPT-4o performance across different disciplines.

Main Takeaways

Existing multidisciplinary evaluations (MMLU/MMMU) are near saturation for top models like o1, but R-Bench exposes significant remaining weaknesses (o1 only ~69% on text, ~53% on multimodal).
A large gap exists between text and multimodal reasoning capabilities: GPT-4o drops from 53.6% accuracy on text to 33.7% on multimodal tasks, and even the top-scoring model (o1) drops from 69.0% to 53.2%.
Chain-of-Thought (CoT) improves performance for standard chat models (GPT-4o) but provides no benefit for specialized reasoning models (o1-mini), suggesting these models already internalize the reasoning process.
Models show high cross-lingual consistency (>70%), suggesting that reasoning capability is becoming language-agnostic in advanced foundation models.
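One way to quantify cross-lingual consistency is sketched below. It takes consistency to mean the share of aligned question pairs on which a model's English and Chinese answers are either both correct or both incorrect. That definition is an assumption made for illustration, since this summary does not reproduce the paper's exact formula; the function name is also hypothetical.

```python
def cross_lingual_consistency(correct_en, correct_zh):
    """Share of aligned question pairs with the same outcome in both languages.

    correct_en / correct_zh: equal-length lists of booleans, where index i refers
    to the same question in its English and Chinese version.
    """
    if len(correct_en) != len(correct_zh):
        raise ValueError("inputs must be aligned one-to-one")
    if not correct_en:
        raise ValueError("no question pairs given")
    same = sum(en == zh for en, zh in zip(correct_en, correct_zh))
    return same / len(correct_en)


en = [True, True, False, False, True]   # made-up per-question outcomes
zh = [True, False, False, False, True]
print(cross_lingual_consistency(en, zh))  # 0.8
```

A pairwise measure like this depends on the one-to-one English-Chinese alignment of the questions; comparing only aggregate accuracies per language would hide cases where a model solves different questions in each language.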

📚 Prerequisite Knowledge

Prerequisites

Understanding of LLM evaluation benchmarks (MMLU, MMMU)
Distinction between System-I (intuitive) and System-II (deliberate) reasoning
Basics of Chain-of-Thought (CoT) prompting

Key Terms

System-II: Slow, deliberate, and logical reasoning processes (as opposed to fast, intuitive System-I thinking), which this benchmark aims to evaluate.

MMLU: Massive Multitask Language Understanding—a popular benchmark for general knowledge and reasoning, now considered close to saturation.

MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning benchmark, considered a standard for MLLMs.

CoT: Chain-of-Thought—a prompting technique where the model is encouraged to generate intermediate reasoning steps before the final answer.

o1: OpenAI o1—a large language model trained specifically for complex reasoning tasks using reinforcement learning.

OCR: Optical Character Recognition—technology used to convert images of text into machine-encoded text.

Reasoning Tokens: Internal tokens generated by models like OpenAI o1 during their 'thought process' before outputting a visible response; used here as a proxy for question difficulty.