Libra introduces a reasoning-focused reward model benchmark and a generative reward model that uses Chain-of-Thought reasoning to verify answers, overcoming limitations of discriminative scoring.
Core Problem
Current reward models (RMs) struggle with complex reasoning tasks because they lack 'thinking' capabilities and rely on scalar scores that don't align with true correctness.
Why it matters:
Predominant RL training relies on rule-based rewards requiring rigid formats and golden answers, hindering scaling with unlabeled data
Existing RM benchmarks lack challenging questions and diverse responses from advanced reasoning models (like DeepSeek-R1), failing to accurately assess reasoning capabilities
Standard discriminative RMs cannot effectively verify complex logic, acting as weak proxies for human judgment in hard reasoning scenarios
Concrete Example: In complex math problems where the final answer is correct but the logic is flawed (or vice versa), a standard discriminative RM may assign a high score based on surface features. Libra-RM instead runs a 'thinking' process that verifies the reasoning step by step before outputting a judgment.
Key Novelty
Learning-to-Think for Generative Reward Models
Treats the judging process as a verifiable reasoning task itself, where the Reward Model generates a Chain-of-Thought explanation before its final verdict
Constructs a benchmark (Libra Bench) specifically designed to test RMs on hard math problems with responses from advanced reasoning models
Optimizes the Reward Model using rejection sampling and reinforcement learning, mirroring the successful training recipes of reasoning LLMs
Architecture
The overall framework for Libra Bench curation and Libra-RM training
Evaluation Highlights
Libra-RM series achieves state-of-the-art results on reasoning-oriented benchmarks compared to existing RMs
Reasoning models with 'thinking' capabilities (73.7%-78.7% accuracy) significantly outperform non-thinking models (55.1%-69.1%) on Libra Bench
Demonstrates strong correlation between performance on Libra Bench and downstream RL application performance
Breakthrough Assessment
8/10
Strong contribution by applying 'thinking' (inference-time scaling) to the Reward Model itself, rather than just the policy model. The curated benchmark addresses a critical gap in evaluating reasoning RMs.
⚙️ Technical Details
Problem Definition
Setting: Evaluation of Reward Models on complex reasoning tasks using a V2V (Verifiable reasoning to Verifiable judging) pipeline
Inputs: A reasoning problem q, a candidate answer a, and a golden reference (for benchmark construction)
Outputs: A binary judgment of correctness (0 or 1) derived from a generative reasoning process
Pipeline Flow
Input Query & Response
Generative Thinking Process
Final Verdict Generation
System Modules
Input Processing
Concatenates the reasoning problem, the model response, and a judging prompt template
Model or implementation: Libra-RM (Generative)
Thinking Generator
Generates a chain-of-thought rationale evaluating the correctness of the response
Model or implementation: Libra-RM (32B variants)
Verdict Generator
Outputs the final binary judgment (Correct/Incorrect) based on the thinking process
Model or implementation: Libra-RM
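The three modules above can be sketched as a minimal judging loop. This is a hedged illustration: the prompt wording, the `Verdict: Correct/Incorrect` output convention, and the helper names (`build_judge_prompt`, `parse_verdict`) are assumptions for demonstration, not the paper's actual template, and the Thinking Generator (a call to Libra-RM) is stubbed out.

```python
import re

# Hypothetical judging template; the paper's exact prompt is not given in the summary.
JUDGE_TEMPLATE = (
    "You are a strict verifier. Think step by step, then give a final verdict.\n"
    "Problem:\n{question}\n\n"
    "Candidate answer:\n{answer}\n\n"
    "After your reasoning, end with 'Verdict: Correct' or 'Verdict: Incorrect'."
)

def build_judge_prompt(question: str, answer: str) -> str:
    """Input Processing: concatenate the problem, the response, and a judging template."""
    return JUDGE_TEMPLATE.format(question=question, answer=answer)

def parse_verdict(generation: str) -> int:
    """Verdict Generator: map the model's final textual verdict to a binary reward."""
    m = re.search(r"Verdict:\s*(Correct|Incorrect)", generation, re.IGNORECASE)
    if m is None:
        return 0  # unparseable judgments are conservatively treated as incorrect
    return 1 if m.group(1).lower() == "correct" else 0

# The Thinking Generator would be a Libra-RM call here; we stub its output.
fake_generation = "The derivative of x^2 is 2x, matching the answer. Verdict: Correct"
prompt = build_judge_prompt("d/dx x^2 = ?", "2x")
print(parse_verdict(fake_generation))  # 1
```

The key design point is that the reward is derived from a generated text trace rather than a scalar head, so the judgment itself can be inspected and verified.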
Novel Architectural Elements
Application of learning-to-think (long CoT) specifically within the Reward Model architecture to produce verifiable judgments
Transformation of reasoning problems into verifiable judging problems (V2V) for RM training
Modeling
Base Model: Libra-RM-32B (likely based on Qwen/DeepSeek architectures given the baselines, but explicit base model not confirmed in snippet)
Training Method: Rejection Sampling and Reinforcement Learning (implied PPO/GRPO style for thinking)
Objective Functions:
Purpose: Optimize the generative reward model to produce correct judgments.
Formally: Not explicitly detailed in snippet (likely standard language modeling loss on verifiable correct judgments).
Purpose: Rejection sampling for data collection.
Formally: Verify generated judgments against ground truth labels derived from the Libra Bench pipeline.
Adaptation: Full fine-tuning (implied for RM)
Trainable Parameters: 32B
Training Data:
Data strategy: From Verifiable reasoning to Verifiable Judging (V2V)
Source: MATH-500, AIME 2024, AIME 2025 problems
Responses sampled from DeepSeek-R1, Qwen3-32B, QwQ-32B, etc.
Compute: Not reported in the paper
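The rejection-sampling stage described above can be sketched as a filter over sampled judgment traces: keep only those whose parsed verdict agrees with the ground-truth label from the V2V pipeline, then fine-tune on the survivors. A minimal sketch, assuming the paper's exact sampling and filtering details; the toy judge and parser stand in for Libra-RM and its verdict parser.

```python
import random

def sample_judgments(question, answer, k, judge_fn):
    """Draw k judgment traces (thinking + verdict) from the current reward model."""
    return [judge_fn(question, answer) for _ in range(k)]

def rejection_sample(dataset, k, judge_fn, parse_fn):
    """Keep only traces whose verdict matches the verified ground-truth label.

    dataset: iterable of (question, answer, label) with label in {0, 1},
    where labels come from checking answers against golden references (V2V).
    """
    kept = []
    for question, answer, label in dataset:
        for trace in sample_judgments(question, answer, k, judge_fn):
            if parse_fn(trace) == label:
                kept.append({"question": question, "answer": answer, "trace": trace})
    return kept

# Toy stand-ins for demonstration (the real judge is the reward model itself):
def toy_judge(question, answer):
    return random.choice(["... Verdict: Correct", "... Verdict: Incorrect"])

def toy_parse(trace):
    return 1 if trace.endswith("Correct") else 0

data = [("1+1=?", "2", 1), ("1+1=?", "3", 0)]
sft_data = rejection_sample(data, k=4, judge_fn=toy_judge, parse_fn=toy_parse)
```

The kept traces form the supervised fine-tuning set; the subsequent RL stage would then optimize the same verifiable judgment signal directly.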
Comparison to Prior Work
vs. Discriminative RMs: Libra-RM outputs text and reasoning traces, allowing for verification and better handling of complex logic
vs. Standard Generative RMs: Libra-RM utilizes 'learning-to-think' (CoT) specifically for the judging process, improving accuracy on hard tasks
vs. RewardBench: Libra Bench focuses specifically on complex reasoning (Math/AIME) with responses from advanced reasoning models, whereas RewardBench covers broader/easier tasks [not cited in paper]
Limitations
Reliance on mathematical problems for verifiable ground truth limits scope to STEM domains currently
Annotation of correctness relies partly on model-based evaluation (though verified), which could introduce bias
High computational cost for 'thinking' RMs compared to simple scalar discriminators
The Libra Bench dataset is available at https://huggingface.co/datasets/meituan/Libra-Bench. Neither code for the training pipeline nor specific training hyperparameters are detailed in the provided text. Model weights are referred to as the 'Libra-RM series', but no URL is given.
📊 Experiments & Results
Evaluation Setup
Pointwise correctness judging on complex math problems
Benchmarks:
Libra Bench (reasoning verification, math) [New]
Metrics:
Accuracy (Judgment Correctness)
Statistical methodology: Not explicitly reported in the paper
Key Results
Performance comparison on Libra Bench showing the advantage of thinking models over non-thinking models for reward modeling.

| Benchmark   | Metric   | Baseline | This Paper | Δ    |
|-------------|----------|----------|------------|------|
| Libra Bench | Accuracy | 69.1     | 78.7       | +9.6 |
Main Takeaways
Existing RM benchmarks are insufficient for evaluating reasoning capabilities due to lack of challenging problems and diverse advanced model responses
Models with 'thinking' capabilities significantly outperform traditional models on the Libra Bench, validating the learning-to-think approach for RMs
Strong correlation observed between Libra Bench accuracy and downstream RL performance, suggesting it is a reliable proxy for optimization
Current RMs (discriminative) generally underperform in reasoning scenarios compared to the proposed generative approach
📚 Prerequisite Knowledge
Prerequisites
Reinforcement Learning from Human Feedback (RLHF)
Chain-of-Thought (CoT) reasoning
Reward Modeling (Discriminative vs. Generative)
Rejection Sampling
Key Terms
Generative Reward Model: A reward model that outputs textual judgments (and potentially reasoning traces) rather than just a scalar score
Discriminative Reward Model: A standard reward model that outputs a scalar score representing quality/preference, usually via a classification head
V2V: From Verifiable reasoning to Verifiable judging—the strategy used to construct the benchmark by transforming math problems into judging tasks
CoT: Chain-of-Thought—a prompting method where the model generates intermediate reasoning steps before the final answer
Inference-time scaling: Improving model performance by allowing it to compute (think) for longer during generation
Rejection Sampling: A training technique where multiple outputs are generated, and only the best (verified) ones are kept for fine-tuning
RLVR: Reinforcement Learning from Verifiable Reward—using rule-based checks (like math answer matching) to train models
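To make the contrast with Libra-RM concrete, an RLVR-style rule-based reward is typically just final-answer matching against a golden reference. A hedged sketch, assuming string/numeric equivalence checking (the function name and normalization are illustrative); its brittleness with respect to answer formats is exactly the limitation that motivates generative, thinking reward models.

```python
from fractions import Fraction

def rule_based_reward(model_answer: str, golden: str) -> float:
    """RLVR-style check: match the final answer against the golden reference.

    Requires a rigid answer format and a labeled golden answer -- it cannot
    judge the reasoning trace itself, only the extracted final answer.
    """
    a, g = model_answer.strip(), golden.strip()
    if a == g:
        return 1.0
    try:
        # Tolerate equivalent numeric forms, e.g. "0.5" vs "1/2".
        return 1.0 if Fraction(a) == Fraction(g) else 0.0
    except (ValueError, ZeroDivisionError):
        return 0.0

print(rule_based_reward("1/2", "0.5"))  # 1.0
```

Anything non-numeric or differently formatted (e.g. "2x" vs "2*x") scores 0.0 even when semantically correct, which is why such rewards scale poorly beyond rigid-format, labeled problems.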