Libra introduces a reasoning-focused reward model benchmark and a generative reward model that uses Chain-of-Thought reasoning to verify answers, overcoming limitations of discriminative scoring.
Core Problem
Current reward models (RMs) struggle with complex reasoning tasks because they lack 'thinking' capabilities and rely on scalar scores that don't align with true correctness.
Why it matters:
Predominant RL training relies on rule-based rewards requiring rigid formats and golden answers, hindering scaling with unlabeled data
Existing RM benchmarks lack challenging questions and diverse responses from advanced reasoning models (like DeepSeek-R1), failing to accurately assess reasoning capabilities
Standard discriminative RMs cannot effectively verify complex logic, acting as weak proxies for human judgment in hard reasoning scenarios
Concrete Example: In complex math problems where the final answer is correct but the logic is flawed (or vice versa), a standard discriminative RM may assign a high score based on surface features. Libra-RM instead runs a 'thinking' process that verifies the reasoning step by step before outputting a judgment.
Key Novelty
Learning-to-Think for Generative Reward Models
Treats the judging process as a verifiable reasoning task itself, where the Reward Model generates a Chain-of-Thought explanation before its final verdict
Constructs a benchmark (Libra Bench) specifically designed to test RMs on hard math problems with responses from advanced reasoning models
Optimizes the Reward Model using rejection sampling and reinforcement learning, mirroring the successful training recipes of reasoning LLMs
Architecture
The overall framework for Libra Bench curation and Libra-RM training
Evaluation Highlights
Libra-RM series achieves state-of-the-art results on reasoning-oriented benchmarks compared to existing RMs
Reasoning models with 'thinking' capabilities (73.7%-78.7% accuracy) significantly outperform non-thinking models (55.1%-69.1%) on Libra Bench
Demonstrates strong correlation between performance on Libra Bench and downstream RL application performance
Breakthrough Assessment
8/10
Strong contribution by applying 'thinking' (inference-time scaling) to the Reward Model itself, rather than just the policy model. The curated benchmark addresses a critical gap in evaluating reasoning RMs.
⚙️ Technical Details
Problem Definition
Setting: Evaluation of Reward Models on complex reasoning tasks using a V2V (Verifiable reasoning to Verifiable judging) pipeline
Inputs: A reasoning problem q, a candidate answer a, and a golden reference (for benchmark construction)
Outputs: A binary judgment of correctness (0 or 1) derived from a generative reasoning process
Pipeline Flow
Input Query & Response
Generative Thinking Process
Final Verdict Generation
System Modules
Input Processing
Concatenates the reasoning problem, the model response, and a judging prompt template
Model or implementation: Libra-RM (Generative)
Thinking Generator
Generates a chain-of-thought rationale evaluating the correctness of the response
Model or implementation: Libra-RM (32B variants)
Verdict Generator
Outputs the final binary judgment (Correct/Incorrect) based on the thinking process
Model or implementation: Libra-RM
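The three modules above can be sketched as a minimal judging loop. This is a hedged illustration: the prompt wording, the `Verdict: Correct/Incorrect` output convention, and the helper names (`build_judge_prompt`, `parse_verdict`) are assumptions for demonstration, not the paper's actual template, and the Thinking Generator (a call to Libra-RM) is stubbed out.

```python
import re

# Hypothetical judging template; the paper's exact prompt is not given in the summary.
JUDGE_TEMPLATE = (
    "You are a strict verifier. Think step by step, then give a final verdict.\n"
    "Problem:\n{question}\n\n"
    "Candidate answer:\n{answer}\n\n"
    "After your reasoning, end with 'Verdict: Correct' or 'Verdict: Incorrect'."
)

def build_judge_prompt(question: str, answer: str) -> str:
    """Input Processing: concatenate the problem, the response, and a judging template."""
    return JUDGE_TEMPLATE.format(question=question, answer=answer)

def parse_verdict(generation: str) -> int:
    """Verdict Generator: map the model's final textual verdict to a binary reward."""
    m = re.search(r"Verdict:\s*(Correct|Incorrect)", generation, re.IGNORECASE)
    if m is None:
        return 0  # unparseable judgments are conservatively treated as incorrect
    return 1 if m.group(1).lower() == "correct" else 0

# The Thinking Generator would be a Libra-RM call here; we stub its output.
fake_generation = "The derivative of x^2 is 2x, matching the answer. Verdict: Correct"
prompt = build_judge_prompt("d/dx x^2 = ?", "2x")
print(parse_verdict(fake_generation))  # 1
```

The key design point is that the reward is derived from a generated text trace rather than a scalar head, so the judgment itself can be inspected and verified.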
Novel Architectural Elements
Application of learning-to-think (long CoT) specifically within the Reward Model architecture to produce verifiable judgments
Transformation of reasoning problems into verifiable judging problems (V2V) for RM training
Modeling
Base Model: Libra-RM-32B (likely based on Qwen/DeepSeek architectures given the baselines, but explicit base model not confirmed in snippet)
Training Method: Rejection Sampling and Reinforcement Learning (implied PPO/GRPO style for thinking)
Objective Functions:
Purpose: Optimize the generative reward model to produce correct judgments.
Formally: Not explicitly detailed in snippet (likely standard language modeling loss on verifiable correct judgments).
Purpose: Rejection sampling for data collection.
Formally: Verify generated judgments against ground truth labels derived from the Libra Bench pipeline.
Adaptation: Full fine-tuning (implied for RM)
Trainable Parameters: 32B
Training Data:
Data strategy: From Verifiable reasoning to Verifiable Judging (V2V)
Source: MATH-500, AIME 2024, AIME 2025 problems
Responses sampled from DeepSeek-R1, Qwen3-32B, QwQ-32B, etc.
Compute: Not reported in the paper
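The rejection-sampling stage described above can be sketched as a filter over sampled judgment traces: keep only those whose parsed verdict agrees with the ground-truth label from the V2V pipeline, then fine-tune on the survivors. A minimal sketch, assuming the paper's exact sampling and filtering details; the toy judge and parser stand in for Libra-RM and its verdict parser.

```python
import random

def sample_judgments(question, answer, k, judge_fn):
    """Draw k judgment traces (thinking + verdict) from the current reward model."""
    return [judge_fn(question, answer) for _ in range(k)]

def rejection_sample(dataset, k, judge_fn, parse_fn):
    """Keep only traces whose verdict matches the verified ground-truth label.

    dataset: iterable of (question, answer, label) with label in {0, 1},
    where labels come from checking answers against golden references (V2V).
    """
    kept = []
    for question, answer, label in dataset:
        for trace in sample_judgments(question, answer, k, judge_fn):
            if parse_fn(trace) == label:
                kept.append({"question": question, "answer": answer, "trace": trace})
    return kept

# Toy stand-ins for demonstration (the real judge is the reward model itself):
def toy_judge(question, answer):
    return random.choice(["... Verdict: Correct", "... Verdict: Incorrect"])

def toy_parse(trace):
    return 1 if trace.endswith("Correct") else 0

data = [("1+1=?", "2", 1), ("1+1=?", "3", 0)]
sft_data = rejection_sample(data, k=4, judge_fn=toy_judge, parse_fn=toy_parse)
```

The kept traces form the supervised fine-tuning set; the subsequent RL stage would then optimize the same verifiable judgment signal directly.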
Comparison to Prior Work
vs. Discriminative RMs: Libra-RM outputs text and reasoning traces, allowing for verification and better handling of complex logic
vs. Standard Generative RMs: Libra-RM utilizes 'learning-to-think' (CoT) specifically for the judging process, improving accuracy on hard tasks
vs. RewardBench: Libra Bench focuses specifically on complex reasoning (Math/AIME) with responses from advanced reasoning models, whereas RewardBench covers broader/easier tasks [not cited in paper]
Limitations
Reliance on mathematical problems for verifiable ground truth limits scope to STEM domains currently
Annotation of correctness relies partly on model-based evaluation (though verified), which could introduce bias
High computational cost for 'thinking' RMs compared to simple scalar discriminators
The Libra Bench dataset is available at https://huggingface.co/datasets/meituan/Libra-Bench. Neither code for the training pipeline nor specific training hyperparameters are detailed in the provided text. Model weights are referred to as the 'Libra-RM series', but no URL is given.
📊 Experiments & Results
Evaluation Setup
Pointwise correctness judging on complex math problems
Benchmarks:
Libra Bench (reasoning verification, math) [New]
Metrics:
Accuracy (Judgment Correctness)
Statistical methodology: Not explicitly reported in the paper
Key Results
Performance comparison on Libra Bench showing the advantage of thinking models over non-thinking models for reward modeling.

| Benchmark   | Metric   | Baseline | This Paper | Δ    |
|-------------|----------|----------|------------|------|
| Libra Bench | Accuracy | 69.1     | 78.7       | +9.6 |
Main Takeaways
Existing RM benchmarks are insufficient for evaluating reasoning capabilities due to lack of challenging problems and diverse advanced model responses
Models with 'thinking' capabilities significantly outperform traditional models on the Libra Bench, validating the learning-to-think approach for RMs
Strong correlation observed between Libra Bench accuracy and downstream RL performance, suggesting it is a reliable proxy for optimization
Current RMs (discriminative) generally underperform in reasoning scenarios compared to the proposed generative approach
📚 Prerequisite Knowledge
Prerequisites
Reinforcement Learning from Human Feedback (RLHF)
Chain-of-Thought (CoT) reasoning
Reward Modeling (Discriminative vs. Generative)
Rejection Sampling
Key Terms
Generative Reward Model: A reward model that outputs textual judgments (and potentially reasoning traces) rather than just a scalar score
Discriminative Reward Model: A standard reward model that outputs a scalar score representing quality/preference, usually via a classification head
V2V: From Verifiable reasoning to Verifiable judging—the strategy used to construct the benchmark by transforming math problems into judging tasks
CoT: Chain-of-Thought—a prompting method where the model generates intermediate reasoning steps before the final answer
Inference-time scaling: Improving model performance by allowing it to compute (think) for longer during generation
Rejection Sampling: A training technique where multiple outputs are generated, and only the best (verified) ones are kept for fine-tuning
RLVR: Reinforcement Learning from Verifiable Reward—using rule-based checks (like math answer matching) to train models
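To make the contrast with Libra-RM concrete, an RLVR-style rule-based reward is typically just final-answer matching against a golden reference. A hedged sketch, assuming string/numeric equivalence checking (the function name and normalization are illustrative); its brittleness with respect to answer formats is exactly the limitation that motivates generative, thinking reward models.

```python
from fractions import Fraction

def rule_based_reward(model_answer: str, golden: str) -> float:
    """RLVR-style check: match the final answer against the golden reference.

    Requires a rigid answer format and a labeled golden answer -- it cannot
    judge the reasoning trace itself, only the extracted final answer.
    """
    a, g = model_answer.strip(), golden.strip()
    if a == g:
        return 1.0
    try:
        # Tolerate equivalent numeric forms, e.g. "0.5" vs "1/2".
        return 1.0 if Fraction(a) == Fraction(g) else 0.0
    except (ValueError, ZeroDivisionError):
        return 0.0

print(rule_based_reward("1/2", "0.5"))  # 1.0
```

Anything non-numeric or differently formatted (e.g. "2x" vs "2*x") scores 0.0 even when semantically correct, which is why such rewards scale poorly beyond rigid-format, labeled problems.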