Reinforcement Learning with Conditional Expectation Reward

📝 Paper Summary

Reinforcement Learning with Verifiable Rewards (RLVR)
Reward Engineering for LLMs

Conditional Expectation Reward (CER) uses the language model itself as an implicit verifier by calculating the probability of generating the reference answer given the model's generated answer.

Core Problem

Existing RLVR methods rely on handcrafted rule-based verifiers that are difficult to construct for general domains with free-form answers and provide only binary feedback.

Why it matters:

Constructing reliable verifiers for domains like physics or finance is costly or infeasible due to diverse valid answer forms
Rule-based verifiers collapse semantically correct but lexically different answers into the 'incorrect' category
Binary feedback fails to reward partially correct answers, providing sparse learning signals

Concrete Example: For a question with reference answer '14', a model might generate '13' (close), '94' (far), or 'fourteen' (synonym). A rigid rule-based verifier assigns 0 reward to '13' and 'fourteen' if they don't match the specific rule, whereas CER assigns high reward to 'fourteen' and moderate reward to '13' based on the model's internal probability of regenerating '14'.
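
To make the contrast concrete, here is a minimal sketch of the binary exact-match baseline applied to the candidates from the example. The function name `exact_match_reward` is hypothetical and is not taken from the paper or its code; it only illustrates why every answer other than the literal string '14' receives zero reward.

```python
def exact_match_reward(answer: str, reference: str) -> float:
    """Binary reward: 1.0 only when the stripped strings are identical."""
    return 1.0 if answer.strip() == reference.strip() else 0.0


reference = "14"
candidates = ["14", "13", "94", "fourteen"]

# Every candidate except the literal reference string gets zero reward,
# including the near miss '13' and the synonym 'fourteen'.
rewards = {a: exact_match_reward(a, reference) for a in candidates}
print(rewards)
```

A graded reward such as CER is meant to separate these three zero-reward cases instead of collapsing them.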

Key Novelty

Self-Supervised Implicit Verification via Conditional Expectation

Instead of an external verifier, CER uses the model's own likelihood of generating the ground truth *after* it has generated a candidate answer
It acts as a soft relaxation of exact-match: if the generated answer is semantically consistent with the reference, the model assigns higher probability to the reference
Requires no auxiliary models or domain-specific rules, making it applicable to general reasoning tasks beyond math

Architecture

Illustration of the CER computation process in a tensorized form.

Evaluation Highlights

Outperforms exact-match and perplexity-based verifiers on general domain datasets (MMLU-Pro, SuperGPQA) with Qwen3-4B-Base and Qwen3-8B-Base
Achieves comparable performance to rule-based rewards on mathematical datasets (MATH500, AIME) without using any domain-specific rules
Combining CER with rule-based rewards (Rule+CER) yields the best overall performance, demonstrating complementary strengths

Breakthrough Assessment

8/10

Elegantly solves the 'hard verifier' bottleneck in RLVR by using the model itself. Theoretical grounding as a soft relaxation of exact-match is strong, and empirical results across diverse domains confirm generality.

⚙️ Technical Details

Problem Definition

Setting: Reinforcement Learning on reasoning tasks with available reference answers

Inputs: Question q, Reference Answer a*

Outputs: Generated Chain-of-Thought s, Generated Answer a

Pipeline Flow

Policy Model (Generates N solutions)
Implicit Verifier (Computes CER using reused samples)
Optimization (Updates Policy via RLOO)
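
The optimization step in the flow above relies on RLOO, whose baseline for each sample is the mean reward of the other samples drawn for the same question. The following is a minimal NumPy sketch of that advantage computation for a single question; the function name and the toy rewards are illustrative assumptions, not the paper's implementation.

```python
import numpy as np


def rloo_advantages(rewards: np.ndarray) -> np.ndarray:
    """Leave-one-out advantages for N rewards sampled for one question."""
    n = rewards.shape[0]
    if n < 2:
        raise ValueError("RLOO needs at least two samples per question")
    # Baseline for sample i is the mean reward of the other n - 1 samples.
    baseline = (rewards.sum() - rewards) / (n - 1)
    return rewards - baseline


# Toy soft rewards for four sampled solutions (illustrative values only).
advantages = rloo_advantages(np.array([1.0, 0.0, 0.5, 0.5]))
print(advantages)
```

Because each baseline excludes the sample it is applied to, the advantages for one question sum to zero, and a graded reward moves samples relative to their peers rather than against a fixed threshold.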

System Modules

Policy Model

Generate N distinct solutions (thought process s + final answer a) for a given question q

Model or implementation: Qwen3-4B-Base or Qwen3-8B-Base

Implicit Verifier

Compute the CER reward for each generated answer a_i by estimating P(a* | a_i, q)

Model or implementation: Same as Policy Model (Self-Verification)

Novel Architectural Elements

Self-verification loop where the generation samples are reused to compute a posterior probability reward signal (CER)
Tensorized reward computation reusing the N samples from the policy gradient step to estimate the conditional expectation without extra forward passes
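
As a rough illustration of the tensorized computation, the sketch below estimates CER for all sampled answers at once from two arrays of log-probabilities. It assumes these arrays have already been obtained from the model (that step is not shown), and it assumes P(a | q) is approximated by averaging P(a | s_j, q) over the same samples, which turns the estimate into a self-normalized weighted average. The function and argument names are hypothetical.

```python
import numpy as np


def cer_rewards(logp_answers: np.ndarray, logp_reference: np.ndarray) -> np.ndarray:
    """Self-normalized CER estimate for every sampled answer.

    logp_answers[i, j] = log P(a_i | s_j, q)   (shape N x M)
    logp_reference[j]  = log P(a* | s_j, q)    (shape M)
    """
    # Softmax over j yields weights P(a_i | s_j, q) / sum_k P(a_i | s_k, q),
    # computed with a max shift for numerical stability.
    shifted = logp_answers - logp_answers.max(axis=1, keepdims=True)
    weights = np.exp(shifted)
    weights /= weights.sum(axis=1, keepdims=True)
    # Weighted average of the reference likelihood under each reasoning path.
    return weights @ np.exp(logp_reference)


# Toy case: path s_1 favors answer a_1 (equal to a*), path s_2 favors a_2.
logp_answers = np.log(np.array([[0.9, 0.1],
                                [0.1, 0.9]]))
logp_reference = np.log(np.array([0.9, 0.1]))
rewards = cer_rewards(logp_answers, logp_reference)
print(rewards)
```

In this toy case the answer that agrees with the reference receives the higher reward, and both rewards stay between 0 and 1 because each is a convex combination of probabilities.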

Modeling

Base Model: Qwen3-4B-Base and Qwen3-8B-Base

Training Method: Reinforcement Learning (RLOO)

Objective Functions:

Purpose: Maximize expected reward.

Formally: J(θ) = E[ρ(a, a*)], where ρ is the Conditional Expectation Reward.

Purpose: Estimate CER empirically.

Formally: ρ(a, a*) ≈ (1/M) * Σ_{j=1..M} [ P(a* | s_j, q) * P(a | s_j, q) / P(a | q) ], where s_1, ..., s_M are the M sampled solutions for question q.
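
One way to read this estimator is as importance sampling combined with Bayes' rule. The derivation below is a sketch under two assumptions that this summary does not state explicitly: CER is taken to be the expectation of P(a* | s, q) over reasoning paths s drawn from the posterior P(s | a, q), and the samples s_j are drawn from the policy's prior P(s | q).

```latex
\begin{aligned}
\rho(a, a^*)
  &= \mathbb{E}_{s \sim P(\cdot \mid a, q)}\big[ P(a^* \mid s, q) \big] \\
  &= \sum_{s} P(a^* \mid s, q)\, \frac{P(a \mid s, q)\, P(s \mid q)}{P(a \mid q)} \\
  &\approx \frac{1}{M} \sum_{j=1}^{M} P(a^* \mid s_j, q)\,
     \frac{P(a \mid s_j, q)}{P(a \mid q)},
     \qquad s_j \sim P(\cdot \mid q), \\
P(a \mid q)
  &\approx \frac{1}{M} \sum_{j=1}^{M} P(a \mid s_j, q).
\end{aligned}
```

The second line applies Bayes' rule to P(s | a, q), and the third replaces the sum over s with a Monte Carlo average over the sampled paths. If P(a | q) is estimated from the same samples, as in the last line, the estimator becomes a weighted average of P(a* | s_j, q) with weights proportional to P(a | s_j, q).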

Adaptation: Full model update (implied by context)

Trainable Parameters: Not explicitly specified (assumed full parameters)

Training Data:

WebInstruct (General domain): 50K non-math questions
MATH-7.5K (Math domain): Standard math training set

Key Hyperparameters:

learning_rate: 1e-6
batch_size: 32 questions
number_of_solutions_N: 16
sample_size_M: 16
temperature_train: 1.0
top_p_train: 1.0
max_question_length: 2048
max_output_length_train: 4096
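
For reference, the listed values can be gathered into a single configuration object. This is only a sketch: the class name `CERTrainConfig` and the field names are hypothetical and do not come from the released code, and the derived count assumes every question in a batch receives N sampled solutions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CERTrainConfig:
    # Values copied from the hyperparameter list above; the class and field
    # names are hypothetical and not taken from the released code.
    learning_rate: float = 1e-6
    batch_size: int = 32              # questions per batch
    number_of_solutions_n: int = 16   # N solutions sampled per question
    sample_size_m: int = 16           # M samples used in the CER estimate
    temperature_train: float = 1.0
    top_p_train: float = 1.0
    max_question_length: int = 2048
    max_output_length_train: int = 4096


config = CERTrainConfig()
# Sampled solutions per batch, assuming N solutions for every question.
solutions_per_batch = config.batch_size * config.number_of_solutions_n
print(solutions_per_batch)  # 512
```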

Compute: Not reported in the paper

Comparison to Prior Work

vs. General-verifier: CER requires no external model; it is self-contained.
vs. VeriFree: CER conditions on the generated answer, providing a link between generation and reference, rather than just checking reference likelihood.
vs. Rule-based: CER is soft and domain-agnostic, handling free-form answers without manual rules.
vs. Exact Match: CER provides graded rewards for partial correctness or semantic similarity.

Limitations

Computational cost grows with the sample size M, although reusing the samples already drawn for policy optimization mitigates this
Relies on the model's internal calibration; if the model assigns poorly calibrated probabilities to a*, the reward may be noisy
May be less precise than symbolic verifiers for strict mathematical equivalence if the model's similarity estimation is imperfect

Reproducibility

Code: https://github.com/changyi7231/CER

Training used 5 epochs for MATH and 1 epoch for WebInstruct. Hyperparameters such as learning rate and batch size are provided.

📊 Experiments & Results

Evaluation Setup

Reasoning tasks across mathematical and general domains.

Benchmarks:

MATH500 (Mathematical Reasoning)
AMC23 (Mathematical Reasoning)
AIME2024 (Mathematical Reasoning)
AIME2025 (Mathematical Reasoning)
SuperGPQA (General Domain Reasoning, Science/Academic)
MMLU-Pro (General Domain Reasoning)

Metrics:

pass@1
Statistical methodology: Average performance over 16 evaluation runs reported for mathematical datasets.
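
The averaging described above can be sketched as follows, assuming a 0/1 correctness matrix with one row per evaluation run and one column per problem. The function name and the toy data are illustrative and do not come from the paper's evaluation code.

```python
import numpy as np


def mean_pass_at_1(correct: np.ndarray) -> float:
    """correct[r, p] is 1 if run r answered problem p correctly, else 0."""
    per_run = correct.mean(axis=1)      # pass@1 of each run, as a fraction
    return float(per_run.mean() * 100)  # average over runs, in percent


# Toy data: 2 runs over 4 problems (the paper reports averages over 16 runs).
correct = np.array([[1, 0, 1, 1],
                    [1, 0, 0, 1]])
score = mean_pass_at_1(correct)
print(score)  # 62.5
```

Averaging over several runs reduces the variance that comes from sampling a single answer per problem, which matters most on small benchmarks.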

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
General Domain Results (trained on WebInstruct) show CER consistently outperforming baselines on MMLU-Pro and SuperGPQA.
MMLU-Pro	pass@1	47.5	48.1	+0.6
SuperGPQA	pass@1	32.8	33.5	+0.7
Mathematical Domain Results (trained on MATH-7.5K) show CER is competitive with highly specific Rule-based verifiers and outperforms model-based verifiers.
MATH500	pass@1	59.2	58.6	-0.6
MATH500	pass@1	59.2	60.1	+0.9

Main Takeaways

CER is domain-agnostic: It works well on both math and general reasoning without changing the formulation.
CER provides denser signals than Exact Match: Soft rewards allow learning from partially correct or semantically equivalent answers that fail strict string matching.
Efficiency via sample reuse: Tensorized computation allows CER to be computed using the same samples generated for exploration, adding negligible training overhead.
Complementarity: CER combines effectively with rule-based verifiers (Rule+CER) to boost performance further, correcting the sparsity of rules with the softness of CER.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning with Verifiable Rewards (RLVR)
Policy Gradient methods (REINFORCE / RLOO)
Large Language Models (autoregressive generation)
Bayes' rule and importance sampling

Key Terms

RLVR: Reinforcement Learning with Verifiable Rewards—RL where rewards are determined by an objective, deterministic verifier (usually rule-based)

CER: Conditional Expectation Reward—the expected likelihood of generating the reference answer conditioned on the generated answer

Pass@1: A metric measuring the percentage of problems where the model's first generated answer is correct

RLOO: REINFORCE Leave-One-Out—a policy gradient estimator that reduces variance by using the average reward of other samples as a baseline

Exact Match: A binary verification method checking if the generated answer string is identical to the reference string

Posterior Predictive Probability: The probability of observing new data (the reference answer) given observed data (the generated answer) under the model

Importance Sampling: A technique to estimate properties of a distribution using samples from a different distribution, used here to compute CER efficiently