Beyond Correctness: Learning Robust Reasoning via Transfer

📝 Paper Summary

Reinforcement Learning for Reasoning LLM Post-training

RLTR enhances reasoning robustness by rewarding the generator when its truncated reasoning prefix allows a separate receiver model to successfully solve the problem.

Core Problem

Reinforcement Learning with Verifiable Rewards (RLVR) optimizes only final-answer correctness, often producing brittle or idiosyncratic reasoning traces that fail to generalize or transfer.

Why it matters:

Models optimized solely for outcome correctness (RLVR) show degraded consistency (Maj@K) as the number of samples increases
Robust reasoning should be reusable and interpretable by others, not just a lucky path to the right answer found by a specific model
Current methods lack incentives for intermediate reasoning quality without expensive step-level human annotations (PRMs)

Concrete Example: On MATH-500, a standard RLVR model achieves high single-sample accuracy, but its consistency drops at higher sample counts: Maj@16 falls from 81.2 for the base model to 80.2 after RLVR. This suggests the reasoning is fragile. RLTR addresses this by rewarding prefixes that are stable enough for a second model to complete.

Key Novelty

Reinforcement Learning with Transferable Reward (RLTR)

Operationalizes 'reasoning quality' as 'transferability': a reasoning trace is robust if a different model (Receiver) can finish it correctly after it is truncated.
Introduces a Transfer Reward computed by truncating the Generator's output and checking if the Receiver reaches the correct answer from that prefix.
Combines standard answer correctness rewards with this new transfer signal to optimize the Generator via Group Relative Policy Optimization (GRPO).

Architecture

Overview of the RLTR framework comparing it to standard RLVR. Shows the pipeline where generator output is truncated and passed to a receiver.

Evaluation Highlights

+5.8 points in Maj@64 on the AMC23 benchmark compared to RLVR (61.7% to 67.5%), demonstrating superior consistency.
Matches RLVR's average accuracy on MATH-500 with approximately 2.5x fewer training steps, indicating significantly higher sample efficiency.
+4.4 points in Maj@64 on AIME 2024 (16.7% to 21.1%) and +5.0 points in average accuracy (9.8% to 14.8%) compared to RLVR.

Breakthrough Assessment

8/10

A clever method to enforce process quality without process-level supervision. By using a second model as a verifier of 'explainability', it improves robustness and efficiency without human process labels.

⚙️ Technical Details

Problem Definition

Setting: Mathematical reasoning generation where ground truth answers are available for verification

Inputs: Math problem x

Outputs: Reasoning trace and final answer y

Pipeline Flow

Generation: Generator produces full reasoning
Transfer Check: Truncator cuts reasoning -> Receiver continues -> Transfer Reward calculated
Optimization: GRPO updates Generator using combined rewards
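
The Generation and Transfer Check stages above can be sketched as a single pass that returns the two binary rewards. This is a minimal illustration under stated assumptions, not the authors' implementation: `generate`, `truncate`, `receive`, and `extract_answer` are hypothetical callables standing in for the Generator, the Truncator, the frozen Receiver, and an answer parser.

```python
from typing import Callable, Dict


def transfer_check(
    problem: str,
    ground_truth: str,
    generate: Callable[[str], str],
    truncate: Callable[[str], str],
    receive: Callable[[str, str], str],
    extract_answer: Callable[[str], str],
) -> Dict[str, float]:
    """Run one Generation -> Transfer Check pass and return binary rewards.

    All four callables are hypothetical stand-ins (assumptions), not a real API.
    """
    trace = generate(problem)                # Generator: full reasoning and answer
    prefix = truncate(trace)                 # Truncator: keep only a prefix
    continuation = receive(problem, prefix)  # Receiver: continue from the prefix
    r_ans = float(extract_answer(trace) == ground_truth)
    r_trans = float(extract_answer(continuation) == ground_truth)
    return {"r_ans": r_ans, "r_trans": r_trans}


# Toy stubs, for illustration only.
def toy_generate(problem: str) -> str:
    return "2 + 2 = 4. Answer: 4"


def toy_truncate(trace: str) -> str:
    return trace[: len(trace) // 2]


def toy_receive(problem: str, prefix: str) -> str:
    return " so the sum is 4. Answer: 4"


def toy_extract(text: str) -> str:
    return text.rsplit("Answer:", 1)[-1].strip()


rewards = transfer_check(
    "What is 2 + 2?", "4", toy_generate, toy_truncate, toy_receive, toy_extract
)
```

Passing the components in as callables keeps the sketch independent of any particular inference library; in a real setup the Receiver call would be a batched forward pass of the frozen model.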

System Modules

Generator

Generate full reasoning trace and final answer

Model or implementation: Qwen2.5-7B-Instruct

Truncator (Transfer Check)

Truncate the generated reasoning at a random ratio to test prefix robustness

Model or implementation: Rule-based logic
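
One possible rule-based truncator is sketched below. The summary does not say whether truncation is measured in tokens, characters, or reasoning steps, so cutting on whitespace-separated words is an assumption, as are the function name `truncate_trace` and its parameters. The default range mirrors the reported truncation ratio, Uniform(0.3, 0.9).

```python
import random
from typing import Optional


def truncate_trace(
    trace: str,
    low: float = 0.3,
    high: float = 0.9,
    rng: Optional[random.Random] = None,
) -> str:
    """Keep the first tau fraction of the trace, with tau ~ Uniform(low, high).

    Assumption: the unit of truncation is a whitespace-separated word.
    """
    rng = rng or random.Random()
    tau = rng.uniform(low, high)
    words = trace.split()
    keep = max(1, int(len(words) * tau))  # always keep at least one word
    return " ".join(words[:keep])


example_trace = " ".join(f"w{i}" for i in range(100))
example_prefix = truncate_trace(example_trace, rng=random.Random(0))
```

Passing a seeded `random.Random` makes the cut point reproducible, which helps when debugging reward computation.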

Receiver (Transfer Check)

Attempt to solve the problem starting from the generator's truncated prefix

Model or implementation: Qwen2.5-3B-Instruct (frozen)

Novel Architectural Elements

Integration of a 'Receiver' model into the reward loop to calculate a Transfer Reward based on cross-model completion success
Dynamic truncation mechanism during RL training to test reasoning robustness at arbitrary steps

Modeling

Base Model: Qwen2.5-7B-Instruct

Training Method: Group Relative Policy Optimization (GRPO)

Objective Functions:

Purpose: Maximize expected reward consisting of answer correctness, transferability, and formatting.

Formally: J = E[R_total(y)]

Purpose: Reward if the generator's final answer is correct.

Formally: R_ans = 1[answer(y_gen) = y_gt]

Purpose: Reward if the receiver can find the correct answer from the generator's prefix.

Formally: R_trans = 1[answer(y_rcv) = y_gt]

Purpose: Enforce valid structure.

Formally: R_fmt (format reward)

Key Hyperparameters:

reward_weight_answer (a): 0.1
reward_weight_transfer (t): 1.0
truncation_ratio_tau: Uniform(0.3, 0.9)
temperature: 1.0
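
To see how the reported weights shape the signal, the sketch below combines the three rewards as a weighted sum. The additive form is an assumption (the summary only says the rewards are combined), and `w_fmt` is a hypothetical format weight that the summary does not report, so it is left as a required argument rather than given a default.

```python
def total_reward(
    r_ans: float,
    r_trans: float,
    r_fmt: float,
    w_fmt: float,
    a: float = 0.1,
    t: float = 1.0,
) -> float:
    """Weighted sum of answer, transfer, and format rewards.

    Assumptions: additive combination; w_fmt is a hypothetical weight.
    Defaults a=0.1 and t=1.0 are the reported answer and transfer weights.
    """
    return a * r_ans + t * r_trans + w_fmt * r_fmt


# Correct answer whose prefix also transfers vs. correct answer that does not.
transferable = total_reward(r_ans=1.0, r_trans=1.0, r_fmt=0.0, w_fmt=0.0)
lucky_only = total_reward(r_ans=1.0, r_trans=0.0, r_fmt=0.0, w_fmt=0.0)
```

Under these assumptions a correct but non-transferable trace earns only a small fraction of the reward of a transferable one, so the transfer term dominates the learning signal.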

Compute: Increases training FLOPs by ~7% per step due to Receiver inference, but requires ~2.5x fewer steps for convergence, reducing total training cost.
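
The net effect of the two figures above can be checked with a short calculation. Both inputs are the approximate values quoted in the summary (~7% per-step overhead, ~2.5x fewer steps), so the result is a rough estimate; the function name is illustrative.

```python
def relative_total_cost(per_step_overhead: float, step_reduction: float) -> float:
    """Total training cost relative to the baseline.

    per_step_overhead: fractional extra FLOPs per step (0.07 means +7%).
    step_reduction: factor by which the number of steps shrinks (2.5 means 2.5x fewer).
    """
    return (1.0 + per_step_overhead) / step_reduction


ratio = relative_total_cost(0.07, 2.5)
print(f"{ratio:.3f}")  # prints 0.428
```

In other words, roughly 43% of the baseline's total training compute, assuming both approximations hold.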

Comparison to Prior Work

vs. RLVR: RLTR adds a transfer reward signal, improving consistency and preventing 'lucky' correct answers.
vs. PRMs: RLTR requires no step-level human annotations or separate reward model training; it uses the Receiver model as a zero-shot verifier.
vs. Sheppard et al. (2024) [not cited in paper]: Similar to 'separating reasoning from answering', but uses cross-model transfer rather than self-consistency.

Limitations

Requires an additional Receiver model during training, increasing per-step computational cost.
Depends on the capability of the Receiver model; a weak Receiver might fail even on good reasoning.
Currently evaluated primarily on mathematical and scientific reasoning tasks with objective ground truth.
Does not strictly enforce human-readable explanations, only machine-transferable ones.

Reproducibility

Code availability is not provided. Model checkpoints (Qwen2.5) are public. Method relies on standard RLVR/GRPO setup with an added inference step for the Receiver.

📊 Experiments & Results

Evaluation Setup

Mathematical and scientific reasoning tasks evaluated via sampling

Benchmarks:

MATH-500 (Moderate difficulty math problems)
GSM8K (Grade school math word problems)
AMC23 (Competition-level math, hard)
AIME 2024 (Competition-level math, very hard)
GPQA (Scientific reasoning (out-of-domain))

Metrics:

Maj@K (Majority Voting at K)
Average Accuracy
Transferability (accuracy of receiver on prefixes)
Statistical methodology: Three runs with different random seeds, reporting averages
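
The first two metrics can be computed from K sampled final answers as in the sketch below. Function names are illustrative, and tie-breaking in the majority vote (here, the first answer encountered wins) is an assumption, since the summary does not specify it.

```python
from collections import Counter
from typing import Sequence


def maj_at_k(answers: Sequence[str], ground_truth: str) -> bool:
    """True if the most frequent of the K sampled answers equals the ground truth."""
    if not answers:
        return False
    # Counter.most_common orders ties by first occurrence (an assumed tie-break).
    top_answer, _ = Counter(answers).most_common(1)[0]
    return top_answer == ground_truth


def average_accuracy(answers: Sequence[str], ground_truth: str) -> float:
    """Fraction of the K sampled answers that are individually correct."""
    if not answers:
        return 0.0
    return sum(a == ground_truth for a in answers) / len(answers)


samples = ["4", "4", "5", "4", "7", "5", "4", "4"]  # K = 8 toy samples
maj_correct = maj_at_k(samples, "4")
avg_acc = average_accuracy(samples, "4")
```

In the toy samples, five of eight answers are correct, so average accuracy is 0.625 while the majority vote is correct; the two metrics can diverge, which is why the paper reports both.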

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Performance on hard math benchmarks shows RLTR significantly improves consistency (Maj@K) over RLVR.
AMC23	Maj@64	61.7	67.5	+5.8
AIME 2024	Maj@64	16.7	21.1	+4.4
AIME 2024	Average Accuracy	9.8	14.8	+5.0
Results on moderate benchmarks show RLTR fixes RLVR's consistency degradation at high K.
MATH-500	Maj@64	82.6	84.2	+1.6
GSM8K	Average Accuracy	89.1	92.0	+2.9
Computational efficiency analysis (lower is better).
MATH-500	EFLOPs (ExaFLOPs) to reach convergence	92.75	39.76	-52.99

Experiment Figures

Training dynamics comparing RLTR and RLVR on MATH-500 over training steps.

Transfer accuracy across different truncation ratios (0.1 to 0.9).

Main Takeaways

RLTR consistently improves Maj@K across all benchmarks, particularly on harder tasks (AMC23, AIME), indicating more robust reasoning.
The method is highly sample-efficient, matching RLVR performance with ~2.5x fewer steps despite a slight per-step compute increase.
Transferability correlates with Maj@K: as models learn to generate transferable prefixes, their self-consistency improves.
Generalizes to scientific reasoning (GPQA) and works across different model families (Llama-3), showing the method is not architecture-specific.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning with Verifiable Rewards (RLVR)
Group Relative Policy Optimization (GRPO)
Language Model Sampling (Maj@K)

Key Terms

RLVR: Reinforcement Learning with Verifiable Rewards, which optimizes models using ground-truth correctness (e.g., math answers) rather than a learned reward model

Transfer Reward: A novel reward signal measuring whether a truncated reasoning prefix from the generator can be successfully completed by a receiver model

Receiver Model: A separate, frozen language model used to evaluate the transferability of the generator's reasoning prefixes

Maj@K: Majority Voting at K, a metric that samples K solutions, takes the most frequent final answer, and checks whether it matches the ground truth; measures consistency

Pass@K: A metric measuring the probability that at least one of K sampled solutions is correct; checks diversity

GRPO: Group Relative Policy Optimization—an RL algorithm that normalizes advantages within a group of samples for the same input, stabilizing training without a value network
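
The group-wise normalization in the GRPO definition above can be sketched as follows. This shows only the advantage computation, not the full policy update; the use of the population standard deviation and the `eps` constant are assumptions, and the reward values are toy numbers.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(
    rewards: Sequence[float], eps: float = 1e-6
) -> List[float]:
    """Normalize rewards within one group of samples for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so no
    learned value network is needed as a baseline.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population standard deviation
    return [(r - mu) / (sigma + eps) for r in rewards]


group_rewards = [1.1, 0.1, 0.0, 1.1]  # toy total rewards for four samples
advantages = group_relative_advantages(group_rewards)
```

Samples above the group mean receive positive advantages and those below receive negative ones; when every sample in a group gets the same reward, all advantages are zero and that group contributes no gradient signal.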