RLVR: Reinforcement Learning with Verifiable Rewards—RL where the reward signal comes from a deterministic check (e.g., math answer correctness)
Recall phase: The RL training phase where the model optimizes reasoning paths within its existing capabilities, typically reducing entropy/exploration
Extend phase: The SFT training phase where the model learns new reasoning patterns from external teacher data, increasing the explorable space
GRPO: Group Relative Policy Optimization—a policy gradient method that normalizes advantages within a group of sampled outputs for the same query, removing the need for a separate value function
SFT: Supervised Fine-Tuning—training on labeled (query, response) pairs via maximum likelihood estimation
Entropy: A measure of uncertainty in the model's predictions; high entropy implies diverse exploration, low entropy implies confidence or collapse
Policy Shift: Adjusting the assumed probability that the offline policy assigns to its data (pi_offline) in the importance sampling weight, based on the model's current performance
Importance Sampling: A technique to estimate properties of a target distribution while sampling from a different distribution, using weights to correct for the difference
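The GRPO entry above can be illustrated with a minimal sketch of its group-relative advantage computation: sample several responses to one query, score each with a verifiable reward, and normalize within the group so no separate value function is needed. Function names, the group size, and the epsilon constant here are illustrative choices, not from the source.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize one group's rewards to zero mean / unit std.

    Each reward comes from a sampled response to the SAME query;
    the normalized value serves as that response's advantage.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# e.g. 4 sampled answers to one math query; a verifiable reward
# of 1.0 means the deterministic check passed, 0.0 means it failed
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
# correct answers receive positive advantage, incorrect ones negative
```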
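The entropy entry can likewise be made concrete: Shannon entropy over a next-token distribution is high when probability mass is spread out (diverse exploration) and low when it concentrates on one token (confidence or collapse). The example distributions below are illustrative.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# a peaked (low-entropy, "collapsed") vs. a flat (high-entropy)
# next-token distribution over a 4-token vocabulary
peaked = [0.97, 0.01, 0.01, 0.01]
flat = [0.25, 0.25, 0.25, 0.25]
# entropy(flat) = log(4) ≈ 1.386 nats, the maximum for 4 outcomes;
# entropy(peaked) is much smaller
```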
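As a sketch of the importance sampling entry: to estimate an expectation under a target distribution p while drawing samples from a different proposal q, weight each sample by w = p(x)/q(x). The Gaussians and sample count below are my own toy choices (they are not the pi_offline setting from the glossary, where the weight would instead be pi_current/pi_offline over response tokens).

```python
import math
import random

random.seed(0)

def normal_pdf(x, mu):
    """Density of a unit-variance Gaussian centered at mu."""
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)

# Target p = N(1, 1), proposal q = N(0, 1); estimate E_p[x] (true value 1.0)
# while sampling only from q, correcting with weights w = p(x) / q(x).
samples = [random.gauss(0.0, 1.0) for _ in range(200_000)]
weights = [normal_pdf(x, 1.0) / normal_pdf(x, 0.0) for x in samples]
estimate = sum(w * x for w, x in zip(weights, samples)) / len(samples)
# estimate ≈ 1.0, the mean of the target distribution
```

The same correction underlies the Policy Shift entry: changing the assumed pi_offline directly rescales these weights, so data the current model already handles well can be down-weighted.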