RLVR: Reinforcement Learning with Verifiable Rewards—a post-training method that uses ground-truth verifiers (e.g., code execution, math answer checking) to reward and guide LLMs.
GRPO: Group Relative Policy Optimization—an RL algorithm that estimates advantages by comparing a group of outputs for the same prompt, removing the need for a separate value network.
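The group-relative advantage at the heart of GRPO can be sketched in a few lines: each sampled output's reward is normalized against the mean and standard deviation of its own group, so no value network is required. A minimal illustration (function name is hypothetical; real implementations operate on batched tensors):

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages: normalize each reward against the
    group's mean and std, replacing a learned value-network baseline."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against an all-equal group
    return [(r - mean) / std for r in rewards]

# Four sampled completions of one prompt, scored by a binary verifier:
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))  # -> [1.0, -1.0, -1.0, 1.0]
```

Correct answers get positive advantages and incorrect ones negative, purely by comparison within the group.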
DAPO: Decoupled Clip and Dynamic Sampling Policy Optimization—an enhancement of GRPO using techniques such as decoupled (asymmetric) clipping, dynamic sampling, and a token-level loss.
Entropy: A measure of uncertainty in the model's next-token prediction; high entropy suggests branching/reasoning points, low entropy suggests factual/syntactic completion.
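To make the high/low distinction concrete, here is Shannon entropy over a next-token distribution (a sketch; the example probabilities are invented):

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Near-deterministic continuation (e.g., closing a bracket): low entropy.
low = token_entropy([0.98, 0.01, 0.01])
# Several equally plausible continuations (a branching point): high entropy.
high = token_entropy([0.25, 0.25, 0.25, 0.25])  # = ln(4) ≈ 1.386
```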
KL divergence: A penalty term used in RL to prevent the trained model from drifting too far from the reference model (usually the SFT model).
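In practice the penalty is estimated per token from log-probabilities. One common nonnegative estimator (the "k3" form popularized by Schulman and used in GRPO-style training; the function name here is hypothetical):

```python
import math

def kl_per_token(logp_policy, logp_ref):
    """Per-token KL estimate exp(r) - r - 1 with r = logp_ref - logp_policy.
    Averaged over sampled tokens, this approximates KL(pi || pi_ref);
    it is zero when the policies agree and grows as they drift apart."""
    r = logp_ref - logp_policy
    return math.exp(r) - r - 1.0

print(kl_per_token(-2.0, -2.0))  # -> 0.0 (no drift, no penalty)
print(kl_per_token(-1.0, -2.5))  # positive: the policy has drifted
```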
Pass@1: The percentage of problems where the model generates a correct solution on its first attempt.
Pass@K: The probability that at least one of K generated solutions is correct.
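Pass@K is usually computed with the unbiased estimator of Chen et al. (2021): generate n ≥ k samples, count the c correct ones, and estimate the chance that a random size-k subset contains at least one correct sample.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k), the probability
    that at least one of k samples drawn from n generations (c correct)
    solves the problem."""
    if n - c < k:
        return 1.0  # too few incorrect samples to fill a size-k subset
    return 1.0 - comb(n - c, k) / comb(n, k)

# 10 generations, 3 correct: pass@1 = 0.3, and pass@5 is much higher.
print(pass_at_k(10, 3, 1))
print(pass_at_k(10, 3, 5))
```

Note that with k = 1 this reduces to the empirical accuracy c/n, matching the Pass@1 definition above.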
Gradient masking: A technique where gradients for certain tokens are zeroed out to prevent them from being updated during training.
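The mechanics are simple: multiply per-token gradients (or per-token losses) by a 0/1 mask so masked positions contribute nothing to the update. A minimal sketch with invented numbers (frameworks do this on tensors, e.g. via a loss mask before reduction):

```python
def apply_gradient_mask(grads, mask):
    """Zero out gradients at masked positions (mask entry 0), e.g. for
    prompt tokens or truncated samples, so they receive no update."""
    return [g * m for g, m in zip(grads, mask)]

# Mask out the second token's gradient:
print(apply_gradient_mask([0.4, 1.1, 0.7], [1, 0, 1]))  # -> [0.4, 0.0, 0.7]
```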
SFT: Supervised Fine-Tuning—the initial training phase on labeled data before RL.
Clipping: Restricting the ratio of the new policy probability to the old policy probability to prevent destructively large updates.
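For a single token, the PPO-style clipped surrogate combines the ratio and advantage like this (a sketch with hypothetical names; the asymmetric bounds are how DAPO's decoupled "clip-higher" fits in, with eps_high > eps_low):

```python
def clipped_objective(ratio, advantage, eps_low=0.2, eps_high=0.2):
    """PPO-style clipped surrogate: min(r*A, clip(r, 1-eps_low, 1+eps_high)*A).
    Clipping the ratio caps how much a single update can move the policy."""
    clipped_ratio = max(1.0 - eps_low, min(ratio, 1.0 + eps_high))
    return min(ratio * advantage, clipped_ratio * advantage)

# A ratio of 2.0 with positive advantage is capped at (1 + eps_high) * A:
print(clipped_objective(2.0, 1.0))  # -> 1.2
# A ratio inside the clip range passes through unchanged:
print(clipped_objective(1.1, 1.0))
```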