X Fei, S Wang, S Wei, Y Nie, W Shi, H Feng, C Huang
arXiv, July 2025
RL · Reasoning
📝 Paper Summary
Language Model Training · Reinforcement Learning (RLHF/RLAIF) · Reasoning
PCL extends training beyond the standard end-of-sequence token, allowing models to generate and learn from hidden self-evaluations and reward predictions that are discarded during inference to maintain efficiency.
Core Problem
Standard LLM training stops immediately at the end-of-sequence token, preventing models from learning to reflect on or evaluate their completed outputs.
Why it matters:
Current methods like SFT foster passive mimicry rather than active self-assessment
Reinforcement learning usually relies on opaque external reward models, lacking transparency
Valuable 'post-thinking' opportunities for quality assessment are wasted by premature sequence termination
Concrete Example: A model answers a math problem incorrectly but stops generating immediately. It never gets the chance to review its steps, recognize the logical error, and assign itself a low reward score, a signal that would reinforce an internal concept of 'bad reasoning'.
Key Novelty
Post-Completion Learning (PCL)
Defines a 'post-completion' space after the answer where the model generates self-evaluations and reward predictions during training
Uses a 'white-box' RL approach where the model explicitly learns to calculate its own rewards (accuracy, format, consistency) rather than relying on a black-box external model
Uses a temporary stop token (<post-completion>) to separate inference content from training-only reflection content, ensuring zero inference overhead
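The separation above can be sketched as a string partition: during training the full sequence includes the reflection region, while at inference generation halts at the stop token. A minimal illustration, assuming the region delimiters (`<think>`, `<answer>`, `<eval>`, `<reward>`) — the paper's exact formatting may differ:

```python
# Hypothetical sketch of PCL's sequence partition. Token names other
# than <post-completion> are assumptions for illustration.
STOP = "<post-completion>"

def training_sequence(think, answer, evaluation, reward):
    """Full sequence seen during training: reasoning region, the stop
    token, then the training-only reflection region."""
    return (f"<think>{think}</think><answer>{answer}</answer>{STOP}"
            f"<eval>{evaluation}</eval><reward>{reward}</reward>")

def inference_view(sequence):
    """At deployment, generation halts at the stop token, so the
    reflection region is never produced: zero inference overhead."""
    return sequence.split(STOP)[0]

full = training_sequence("2+2=4", "4", "arithmetic is correct", 1.0)
print(inference_view(full))  # -> "<think>2+2=4</think><answer>4</answer>"
```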
Architecture
Figure: Conceptual comparison between traditional black-box RL and PCL's white-box RL, plus the PCL sequence structure.
Evaluation Highlights
Consistent performance improvements over traditional SFT and RL methods on reasoning tasks (exact figures are not included in the extracted text)
Validates effectiveness through 'white-box' reinforcement learning where models internalize reward functions
Maintains inference efficiency by stopping generation before the self-evaluation block during deployment
Breakthrough Assessment
7/10
Cleverly utilizes the 'ignored' space after generation for training signals. It effectively combines self-correction principles with efficient inference, addressing the 'inference cost' bottleneck of methods like Reflexion.
⚙️ Technical Details
Problem Definition
Setting: Language model training for reasoning tasks optimizing both generation quality and self-evaluation accuracy
Inputs: Input prompt x
Outputs: Reasoning chain 'think', answer, self-evaluation, and predicted reward
Pipeline Flow
Input Processing
Reasoning Generation (Think + Answer)
Evaluation Generation (Evaluation + Reward)
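The three stages can be sketched end to end with a stub in place of the language model (all function names here are assumptions, not the paper's code):

```python
# Minimal sketch of the PCL pipeline; a dict lookup stands in for the
# shared-weight language model.
def generate(model, prompt, stop):
    """Generate until the `stop` token (stubbed as a lookup)."""
    return model[(prompt, stop)]

def pcl_pipeline(model, x):
    # 1. Input processing: the prompt x is passed through as-is.
    # 2. Reasoning generation: think + answer, halting at <post-completion>.
    reasoning = generate(model, x, stop="<post-completion>")
    # 3. Evaluation generation (training only): the same model continues
    #    past the stop token to produce self-evaluation + predicted reward.
    reflection = generate(model, x + reasoning, stop="<eos>")
    return reasoning, reflection

# Toy stub model behaving like a lookup table:
stub = {
    ("Q: 2+2?", "<post-completion>"): "<think>2+2=4</think><answer>4</answer>",
    ("Q: 2+2?<think>2+2=4</think><answer>4</answer>", "<eos>"):
        "<eval>correct</eval><reward>1.0</reward>",
}
reasoning, reflection = pcl_pipeline(stub, "Q: 2+2?")
```

At inference, only stage 2 runs; stage 3 exists solely to create training signal.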
System Modules
Reasoning Region Generator
Generates the problem-solving process and final answer
Model or implementation: Language Model (e.g., Llama/Qwen based)
Reflection Region Generator
Generates self-assessment and predicts rewards (not used during inference)
Model or implementation: Same Language Model (shared weights)
Novel Architectural Elements
Sequence structure partition: Reasoning Region vs. Reflection Region separated by <post-completion> token
Internalized Reward Model: The model itself predicts the reward score as part of the sequence generation
Modeling
Base Model: Not explicitly specified (experiments mention 'several language models' generally)
Training Method: Hybrid SFT + GRPO (Group Relative Policy Optimization)
Objective Functions:
Purpose: SFT for Reasoning.
Formally: Cross-entropy loss on think and answer tokens only.
Purpose: SFT for Evaluation.
Formally: Cross-entropy loss on evaluation and reward tokens (conditioned on fixed reasoning context).
Purpose: Reinforcement Learning.
Formally: GRPO loss maximizing a combined reward signal (Accuracy + Format + Consistency).
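The three objectives above can be written out explicitly (notation is assumed here; the paper's exact symbols may differ):

```latex
% Reasoning SFT: cross-entropy on think + answer tokens only
\mathcal{L}_{\mathrm{SFT}}^{\mathrm{reason}}
  = -\sum_{t \in \mathcal{T}_{\mathrm{think}} \cup \mathcal{T}_{\mathrm{ans}}}
     \log \pi_\theta\!\left(y_t \mid x,\, y_{<t}\right)

% Evaluation SFT: cross-entropy on eval + reward tokens,
% conditioned on the fixed reasoning context y_reason
\mathcal{L}_{\mathrm{SFT}}^{\mathrm{eval}}
  = -\sum_{t \in \mathcal{T}_{\mathrm{eval}} \cup \mathcal{T}_{\mathrm{rew}}}
     \log \pi_\theta\!\left(y_t \mid x,\, y_{\mathrm{reason}},\, y_{<t}\right)

% GRPO: maximize the combined reward via a group-relative advantage
r = r_{\mathrm{acc}} + r_{\mathrm{fmt}} + r_{\mathrm{cons}}, \qquad
A_i = \frac{r_i - \operatorname{mean}(r_1,\dots,r_G)}
           {\operatorname{std}(r_1,\dots,r_G)}
```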
Training Data:
Teacher model generates PCL-format data via In-Context Learning (ICL)
Automatic validation retains samples where the model correctly evaluates answer accuracy (even if the answer itself is wrong)
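The validation rule above amounts to keeping a sample exactly when its self-evaluation agrees with the ground truth about whether the answer is correct. A hedged sketch (field names are assumptions):

```python
# Hedged sketch of the automatic validation filter; "sample" field
# names are hypothetical, not the paper's schema.
def keep_sample(sample, gold_answer):
    """Retain a teacher-generated PCL sample only if its self-evaluation
    correctly judges the answer's accuracy. A wrong answer is still kept
    as long as the evaluation honestly flags it as wrong."""
    answer_is_correct = (sample["answer"] == gold_answer)
    evaluation_says_correct = (sample["predicted_reward"] >= 0.5)
    return answer_is_correct == evaluation_says_correct

good = {"answer": "4", "predicted_reward": 1.0}    # right answer, judged right
honest = {"answer": "5", "predicted_reward": 0.0}  # wrong answer, judged wrong
deluded = {"answer": "5", "predicted_reward": 1.0} # wrong answer, judged right
assert keep_sample(good, "4") and keep_sample(honest, "4")
assert not keep_sample(deluded, "4")
```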
Key Hyperparameters:
kl_beta: 0.04
group_size: 8 (for GRPO sampling)
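GRPO's group-relative advantage, using the reported group size, can be sketched as follows (illustrative only; this is the standard GRPO normalization, not the paper's exact code):

```python
import statistics

KL_BETA = 0.04   # reported kl_beta (weights the KL penalty in the loss)
GROUP_SIZE = 8   # reported number of sampled completions per prompt

def group_advantages(rewards):
    """GRPO normalizes each completion's reward against its group's mean
    and standard deviation, replacing a learned value function."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard against zero std
    return [(r - mu) / sigma for r in rewards]

# Eight sampled completions for one prompt, with combined rewards:
rewards = [1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 1.0]
adv = group_advantages(rewards)
# Above-average completions get positive advantage, below-average negative.
```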
Compute: Not reported in the paper
Comparison to Prior Work
vs. Self-Refine/Reflexion: PCL internalizes evaluation during training for zero-overhead inference, whereas Self-Refine/Reflexion require expensive test-time generation cycles
vs. RLHF (Standard): PCL uses 'white-box' RL where the model learns to calculate the reward itself, rather than optimizing against an opaque external reward model
vs. STaR [not cited in paper]: PCL explicitly trains a reflection/evaluation component, whereas STaR filters positive reasoning traces without explicit self-evaluation training
Limitations
No specific model sizes or dataset names provided in the text
Relies on a teacher model to bootstrap the initial PCL training data
The theoretical convergence analysis assumes bounded gradients and Lipschitz continuity
Success depends on the model's ability to 'learn' the reward function logic, which may be difficult for complex subjective tasks
Reproducibility
Code availability is not provided. The paper describes the data construction process (using a teacher model with ICL) and the reward functions (Accuracy, Format, Consistency) in detail, which aids reimplementation, but specific datasets and base models are not enumerated in the provided text.
📊 Experiments & Results
Evaluation Setup
Reasoning tasks evaluated on output quality and self-evaluation accuracy
Benchmarks:
Not explicitly named in text (Reasoning tasks)
Metrics:
Accuracy Reward (0/1)
Format Reward (Completeness of think/answer/eval/reward)
Consistency Reward (penalizes the L1 distance between the model's self-predicted reward and the true score)
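The three reward components can be sketched as simple functions. The exact formulas (especially for format completeness) are assumptions consistent with the descriptions above, not the paper's precise definitions:

```python
def accuracy_reward(answer, gold):
    """0/1 exact-match accuracy."""
    return 1.0 if answer.strip() == gold.strip() else 0.0

def format_reward(text, tags=("think", "answer", "eval", "reward")):
    """Fraction of required sections present (a completeness proxy;
    the paper's exact formula is not given in the extracted text)."""
    present = sum(f"<{t}>" in text and f"</{t}>" in text for t in tags)
    return present / len(tags)

def consistency_reward(predicted, true_score):
    """1 minus the L1 distance between the self-predicted reward and
    the true score (assumes scores lie in [0, 1])."""
    return 1.0 - abs(predicted - true_score)

def combined_reward(text, answer, gold, predicted):
    acc = accuracy_reward(answer, gold)
    return acc + format_reward(text) + consistency_reward(predicted, acc)
```

A fully correct, well-formatted, accurately self-assessed sample scores the maximum of 3.0 under this sketch.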
Statistical methodology: Not explicitly reported in the paper
Main Takeaways
PCL achieves consistent performance improvements over traditional SFT and RL methods on reasoning tasks (qualitative statement, exact numbers not in text)
Ablation studies validate the effectiveness of the post-completion space learning
Theoretical analysis suggests PCL has better sample complexity than sequential SFT->RL training by avoiding catastrophic forgetting
📚 Prerequisite Knowledge
Prerequisites
Reinforcement Learning from Human Feedback (RLHF)
Supervised Fine-Tuning (SFT)
Chain-of-thought (CoT) prompting
Key Terms
PCL: Post-Completion Learning—a training framework that utilizes sequence space after the standard end-of-sequence token for self-evaluation
<post-completion>: A special token inserted between the answer and the self-evaluation section; serves as a stop token during inference but a continuation point during training
SFT: Supervised Fine-Tuning—training a model on high-quality demonstration data
RLHF: Reinforcement Learning from Human Feedback—optimizing models using reward signals derived from human preferences
GRPO: Group Relative Policy Optimization—an RL algorithm that optimizes policies based on group-wise preferences without a separate value function
White-box RL: A reinforcement learning approach where the model explicitly learns the logic of the reward function rather than treating it as a black box
ICL: In-Context Learning—providing examples within the prompt to guide model behavior
CoT: Chain-of-thought—a prompting technique where models generate intermediate reasoning steps before the final answer