X Fei, S Wang, S Wei, Y Nie, W Shi, H Feng, C Huang
arXiv, July 2025
RL · Reasoning
📝 Paper Summary
Language Model Training · Reinforcement Learning (RLHF/RLAIF) · Reasoning
PCL extends training beyond the standard end-of-sequence token, allowing models to generate and learn from hidden self-evaluations and reward predictions that are discarded during inference to maintain efficiency.
Core Problem
Standard LLM training stops immediately at the end-of-sequence token, preventing models from learning to reflect on or evaluate their completed outputs.
Why it matters:
Current methods like SFT foster passive mimicry rather than active self-assessment
Reinforcement learning usually relies on opaque external reward models, lacking transparency
Valuable 'post-thinking' opportunities for quality assessment are wasted by premature sequence termination
Concrete Example: A model answers a math problem incorrectly but stops generating immediately. It never gets the chance to review its steps, recognize the logical error, and assign itself a low reward score, a signal that would reinforce an internal concept of 'bad reasoning'.
Key Novelty
Post-Completion Learning (PCL)
Defines a 'post-completion' space after the answer where the model generates self-evaluations and reward predictions during training
Uses a 'white-box' RL approach where the model explicitly learns to calculate its own rewards (accuracy, format, consistency) rather than relying on a black-box external model
Uses a temporary stop token (<post-completion>) to separate inference content from training-only reflection content, ensuring zero inference overhead
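The separation above can be sketched as a string partition: during training the full sequence includes the reflection region, while at inference generation halts at the stop token. A minimal illustration, assuming the region delimiters (`<think>`, `<answer>`, `<eval>`, `<reward>`) — the paper's exact formatting may differ:

```python
# Hypothetical sketch of PCL's sequence partition. Token names other
# than <post-completion> are assumptions for illustration.
STOP = "<post-completion>"

def training_sequence(think, answer, evaluation, reward):
    """Full sequence seen during training: reasoning region, the stop
    token, then the training-only reflection region."""
    return (f"<think>{think}</think><answer>{answer}</answer>{STOP}"
            f"<eval>{evaluation}</eval><reward>{reward}</reward>")

def inference_view(sequence):
    """At deployment, generation halts at the stop token, so the
    reflection region is never produced: zero inference overhead."""
    return sequence.split(STOP)[0]

full = training_sequence("2+2=4", "4", "arithmetic is correct", 1.0)
print(inference_view(full))  # -> "<think>2+2=4</think><answer>4</answer>"
```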
Architecture
Figure: Conceptual comparison between traditional black-box RL and PCL's white-box RL, plus the PCL sequence structure.
Evaluation Highlights
Consistent performance improvements over traditional SFT and RL methods on reasoning tasks (exact figures are not included in the extracted text)
Validates effectiveness through 'white-box' reinforcement learning where models internalize reward functions
Maintains inference efficiency by stopping generation before the self-evaluation block during deployment
Breakthrough Assessment
7/10
Cleverly utilizes the 'ignored' space after generation for training signals. It effectively combines self-correction principles with efficient inference, addressing the 'inference cost' bottleneck of methods like Reflexion.
⚙️ Technical Details
Problem Definition
Setting: Language model training for reasoning tasks optimizing both generation quality and self-evaluation accuracy
Inputs: Input prompt x
Outputs: Reasoning chain 'think', answer, self-evaluation, and predicted reward
Pipeline Flow
Input Processing
Reasoning Generation (Think + Answer)
Evaluation Generation (Evaluation + Reward)
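The three stages can be sketched end to end with a stub in place of the language model (all function names here are assumptions, not the paper's code):

```python
# Minimal sketch of the PCL pipeline; a dict lookup stands in for the
# shared-weight language model.
def generate(model, prompt, stop):
    """Generate until the `stop` token (stubbed as a lookup)."""
    return model[(prompt, stop)]

def pcl_pipeline(model, x):
    # 1. Input processing: the prompt x is passed through as-is.
    # 2. Reasoning generation: think + answer, halting at <post-completion>.
    reasoning = generate(model, x, stop="<post-completion>")
    # 3. Evaluation generation (training only): the same model continues
    #    past the stop token to produce self-evaluation + predicted reward.
    reflection = generate(model, x + reasoning, stop="<eos>")
    return reasoning, reflection

# Toy stub model behaving like a lookup table:
stub = {
    ("Q: 2+2?", "<post-completion>"): "<think>2+2=4</think><answer>4</answer>",
    ("Q: 2+2?<think>2+2=4</think><answer>4</answer>", "<eos>"):
        "<eval>correct</eval><reward>1.0</reward>",
}
reasoning, reflection = pcl_pipeline(stub, "Q: 2+2?")
```

At inference, only stage 2 runs; stage 3 exists solely to create training signal.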
System Modules
Reasoning Region Generator
Generates the problem-solving process and final answer
Model or implementation: Language Model (e.g., Llama/Qwen based)
Reflection Region Generator
Generates self-assessment and predicts rewards (not used during inference)
Model or implementation: Same Language Model (shared weights)
Novel Architectural Elements
Sequence structure partition: Reasoning Region vs. Reflection Region separated by <post-completion> token
Internalized Reward Model: The model itself predicts the reward score as part of the sequence generation
Modeling
Base Model: Not explicitly specified (experiments mention 'several language models' generally)
Training Method: Hybrid SFT + GRPO (Group Relative Policy Optimization)
Objective Functions:
Purpose: SFT for Reasoning.
Formally: Cross-entropy loss on think and answer tokens only.
Purpose: SFT for Evaluation.
Formally: Cross-entropy loss on evaluation and reward tokens (conditioned on fixed reasoning context).
Purpose: Reinforcement Learning.
Formally: GRPO loss maximizing a combined reward signal (Accuracy + Format + Consistency).
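The three objectives above can be written out explicitly (notation is assumed here; the paper's exact symbols may differ):

```latex
% Reasoning SFT: cross-entropy on think + answer tokens only
\mathcal{L}_{\mathrm{SFT}}^{\mathrm{reason}}
  = -\sum_{t \in \mathcal{T}_{\mathrm{think}} \cup \mathcal{T}_{\mathrm{ans}}}
     \log \pi_\theta\!\left(y_t \mid x,\, y_{<t}\right)

% Evaluation SFT: cross-entropy on eval + reward tokens,
% conditioned on the fixed reasoning context y_reason
\mathcal{L}_{\mathrm{SFT}}^{\mathrm{eval}}
  = -\sum_{t \in \mathcal{T}_{\mathrm{eval}} \cup \mathcal{T}_{\mathrm{rew}}}
     \log \pi_\theta\!\left(y_t \mid x,\, y_{\mathrm{reason}},\, y_{<t}\right)

% GRPO: maximize the combined reward via a group-relative advantage
r = r_{\mathrm{acc}} + r_{\mathrm{fmt}} + r_{\mathrm{cons}}, \qquad
A_i = \frac{r_i - \operatorname{mean}(r_1,\dots,r_G)}
           {\operatorname{std}(r_1,\dots,r_G)}
```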
Training Data:
Teacher model generates PCL-format data via In-Context Learning (ICL)
Automatic validation retains samples where the model correctly evaluates answer accuracy (even if the answer itself is wrong)
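The validation rule above amounts to keeping a sample exactly when its self-evaluation agrees with the ground truth about whether the answer is correct. A hedged sketch (field names are assumptions):

```python
# Hedged sketch of the automatic validation filter; "sample" field
# names are hypothetical, not the paper's schema.
def keep_sample(sample, gold_answer):
    """Retain a teacher-generated PCL sample only if its self-evaluation
    correctly judges the answer's accuracy. A wrong answer is still kept
    as long as the evaluation honestly flags it as wrong."""
    answer_is_correct = (sample["answer"] == gold_answer)
    evaluation_says_correct = (sample["predicted_reward"] >= 0.5)
    return answer_is_correct == evaluation_says_correct

good = {"answer": "4", "predicted_reward": 1.0}    # right answer, judged right
honest = {"answer": "5", "predicted_reward": 0.0}  # wrong answer, judged wrong
deluded = {"answer": "5", "predicted_reward": 1.0} # wrong answer, judged right
assert keep_sample(good, "4") and keep_sample(honest, "4")
assert not keep_sample(deluded, "4")
```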
Key Hyperparameters:
kl_beta: 0.04
group_size: 8 (for GRPO sampling)
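GRPO's group-relative advantage, using the reported group size, can be sketched as follows (illustrative only; this is the standard GRPO normalization, not the paper's exact code):

```python
import statistics

KL_BETA = 0.04   # reported kl_beta (weights the KL penalty in the loss)
GROUP_SIZE = 8   # reported number of sampled completions per prompt

def group_advantages(rewards):
    """GRPO normalizes each completion's reward against its group's mean
    and standard deviation, replacing a learned value function."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard against zero std
    return [(r - mu) / sigma for r in rewards]

# Eight sampled completions for one prompt, with combined rewards:
rewards = [1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 1.0]
adv = group_advantages(rewards)
# Above-average completions get positive advantage, below-average negative.
```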
Compute: Not reported in the paper
Comparison to Prior Work
vs. Self-Refine/Reflexion: PCL internalizes evaluation during training for zero-overhead inference, whereas Self-Refine/Reflexion require expensive test-time generation cycles
vs. RLHF (Standard): PCL uses 'white-box' RL where the model learns to calculate the reward itself, rather than optimizing against an opaque external reward model
vs. STaR [not cited in paper]: PCL explicitly trains a reflection/evaluation component, whereas STaR filters positive reasoning traces without explicit self-evaluation training
Limitations
No specific model sizes or dataset names provided in the text
Relies on a teacher model to bootstrap the initial PCL training data
The theoretical convergence analysis assumes bounded gradients and Lipschitz continuity
Success depends on the model's ability to 'learn' the reward function logic, which may be difficult for complex subjective tasks
Reproducibility
Code availability is not provided. The paper describes the data construction process (using a teacher model with ICL) and the reward functions (Accuracy, Format, Consistency) in detail, which aids reimplementation, but specific datasets and base models are not enumerated in the provided text.
📊 Experiments & Results
Evaluation Setup
Reasoning tasks evaluated on output quality and self-evaluation accuracy
Benchmarks:
Not explicitly named in text (Reasoning tasks)
Metrics:
Accuracy Reward (0/1)
Format Reward (Completeness of think/answer/eval/reward)
Consistency Reward (penalizes the L1 distance between the model's self-predicted reward and the true score)
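The three reward components can be sketched as simple functions. The exact formulas (especially for format completeness) are assumptions consistent with the descriptions above, not the paper's precise definitions:

```python
def accuracy_reward(answer, gold):
    """0/1 exact-match accuracy."""
    return 1.0 if answer.strip() == gold.strip() else 0.0

def format_reward(text, tags=("think", "answer", "eval", "reward")):
    """Fraction of required sections present (a completeness proxy;
    the paper's exact formula is not given in the extracted text)."""
    present = sum(f"<{t}>" in text and f"</{t}>" in text for t in tags)
    return present / len(tags)

def consistency_reward(predicted, true_score):
    """1 minus the L1 distance between the self-predicted reward and
    the true score (assumes scores lie in [0, 1])."""
    return 1.0 - abs(predicted - true_score)

def combined_reward(text, answer, gold, predicted):
    acc = accuracy_reward(answer, gold)
    return acc + format_reward(text) + consistency_reward(predicted, acc)
```

A fully correct, well-formatted, accurately self-assessed sample scores the maximum of 3.0 under this sketch.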
Statistical methodology: Not explicitly reported in the paper
Main Takeaways
PCL achieves consistent performance improvements over traditional SFT and RL methods on reasoning tasks (qualitative statement, exact numbers not in text)
Ablation studies validate the effectiveness of the post-completion space learning
Theoretical analysis suggests PCL has better sample complexity than sequential SFT->RL training by avoiding catastrophic forgetting
📚 Prerequisite Knowledge
Prerequisites
Reinforcement Learning from Human Feedback (RLHF)
Supervised Fine-Tuning (SFT)
Chain-of-thought (CoT) prompting
Key Terms
PCL: Post-Completion Learning—a training framework that utilizes sequence space after the standard end-of-sequence token for self-evaluation
<post-completion>: A special token inserted between the answer and the self-evaluation section; serves as a stop token during inference but a continuation point during training
SFT: Supervised Fine-Tuning—training a model on high-quality demonstration data
RLHF: Reinforcement Learning from Human Feedback—optimizing models using reward signals derived from human preferences
GRPO: Group Relative Policy Optimization—an RL algorithm that optimizes policies based on group-wise preferences without a separate value function
White-box RL: A reinforcement learning approach where the model explicitly learns the logic of the reward function rather than treating it as a black box
ICL: In-Context Learning—providing examples within the prompt to guide model behavior
CoT: Chain-of-thought—a prompting technique where models generate intermediate reasoning steps before the final answer