RLHF: Reinforcement Learning from Human Feedback—fine-tuning models using rewards derived from human preferences
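A common first stage of RLHF is training the reward model on human preference pairs with a Bradley–Terry loss. A minimal sketch of that loss (the function name is illustrative, not from a specific library):

```python
import math

def reward_model_loss(r_chosen, r_rejected):
    """Bradley-Terry preference loss for one human-labeled pair:
    maximize sigmoid(r_chosen - r_rejected), i.e. push the reward
    model to score the preferred response higher."""
    margin = r_chosen - r_rejected
    # Negative log-sigmoid of the margin; zero margin gives log(2).
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

The trained reward model then supplies scalar rewards for the RL fine-tuning stage.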
RLAIF: Reinforcement Learning from AI Feedback—using AI models to generate preferences or rewards for training other models
DPO: Direct Preference Optimization—a stable method that optimizes a policy directly on preference pairs with a classification-style loss, avoiding both a separate reward model and PPO
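The DPO loss for a single preference pair can be sketched in a few lines, assuming you already have summed log-probabilities of the chosen and rejected responses under the policy and a frozen reference model:

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair.
    pi_* / ref_*: log-probs of the chosen and rejected responses
    under the policy and the frozen reference model."""
    # Implicit reward margin: how much more the policy (relative to
    # the reference) prefers the chosen response over the rejected one.
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # Negative log-sigmoid: loss shrinks as the margin grows.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When policy and reference agree exactly, the margin is zero and the loss is log(2); increasing the chosen response's relative likelihood drives it toward zero.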
GRPO: Group Relative Policy Optimization—an RL method used in models like DeepSeek R1 that samples a group of outputs per prompt and scores each one relative to the group's average reward, removing the need for a learned value model
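The core of GRPO's advantage computation is normalizing rewards within each group of sampled outputs, which can be sketched as:

```python
def group_relative_advantages(rewards, eps=1e-8):
    """Advantage of each sampled output relative to its group:
    a_i = (r_i - mean(r)) / (std(r) + eps).
    Outputs above the group average get positive advantage."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```

These advantages then weight the policy-gradient update in place of a critic's value estimates.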
PPO: Proximal Policy Optimization—an RL algorithm that updates policies in small, constrained steps to ensure stability
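The "small, constrained steps" come from PPO's clipped surrogate objective, sketched here for a single action:

```python
def ppo_clip_objective(ratio, advantage, eps=0.2):
    """PPO clipped surrogate for one action.
    ratio: pi_new(a|s) / pi_old(a|s); eps: clip range.
    Returns min(ratio * A, clip(ratio, 1-eps, 1+eps) * A), which
    removes the incentive to move the ratio outside [1-eps, 1+eps]."""
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)
```

With a positive advantage, pushing the ratio above 1 + eps yields no extra objective; with a negative advantage, the clipped term bounds how hard the action is suppressed.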
CoT: Chain-of-Thought—a prompting or training technique where models generate intermediate reasoning steps before the final answer
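In its simplest zero-shot form, CoT prompting just appends a reasoning cue to the question; a minimal sketch (the wording of the cue is one common choice, not the only one):

```python
def cot_prompt(question):
    """Wrap a question with a zero-shot chain-of-thought cue so the
    model emits intermediate reasoning before its final answer."""
    return f"Q: {question}\nA: Let's think step by step."
```

Few-shot variants instead prepend worked examples whose answers include the reasoning steps.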
LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique that updates low-rank matrices instead of full weights
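The LoRA update keeps the pretrained weight W frozen and learns a low-rank correction B @ A; merging them back gives W' = W + (alpha / r) * B @ A. A tiny sketch using nested lists (helper names are illustrative):

```python
def matmul(X, Y):
    """Plain matrix multiply on nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def merge_lora(W, A, B, alpha, r):
    """Merged weight W' = W + (alpha / r) * (B @ A).
    W: d_out x d_in frozen weight; B: d_out x r; A: r x d_in.
    Only A and B (r * (d_out + d_in) parameters) are trained,
    versus d_out * d_in for full fine-tuning."""
    BA = matmul(B, A)
    scale = alpha / r
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, BA)]
```

Because the correction has rank at most r, the trainable parameter count grows linearly rather than quadratically in the layer width.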
RAG: Retrieval-Augmented Generation—enhancing model outputs by retrieving relevant external documents during inference
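The RAG pattern is: retrieve relevant documents for the query, then stuff them into the prompt as context. A toy sketch using word overlap as a stand-in for real embedding similarity search (function names and prompt format are illustrative):

```python
def retrieve(query, docs, k=2):
    """Rank documents by word overlap with the query (a crude proxy
    for embedding similarity) and return the top k."""
    q_words = set(query.lower().split())
    return sorted(docs,
                  key=lambda d: len(q_words & set(d.lower().split())),
                  reverse=True)[:k]

def build_prompt(query, docs, k=2):
    """Prepend retrieved passages as context for the generator."""
    context = "\n".join(retrieve(query, docs, k))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
```

Production systems replace the overlap score with dense embeddings and an approximate nearest-neighbor index, but the retrieve-then-generate flow is the same.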
SCST: Self-Critical Sequence Training—an RL method where the model's own greedy decoding serves as the baseline for policy updates
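The self-critical idea reduces to REINFORCE with the greedy decode's reward as the baseline; per sampled sequence (function name is illustrative):

```python
def scst_loss(sample_logprob, sample_reward, greedy_reward):
    """Self-critical sequence training loss for one sample:
    -(r_sample - r_greedy) * log p(sampled sequence).
    Samples that beat the model's own greedy decode are reinforced;
    worse samples are suppressed. No learned baseline is needed."""
    return -(sample_reward - greedy_reward) * sample_logprob
```

Since the baseline comes from the model itself, the gradient vanishes exactly when sampling does no better than greedy decoding.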
Test-time Scaling: Techniques applied during inference (like increasing search depth or width) to improve performance without retraining parameters
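One of the simplest test-time scaling strategies is best-of-n sampling: spend more inference compute drawing candidates and keep the best one under some scorer. A generic sketch (the callables are placeholders for a sampler and a verifier/reward model):

```python
def best_of_n(generate, score, prompt, n=8):
    """Draw n candidate outputs for the prompt and return the one
    with the highest score. More compute at inference time, no
    change to the model's parameters."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=score)
```

Increasing n trades latency and cost for output quality; deeper search or longer reasoning chains are the sequential analogues of the same idea.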
MDP: Markov Decision Process—a mathematical framework for modeling decision-making where outcomes are partly random and partly under the control of a decision maker
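A finite MDP can be solved exactly by value iteration, repeatedly applying the Bellman optimality backup; a compact sketch over dict-based transition and reward tables:

```python
def value_iteration(states, actions, P, R, gamma=0.9, iters=100):
    """Value iteration on a finite MDP.
    P[s][a]: list of (probability, next_state) pairs.
    R[s][a]: expected immediate reward for taking a in s.
    Returns the approximate optimal state-value function V*."""
    V = {s: 0.0 for s in states}
    for _ in range(iters):
        # Bellman optimality backup for every state.
        V = {s: max(R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
                    for a in actions)
             for s in states}
    return V
```

For gamma < 1 the backup is a contraction, so the iterates converge geometrically to the unique optimal values.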