
LLM Post-Training: A Deep Dive into Reasoning Large Language Models

Komal Kumar, Tajamul Ashraf, Omkar Thawakar, Rao Muhammad Anwer, Hisham Cholakkal, Mubarak Shah, Ming-Hsuan Yang, Phillip H. S. Torr, Fahad Shahbaz Khan, Salman Khan
Mohamed bin Zayed University of Artificial Intelligence; Center for Research in Computer Vision, University of Central Florida; University of California at Merced; Department of Engineering Science, University of Oxford
arXiv (2025)
Tags: RL · Reasoning · Pretraining · Benchmark

📝 Paper Summary

Keywords: LLM Post-training, Reinforcement Learning (RL) for LLMs, Test-time Scaling, Fine-tuning
A comprehensive survey structuring LLM post-training into three interconnected pillars—fine-tuning, reinforcement learning, and test-time scaling—to address limitations in reasoning, alignment, and adaptability.
Core Problem
Pre-trained LLMs suffer from hallucinations, lack logical consistency in extended discourse, and often fail to align with user intents or ethical standards.
Why it matters:
  • Models trained purely on next-token prediction struggle with complex reasoning and safety in ambiguous scenarios
  • Existing surveys often isolate specific techniques (like RLHF or reasoning) without addressing the holistic integration of fine-tuning, RL, and scaling needed for deployment
  • Critical challenges like catastrophic forgetting, reward hacking, and inference-time trade-offs remain barriers to reliable real-world application
Concrete Example: While an LLM can produce logically coherent-sounding text, it often stumbles on simple logical tasks because it relies on probabilistic patterns rather than explicit symbolic manipulation. Without post-training like RLHF or scaling, it may generate factually incorrect content or fail to correct errors dynamically.
Key Novelty
Integrated Taxonomy of Post-Training
  • Unifies Fine-Tuning, Reinforcement Learning, and Test-Time Scaling as interconnected optimization strategies rather than isolated steps
  • Connects historical RL methods (REINFORCE, SCST) to modern LLM breakthroughs (DeepSeek R1, GRPO), providing a lineage of reasoning capability
  • Categorizes 'reasoning' in LLMs as implicit/probabilistic rather than symbolic, framing it as a sequential decision-making problem solvable via RL
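The survey's framing of generation as a sequential decision-making problem traces back to classic policy-gradient methods such as REINFORCE. A minimal sketch of that idea, on a toy 3-armed bandit rather than a real LLM (the reward table and learning rate are illustrative assumptions): sampling an "action" (a token or completion) and nudging up its log-probability in proportion to reward is the same update that, scaled up, underlies modern RL fine-tuning.

```python
import math
import random

def softmax(logits):
    """Convert raw logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def reinforce_step(logits, rewards, lr=0.5):
    """One REINFORCE update: sample an action from the policy, then move
    logits along reward * grad(log pi(action))."""
    probs = softmax(logits)
    a = random.choices(range(len(probs)), weights=probs)[0]
    r = rewards[a]
    # d/d logit_i of log pi(a) = (1 if i == a else 0) - pi(i)
    return [l + lr * r * ((1.0 if i == a else 0.0) - probs[i])
            for i, l in enumerate(logits)]

random.seed(0)
logits = [0.0, 0.0, 0.0]
rewards = [0.0, 0.0, 1.0]  # only action 2 (the "correct" completion) is rewarded
for _ in range(200):
    logits = reinforce_step(logits, rewards)
probs = softmax(logits)
print(probs)  # probability mass concentrates on action 2
```

The same shape reappears in SCST and RLHF, with the scalar reward replaced by a learned or rule-based reward model over whole generated sequences.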
Evaluation Highlights
  • Catalogues over 30 modern models (e.g., DeepSeek-V2, GPT-4, Llama 3) and their specific post-training recipes (RLHF, RLAIF, DPO, GRPO)
  • Identifies DeepSeek R1 as a key example of applying RL without supervised fine-tuning on human annotations, using GRPO
  • Highlights the shift from standard RLHF with PPO toward newer methods such as DPO and GRPO in open-weight models like Qwen2 and Llama 3
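The GRPO recipe highlighted above replaces PPO's learned value critic with a group-relative baseline: sample several completions per prompt, score each with a (often rule-based) reward, and normalize each reward against the group's mean and standard deviation. A minimal sketch of that advantage computation, with illustrative function names and a toy 0/1 correctness reward as assumptions:

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: (r - mean(group)) / (std(group) + eps).
    No value network is needed; the group itself is the baseline."""
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: four sampled answers to one math prompt, scored 1 if correct else 0.
advs = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advs)  # correct answers receive positive advantage, incorrect negative
```

These advantages then weight a clipped policy-gradient objective much like PPO's, which is what lets DeepSeek R1-style training run without human preference labels.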
Breakthrough Assessment
8/10
Highly valuable for structuring a rapidly evolving field. Effectively bridges the gap between classic RL theory and modern LLM post-training practices, specifically highlighting the crucial role of test-time scaling.
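One concrete form of the test-time scaling the survey emphasizes is self-consistency: sample multiple reasoning chains at inference and keep the majority final answer, trading extra compute for reliability without any weight updates. A minimal sketch (the sampled answers are illustrative placeholders for extracted LLM outputs):

```python
from collections import Counter

def majority_vote(answers):
    """Self-consistency decoding: return the most frequent final answer
    among independently sampled completions."""
    return Counter(answers).most_common(1)[0][0]

# Final answers extracted from five sampled chains of thought:
print(majority_vote(["42", "41", "42", "43", "42"]))  # → 42
```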