LRM: Large Reasoning Model—an evolution of LLMs designed specifically for complex multi-step reasoning (e.g., OpenAI o1).
PRM: Process Reward Model—a model trained to score the correctness of intermediate reasoning steps rather than just the final answer.
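A common use of a PRM is reranking candidate solutions by their step scores. This is a minimal sketch, not any specific system's implementation: the per-step scores are assumed to come from some PRM, and aggregating by the minimum (the weakest step) is one common heuristic, since a single incorrect intermediate step typically invalidates the whole chain.

```python
def prm_rerank(candidates):
    """Pick the best candidate solution using per-step PRM scores.

    candidates: list of (answer, step_scores) pairs, where step_scores
    are correctness scores (0..1) for each intermediate reasoning step,
    assumed to come from a trained process reward model.
    """
    # Score each whole solution by its weakest step, then take the best.
    return max(candidates, key=lambda c: min(c[1]))[0]
```

For example, a solution whose final answer looks right but contains one low-scoring step loses to a solution with uniformly solid steps.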
CoT: Chain-of-Thought—a prompting technique where the model generates intermediate reasoning steps before the final answer.
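A CoT prompt can be sketched as a template: one worked exemplar whose answer spells out intermediate steps, followed by the new question. The exemplar text below is illustrative, not from any benchmark.

```python
def cot_prompt(question):
    """Build a few-shot chain-of-thought prompt for a new question."""
    # One worked exemplar whose answer shows explicit intermediate steps,
    # nudging the model to produce similar reasoning before its answer.
    exemplar = (
        "Q: A pen costs 2 dollars and a notebook costs 3 dollars. "
        "How much do 2 pens and 1 notebook cost?\n"
        "A: 2 pens cost 2 * 2 = 4 dollars. Adding 1 notebook at 3 dollars "
        "gives 4 + 3 = 7 dollars. The answer is 7."
    )
    return f"{exemplar}\nQ: {question}\nA: Let's think step by step."
```

The trailing "Let's think step by step." cue is the zero-shot CoT trigger phrase; combining it with a worked exemplar is one common variant.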
RLHF: Reinforcement Learning from Human Feedback—aligning models using rewards derived from human preferences.
DPO: Direct Preference Optimization—an algorithm that aligns a language model to preference data directly, without training a separate reward model or running a reinforcement learning loop.
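The DPO loss on a single preference pair can be sketched in plain Python. The inputs are sequence log-probabilities under the policy being trained and under the frozen reference policy; the values and the choice of β here are illustrative.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) response pair.

    Each argument is a full-sequence log-probability; `ref_*` values come
    from the frozen reference policy (typically the SFT model).
    """
    # Implicit rewards: beta-scaled log-ratios against the reference policy.
    chosen_margin = beta * (logp_chosen - ref_logp_chosen)
    rejected_margin = beta * (logp_rejected - ref_logp_rejected)
    # Negative log-sigmoid of the margin difference; minimized when the
    # policy favors the chosen response more than the reference does.
    return -math.log(sigmoid(chosen_margin - rejected_margin))
```

When policy and reference agree exactly, both margins are zero and the loss is log 2; raising the chosen response's probability relative to the reference lowers it.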
SFT: Supervised Fine-Tuning—training a pre-trained model on labeled instruction-response pairs.
In-context Learning: The ability of an LLM to adapt to a task given a few examples in the prompt without parameter updates.
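In-context learning is typically elicited with a few-shot prompt: input→output demonstrations followed by the query. A minimal sketch of such a prompt builder (the formatting convention is an assumption, not a standard):

```python
def few_shot_prompt(examples, query):
    """Format (input, output) demonstrations plus a query for an LLM.

    The model is expected to infer the task pattern from the examples
    alone—no gradient updates or fine-tuning are involved.
    """
    lines = [f"Input: {x}\nOutput: {y}" for x, y in examples]
    # Leave the final Output slot empty for the model to complete.
    lines.append(f"Input: {query}\nOutput:")
    return "\n\n".join(lines)
```

For instance, two uppercasing demonstrations followed by `fox` would typically lead the model to complete `FOX`, despite never being trained on that task explicitly.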
Test-time scaling: Improving model performance by allocating more computation during inference—for example, generating longer reasoning traces or sampling multiple candidate answers.
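One simple form of test-time scaling is self-consistency: sample many answers and take the majority vote, trading extra inference compute for accuracy. The `noisy_solver` below is a hypothetical stand-in for a stochastic model, not a real one.

```python
import random
from collections import Counter

def noisy_solver(x, rng):
    # Hypothetical stochastic "model": returns the right answer (2 * x)
    # 70% of the time, otherwise an off-by-one mistake.
    return 2 * x if rng.random() < 0.7 else 2 * x + 1

def majority_vote(x, n_samples, seed=0):
    """Test-time scaling via self-consistency: sample n answers, return the mode."""
    rng = random.Random(seed)
    answers = [noisy_solver(x, rng) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]
```

With a single sample the result is a coin-weighted guess; with 25 samples the majority answer is almost always correct, illustrating how extra inference compute buys reliability.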
System 2 thinking: Deliberate, slow, and logical reasoning, as opposed to fast, intuitive System 1 thinking.