RLVR: Reinforcement Learning with Verifiable Rewards—optimizing models using ground-truth checkers (e.g., math answers, unit tests)
RLMT: Reinforcement Learning with Model-rewarded Thinking—the proposed method using preference models to reward reasoning traces in open domains
GRPO: Group Relative Policy Optimization—an RL algorithm that normalizes rewards within a group of outputs for the same prompt to reduce variance
CoT: Chain-of-Thought—a technique where models generate intermediate reasoning steps before the final answer
SFT: Supervised Fine-Tuning—training models on labeled (prompt, response) pairs before RL
PPO: Proximal Policy Optimization—a standard on-policy RL algorithm
DPO: Direct Preference Optimization—an offline preference-learning algorithm that is typically applied without an explicit reward model; adapted here for on-policy learning
RLHF: Reinforcement Learning from Human Feedback—aligning models using a reward model trained on human preferences
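The group-normalization step in the GRPO entry above can be sketched in a few lines. This is a minimal illustration, not the full algorithm: it shows only how rewards for a group of sampled outputs to the same prompt are converted to relative advantages (the helper name and the use of population statistics are assumptions for illustration).

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within a group of outputs for one prompt.

    Each output's advantage is its reward minus the group mean,
    divided by the group standard deviation, which centers and
    rescales rewards to reduce variance across prompts.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: four sampled responses to one prompt, scored by a reward model.
advs = group_relative_advantages([0.2, 0.8, 0.5, 0.5])
```

After normalization the advantages are centered at zero, so outputs scoring above the group mean are reinforced and those below it are penalized, regardless of the prompt's absolute reward scale.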