RLHF: Reinforcement Learning from Human Feedback—a method to align LLMs with human intent using a reward model
PPO: Proximal Policy Optimization—the standard RL algorithm used here, involving an Actor (policy) and Critic (value function)
Pipeline Bubbles: Idle periods in pipeline parallelism during which GPUs wait for activations from earlier stages or gradients from later stages
1F1B: One-Forward-One-Backward—a standard pipeline parallelism schedule that alternates one forward micro-batch with one backward micro-batch per stage to bound activation memory
Data Skewness: The phenomenon where a small percentage of generated responses are significantly longer than average, causing load imbalance
Micro-batches: Small chunks of a data batch processed sequentially in pipeline parallelism to reduce bubble size
Actor: The main LLM being trained to generate responses
Critic: The value model that estimates the expected reward of the Actor's actions
Reference Model: A frozen copy of the Actor used to calculate KL divergence penalties
Reward Model: A frozen model that assigns scores to the Actor's generated responses
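The roles of the four models above can be sketched in code. This is a minimal illustration of one common PPO-RLHF formulation (per-token KL penalty against the Reference Model, with the Reward Model's scalar score added at the final token, and advantages estimated from the Critic's values via GAE); the function names `kl_penalized_rewards` and `advantages` are hypothetical, and details vary across implementations.

```python
def kl_penalized_rewards(actor_logprobs, ref_logprobs, reward_score, beta=0.1):
    """Per-token rewards for PPO in RLHF.

    Each token incurs a KL penalty -beta * (log p_actor - log p_ref)
    against the frozen Reference Model; the Reward Model's scalar
    score is added at the final token of the generated response.
    """
    rewards = [-beta * (a - r) for a, r in zip(actor_logprobs, ref_logprobs)]
    rewards[-1] += reward_score
    return rewards


def advantages(rewards, values, gamma=1.0, lam=0.95):
    """Generalized Advantage Estimation from the Critic's value estimates.

    The Actor (policy) is then updated to raise the probability of
    tokens with positive advantage, subject to PPO's clipped objective.
    """
    adv, running = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        adv[t] = running
    return adv
```

In a full pipeline, `actor_logprobs` and `ref_logprobs` come from forward passes of the Actor and Reference Model over the generated tokens, `reward_score` from the Reward Model, and `values` from the Critic; the PPO loss then consumes the resulting advantages.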