DiffStitch: Boosting Offline Reinforcement Learning with Diffusion-based Trajectory Stitching

📝 Paper Summary

Offline Reinforcement Learning Data Augmentation

DiffStitch augments offline RL datasets by using diffusion models to generate realistic transition sub-trajectories that stitch low-reward trajectories to high-reward ones, enabling agents to learn paths to optimal regions.

Core Problem

Offline RL datasets often lack optimal trajectories or have disjoint low-reward and high-reward regions, preventing agents from learning how to reach high-reward states.

Why it matters:

Offline datasets in real-world scenarios (healthcare, autonomous driving) are often suboptimal and fragmented, limiting policy performance.
Existing augmentation methods generate short random branches without a target, failing to connect the agent to high-reward regions effectively.
Naive stitching (masking and filling) often creates physically impossible transitions because it cannot determine the correct number of time steps between disjoint states.

Concrete Example: Consider a navigation task where one trajectory starts at S but gets low reward, and another disjoint trajectory ends at goal G with high reward. A standard offline RL agent cannot learn to go S→G because no data connects them. Existing augmentations might branch out from S randomly but never hit the specific path to G.

Key Novelty

Diffusion-based Trajectory Stitching (DiffStitch)

Systematically connects any two trajectories (e.g., a low-reward start and high-reward end) by generating a bridging sub-trajectory.
Uses a 'step estimation' module to first estimate how many time steps are needed to transition between two disjoint states, keeping the stitched segment temporally consistent.
Generates the bridging states using a diffusion model conditioned on the estimated steps, then fills in actions and rewards with inverse dynamics models.

Architecture

The complete DiffStitch pipeline for generating augmented data.

Evaluation Highlights

DiffStitch improves IQL performance by +16.8% on average across D4RL locomotion datasets compared to vanilla IQL.
DiffStitch combined with TD3+BC achieves a score of 109.1 on hopper-medium-expert-v2, outperforming the vanilla baseline of 98.0.
On the sparse-reward antmaze-umaze-v0 dataset, DiffStitch boosts IQL success rate from 89.5 to 95.3.

Breakthrough Assessment

7/10

Solid contribution to data augmentation for offline RL. The explicit step estimation before generation addresses a key technical hurdle in trajectory stitching (temporal consistency). Improvements are consistent across multiple algorithm types.

⚙️ Technical Details

Problem Definition

Setting: Offline Reinforcement Learning in a Markov Decision Process (MDP)

Inputs: Offline dataset D containing static trajectories

Outputs: Augmented dataset D* containing original plus synthesized stitching trajectories

Pipeline Flow

Trajectory Selection: Pick a low-reward trajectory τ and a high-reward trajectory τ'
Step Estimation Module: Estimate the number of steps Δ needed to go from the end of τ to the start of τ'
State Stitching Module: Generate Δ states connecting the two trajectories
Trajectory Wrap-up Module: Predict actions and rewards for the new states
Qualification Module: Filter generated trajectories based on dynamic consistency
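
The five stages above can be sketched as a single function that stitches one pair of trajectories. This is a minimal illustration, not the paper's implementation: every name here (`stitch_pair`, `lerp_states`, and the callables passed in) is hypothetical, the learned models are replaced by plain Python callables, and Δ is assumed to count the generated intermediate states.

```python
from typing import Callable, List, Optional, Tuple

State = Tuple[float, ...]
Action = Tuple[float, ...]
Transition = Tuple[State, Action, float, State]  # (s, a, r, s_next)


def stitch_pair(
    end_state: State,      # last state of the low-reward trajectory
    start_state: State,    # first state of the high-reward trajectory
    estimate_steps: Callable[[State, State], int],
    generate_states: Callable[[State, State, int], List[State]],
    predict_action: Callable[[State, State], Action],
    predict_reward: Callable[[State, Action, State], float],
    is_qualified: Callable[[List[Transition]], bool],
) -> Optional[List[Transition]]:
    """Return bridging transitions, or None if the bridge is rejected."""
    delta = estimate_steps(end_state, start_state)            # step estimation
    bridge = generate_states(end_state, start_state, delta)   # state stitching
    path = [end_state, *bridge, start_state]
    transitions: List[Transition] = []
    for s, s_next in zip(path, path[1:]):                     # trajectory wrap-up
        a = predict_action(s, s_next)
        r = predict_reward(s, a, s_next)
        transitions.append((s, a, r, s_next))
    # Qualification: keep the bridge only if the filter accepts it.
    return transitions if is_qualified(transitions) else None


def lerp_states(s0: State, s1: State, n: int) -> List[State]:
    """Toy stand-in for the diffusion model: n evenly spaced states."""
    return [
        tuple(a + (b - a) * (k / (n + 1)) for a, b in zip(s0, s1))
        for k in range(1, n + 1)
    ]
```

With toy stand-ins (a fixed step count of 2, linear interpolation, action equal to the state difference, zero reward), stitching `(0.0,)` to `(3.0,)` yields three transitions. A real pipeline would plug trained models into the same slots.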

System Modules

Step Estimation Module

Determine the temporal distance (number of steps) required to transition between two disjoint states.

Model or implementation: Conditional Generative Model (predicts future states) + Cosine Similarity matching
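
One way to read "predict future states, then match by cosine similarity" is sketched below. It assumes the imagined rollout is already available as a list of states and picks the step whose state is most similar to the target. The function names and the argmax selection rule are illustrative assumptions; the paper's exact criterion may differ.

```python
import math
from typing import List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between u and v (0.0 if either vector is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def estimate_steps(imagined: List[Sequence[float]], target: Sequence[float]) -> int:
    """Return the 1-based step whose imagined state best matches the target."""
    if not imagined:
        raise ValueError("imagined rollout must contain at least one state")
    sims = [cosine_similarity(s, target) for s in imagined]
    best = max(range(len(sims)), key=sims.__getitem__)
    return best + 1
```

For an imagined rollout `[(1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]` and target `(0.1, 1.0)`, the third state is the closest match, so the estimate is 3 steps.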

State Stitching Module

Generate the sequence of states that bridge the gap.

Model or implementation: Diffusion Model (DiffStitch)

Trajectory Wrap-up Module

Fill in missing actions and rewards for the generated state sequence.

Model or implementation: Inverse Dynamics Model (f_ψ) and Reward Model (f_ϕ)

Qualification Module

Filter out unrealistic trajectories that violate environment dynamics.

Model or implementation: Pre-trained Dynamics Model (f_ω)
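
A dynamics-based filter of this kind might look like the sketch below. It assumes a trajectory is accepted only when every generated next state lies within a distance `threshold` of the dynamics model's prediction. The threshold, the Euclidean distance, and the "all transitions must pass" rule are assumptions made for illustration, and `dynamics` is a hypothetical stand-in for the learned model.

```python
import math
from typing import Callable, List, Sequence, Tuple

State = Sequence[float]
Action = Sequence[float]
Transition = Tuple[State, Action, float, State]  # (s, a, r, s_next)


def dynamics_error(s_next: State, predicted: State) -> float:
    """Euclidean distance between the generated and predicted next state."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(s_next, predicted)))


def is_qualified(
    transitions: List[Transition],
    dynamics: Callable[[State, Action], State],
    threshold: float,
) -> bool:
    """Accept the trajectory only if every transition agrees with the model."""
    return all(
        dynamics_error(s_next, dynamics(s, a)) <= threshold
        for s, a, _reward, s_next in transitions
    )
```

With a toy additive model (`s_next = s + a`), a bridge whose states drift by 0.05 passes a threshold of 0.1, while one that jumps a full unit off the prediction is rejected.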

Novel Architectural Elements

Two-stage generation pipeline: Explicit step estimation via similarity search in imagined rollouts followed by constrained in-painting generation.
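
The in-painting idea can be illustrated with a toy loop: start from noise, refine iteratively, and overwrite the known entries (here the two endpoint states) after every refinement step. This is only a sketch of the clamping pattern. `smooth_step` is a hypothetical neighbor-averaging stand-in for a learned denoiser, and states are scalars for brevity.

```python
import random
from typing import Callable, Dict, List


def inpaint(
    known: Dict[int, float],
    length: int,
    denoise_step: Callable[[List[float], int], List[float]],
    num_steps: int,
    rng: random.Random,
) -> List[float]:
    """Refine a noisy sequence while clamping the known positions."""
    x = [rng.gauss(0.0, 1.0) for _ in range(length)]
    for t in reversed(range(num_steps)):
        x = denoise_step(x, t)
        for index, value in known.items():  # re-impose the conditioning
            x[index] = value
    return x


def smooth_step(x: List[float], t: int) -> List[float]:
    """Toy stand-in for a denoiser: average each interior point's neighbors."""
    out = list(x)
    for i in range(1, len(x) - 1):
        out[i] = 0.5 * (x[i - 1] + x[i + 1])
    return out


bridge = inpaint({0: 0.0, 3: 3.0}, 4, smooth_step, 60, random.Random(0))
```

Because the endpoints are reset after every step, the result always starts and ends at the requested states, and the number of free positions (set by the estimated Δ) fixes how many bridging states are generated.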

Modeling

Base Model: Diffusion model (Diffuser typically uses a U-Net; the exact architecture is not detailed)

Training Method: Supervised learning on offline dataset

Objective Functions:

Purpose: Train the generative model to reconstruct masked states.
Formally: Standard diffusion loss L_diff(θ) = E[|| ε - ε_θ(x_t, t) ||^2].

Purpose: Train the inverse dynamics model to predict actions.
Formally: L_inv(ψ) = E[(a_t - f_ψ(s_t, s_{t+1}))^2].

Purpose: Train the reward model.
Formally: L_rew(ϕ) = E[(r_t - f_ϕ(s_t, a_t, s_{t+1}))^2].

Purpose: Train the dynamics model for qualification.
Formally: L_dyn(ω) = E[|| s_{t+1} - f_ω(s_t, a_t) ||^2].
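
As a rough illustration of the diffusion loss, the sketch below builds a noisy sample x_t from a clean sample x_0 and measures the squared error of a noise prediction. It assumes the standard DDPM forward process x_t = sqrt(ᾱ_t) x_0 + sqrt(1 - ᾱ_t) ε, which the summary does not spell out. The helper names are hypothetical, and `mse` shows the same squared-error pattern used by the scalar reward loss.

```python
import math
import random
from typing import List, Sequence, Tuple


def add_noise(
    x0: Sequence[float], alpha_bar_t: float, rng: random.Random
) -> Tuple[List[float], List[float]]:
    """Return (x_t, eps) under the assumed DDPM forward process."""
    if not 0.0 < alpha_bar_t < 1.0:
        raise ValueError("alpha_bar_t must lie strictly between 0 and 1")
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    scale_x = math.sqrt(alpha_bar_t)
    scale_eps = math.sqrt(1.0 - alpha_bar_t)
    x_t = [scale_x * x + scale_eps * e for x, e in zip(x0, eps)]
    return x_t, eps


def diffusion_loss(eps: Sequence[float], eps_pred: Sequence[float]) -> float:
    """Squared L2 error between the true and the predicted noise."""
    return sum((e - p) ** 2 for e, p in zip(eps, eps_pred))


def mse(targets: Sequence[float], preds: Sequence[float]) -> float:
    """Mean squared error, e.g. for scalar reward predictions."""
    return sum((y - p) ** 2 for y, p in zip(targets, preds)) / len(targets)
```

A predictor that recovers ε exactly drives `diffusion_loss` to zero. In training, the expectation is approximated by averaging this quantity over sampled trajectories, timesteps, and noise draws.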

Training Data:

D4RL datasets (MuJoCo locomotion and AntMaze)

Compute: Not reported in the paper

Comparison to Prior Work

vs. SER: DiffStitch generates full coherent trajectories connecting specific start/end points, whereas SER generates independent transitions.
vs. TATU: DiffStitch is 'dual-way' (connects A to B), while TATU branches from A blindly. DiffStitch uses diffusion for stability over long horizons rather than pure model rollouts.
vs. Cosby [not cited in paper]: Cosby also performs stitching but relies on importance sampling weights rather than generative diffusion augmentation.

Limitations

Computational cost of training multiple models (diffusion, inverse dynamics, reward, dynamics) is likely high.
Step estimation relies on the generative model's horizon H; if the gap is larger than H, stitching may fail.
Success depends heavily on the quality of the learned dynamics model used for qualification; a poor model might filter good data or pass bad data.

Reproducibility

Code not provided. Implementation details for the specific diffusion architecture (e.g., specific U-Net config) are sparse, relying on references to prior work like Diffuser.

📊 Experiments & Results

Evaluation Setup

Offline RL benchmark tasks

Benchmarks:

D4RL: locomotion (Hopper, Walker2d, HalfCheetah) and navigation (AntMaze)

Metrics:

Normalized Average Return (Score)
Statistical methodology: Not explicitly reported in the paper

Key Results

DiffStitch consistently improves the performance of various offline RL algorithms (IQL, TD3+BC, DT) across D4RL datasets.

Benchmark	Metric	Baseline	This Paper	Δ
hopper-medium-expert-v2	Normalized Score	98.0	109.1	+11.1
walker2d-medium-replay-v2	Normalized Score	26.7	39.4	+12.7
antmaze-umaze-v0	Normalized Score	89.5	95.3	+5.8
hopper-medium-expert-v2	Normalized Score	107.6	109.2	+1.6
hopper-medium-v2	Normalized Score	78.4	98.7	+20.3
hopper-medium-v2	Normalized Score	91.2	98.7	+7.5

Experiment Figures

Comparison of different data augmentation methods (Original, SER, TATU, DiffStitch) visualized on a 2D maze.

Main Takeaways

DiffStitch serves as a plug-and-play data augmentation module that improves one-step (IQL), imitation-based (TD3+BC), and trajectory-optimization (DT) methods.
The method is particularly effective on 'medium' and 'medium-replay' datasets where optimal trajectories are fragmented or scarce.
Ablations confirm that estimating the correct number of stitching steps is critical; fixed-step stitching leads to out-of-distribution transitions that harm learning.
Filtering generated data via a dynamics model (Qualification Module) is essential to prevent learning from physically unrealistic hallucinations.

📚 Prerequisite Knowledge

Prerequisites

Offline Reinforcement Learning fundamentals
Diffusion Probabilistic Models (for sequence generation)
Inverse Dynamics Models

Key Terms

trajectory stitching: Creating a new trajectory by connecting two separate trajectory segments, allowing an agent to move from a state in one segment to a state in the other.

IQL: Implicit Q-Learning, an offline RL algorithm that avoids querying out-of-sample actions by treating the state value function as a random variable and estimating its upper expectile with expectile regression.

TD3+BC: Twin Delayed DDPG with Behavioral Cloning, an offline RL algorithm that adds a behavioral cloning regularization term to the policy update.

DT: Decision Transformer, an offline RL method that frames reinforcement learning as a sequence modeling problem using transformers.

inverse dynamics model: A model that predicts the action taken to transition between two given states (s_t, s_{t+1}).