DreamSmooth: Improving Model-based Reinforcement Learning via Reward Smoothing

📝 Paper Summary

Model-Based Reinforcement Learning (MBRL) · Reward Modeling · Sparse Rewards

DreamSmooth simplifies reward prediction in model-based RL by training the reward model on temporally smoothed rewards, making it easier to detect sparse or ambiguous signals.

Core Problem

In model-based RL, predicting exact sparse rewards at precise timesteps is extremely difficult due to ambiguity and partial observability, causing models to collapse and predict zero reward.

Why it matters:

Current state-of-the-art MBRL algorithms like DreamerV3 fail on tasks with sparse rewards because the reward model misses the signal entirely
Without an accurate reward model, the critic cannot learn properly, and the agent's policy fails to optimize for the actual task objective
Predicting the exact timestep at which a reward occurs is often unnecessary for successful planning; a rough estimate of when it occurs is sufficient

Concrete Example: In the 'RoboDesk' task, a large reward is given only when a block touches a bin. This moment is visually ambiguous and spans only a single timestep. Standard reward models fail to predict this spike, outputting near-zero values, so the agent never learns to push the block into the bin.

Key Novelty

Temporal Reward Smoothing for MBRL

Instead of training the reward model to predict the exact reward $r_t$ at step $t$, train it to predict a temporally smoothed version (e.g., Gaussian blur over neighboring steps)
This relaxes the learning objective: the model only needs to predict *roughly when* a reward occurs, rather than the exact ambiguous timestep, preventing model collapse on sparse signals
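One way to write the smoothed target is as a normalized weighted sum over neighboring timesteps. The window half-width $L$, the episode length $T$, the weights $f_i$, and the clipping of indices at episode boundaries are notation and assumptions introduced here for illustration, not details confirmed by this summary:

```latex
\tilde{r}_t \;=\; \sum_{i=-L}^{L} f_i \, r_{\operatorname{clip}(t+i,\,0,\,T)},
\qquad \sum_{i=-L}^{L} f_i = 1,
\qquad f_i \;\propto\; \exp\!\left(-\frac{i^2}{2\sigma^2}\right) \ \text{(Gaussian case)}
```

Because the weights sum to one, a single reward spike is spread over nearby steps without changing its total magnitude, provided the spike is far enough from the episode boundaries.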

Architecture

Illustration of three smoothing functions (Gaussian, Uniform, EMA) applied to a sparse reward signal

Evaluation Highlights

Achieves near 100% task completion on sparse-reward RoboDesk and Hand tasks where baseline DreamerV3 fails completely (0% success)
Improves reward prediction accuracy in Crafter (15/19 achievements predicted better), though this does not always translate to higher game scores
Maintains performance on standard dense-reward benchmarks (DeepMind Control Suite, Atari) without degradation

Breakthrough Assessment

7/10

Simple, highly effective fix for a specific but common failure mode (sparse/ambiguous rewards) in MBRL. While not a new architecture, it unlocks performance on tasks where SOTA methods previously failed.

⚙️ Technical Details

Problem Definition

Setting: Partially Observable Markov Decision Process (POMDP) where the agent learns a world model and reward model from experience to plan actions

Inputs: Sequence of observations $o_{\leq t}$ and actions $a_{< t}$

Outputs: Policy $\pi(a_t | o_{\leq t}, a_{< t})$ maximizing expected return

Pipeline Flow

Experience Collection (Agent interacts with environment)
Reward Smoothing (Pre-processing rewards in buffer)
World Model Training (Learning dynamics and smoothed rewards)
Policy Learning (Training actor/critic in imagination)
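The placement of the smoothing step in this pipeline can be sketched as a replay buffer that filters each finished episode's rewards before storing them. This is a minimal sketch under stated assumptions: `SmoothedReplayBuffer`, `moving_average`, and the dictionary keys are hypothetical names, smoothing is assumed to run once per completed episode, and the moving average is only a stand-in for whichever kernel is actually used.

```python
from collections import deque

import numpy as np


def moving_average(rewards, window=3):
    """Stand-in smoother (hypothetical); any temporal filter could be passed instead."""
    kernel = np.ones(window) / window
    return np.convolve(rewards, kernel, mode="same")


class SmoothedReplayBuffer:
    """Toy episode buffer that stores smoothed rewards as reward-model targets."""

    def __init__(self, smooth_fn, capacity=1000):
        self.smooth_fn = smooth_fn
        self.episodes = deque(maxlen=capacity)

    def add_episode(self, observations, actions, rewards):
        rewards = np.asarray(rewards, dtype=np.float64)
        self.episodes.append({
            "obs": np.asarray(observations),
            "action": np.asarray(actions),
            "reward": self.smooth_fn(rewards),  # target for the reward model
            "raw_reward": rewards,              # kept only for logging/evaluation
        })


buffer = SmoothedReplayBuffer(smooth_fn=moving_average)
raw = np.array([0.0, 0.0, 0.0, 1.0, 0.0, 0.0])
buffer.add_episode(observations=np.zeros((6, 4)), actions=np.zeros((6, 2)), rewards=raw)
episode = buffer.episodes[-1]
print(episode["raw_reward"])
print(episode["reward"])
```

World model training and policy learning would then read `reward` from the buffer unchanged, which is why the rest of the training loop needs no modification in this sketch.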

System Modules

Experience Collector

Interacts with the environment to collect trajectories $(o_t, a_t, r_t)$

Model or implementation: Environment Interface

Reward Smoother

Applies temporal smoothing to the reward sequence before storage/training

Model or implementation: Algorithmic Function (Gaussian, Uniform, or EMA)
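The three kernels named above can be sketched in plain NumPy as follows. The function names, the `truncate` window cutoff, and the edge handling (repeating the first and last reward) are assumptions made for this illustration; the default values mirror the hyperparameters listed in this summary.

```python
import numpy as np


def gaussian_smooth(rewards, sigma=3.0, truncate=4.0):
    """Symmetric Gaussian smoothing; uses both past and future rewards."""
    rewards = np.asarray(rewards, dtype=np.float64)
    radius = int(truncate * sigma + 0.5)
    offsets = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (offsets / sigma) ** 2)
    kernel /= kernel.sum()  # weights sum to one
    # Assumption: pad by repeating the edge values so the output keeps its length.
    padded = np.pad(rewards, radius, mode="edge")
    return np.convolve(padded, kernel, mode="valid")


def uniform_smooth(rewards, delta=9):
    """Symmetric moving average over a window of `delta` steps (odd)."""
    if delta % 2 == 0:
        raise ValueError("delta must be odd so the window is centered")
    rewards = np.asarray(rewards, dtype=np.float64)
    radius = delta // 2
    kernel = np.full(delta, 1.0 / delta)
    padded = np.pad(rewards, radius, mode="edge")
    return np.convolve(padded, kernel, mode="valid")


def ema_smooth(rewards, alpha=0.3):
    """Causal exponential moving average; uses only current and past rewards."""
    rewards = np.asarray(rewards, dtype=np.float64)
    smoothed = np.zeros_like(rewards)
    running = 0.0
    for t, r in enumerate(rewards):
        running = alpha * r + (1.0 - alpha) * running
        smoothed[t] = running
    return smoothed


# A single sparse reward at step 40 of a 64-step episode.
rewards = np.zeros(64)
rewards[40] = 1.0
for name, fn in [("gaussian", gaussian_smooth), ("uniform", uniform_smooth), ("ema", ema_smooth)]:
    print(name, np.round(fn(rewards)[36:45], 3))
```

The Gaussian and uniform outputs are centered on step 40 and spread to both sides, while the EMA output starts at step 40 and decays afterwards.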

World Model Learner

Learns to predict latent dynamics and the *smoothed* rewards

Model or implementation: DreamerV3 World Model (RSSM-based)

Policy Optimizer

Optimizes policy using values derived from predicted smoothed rewards

Model or implementation: Actor-Critic (DreamerV3 default)

Novel Architectural Elements

Decoupling of ground-truth rewards from training targets: The reward model is explicitly trained on a processed (smoothed) signal rather than the raw environmental signal, which makes the supervised reward-prediction problem easier to optimize

Modeling

Base Model: DreamerV3 (Small, Large, XL variants depending on task)

Training Method: World Model Learning + Actor-Critic

Objective Functions:

Purpose: Train reward model to predict smoothed rewards.

Formally: Mean Squared Error (or SymLog loss) between model prediction $R_\theta(z_t)$ and smoothed reward $\tilde{r}_t$.
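As a rough illustration of this objective, the sketch below computes a mean squared error against smoothed targets, optionally in symlog space. The `reward_loss` name and the `use_symlog` flag are hypothetical, and the symlog variant assumes the model's output is interpreted in symlog space; DreamerV3's actual reward head may differ from this simplified form.

```python
import numpy as np


def symlog(x):
    """Symmetric log transform: sign(x) * ln(|x| + 1)."""
    x = np.asarray(x, dtype=np.float64)
    return np.sign(x) * np.log1p(np.abs(x))


def reward_loss(predicted, smoothed_target, use_symlog=False):
    """Mean squared error between predictions and smoothed reward targets.

    With use_symlog=True the target is squashed first, so `predicted` is
    assumed to already live in symlog space.
    """
    predicted = np.asarray(predicted, dtype=np.float64)
    target = np.asarray(smoothed_target, dtype=np.float64)
    if use_symlog:
        target = symlog(target)
    return float(np.mean((predicted - target) ** 2))


# A model that always predicts zero, scored against a smoothed spike.
smoothed_target = np.array([0.0, 0.1, 0.2, 0.4, 0.2, 0.1, 0.0])
print(reward_loss(np.zeros(7), smoothed_target))
print(reward_loss(np.zeros(7), smoothed_target, use_symlog=True))
```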

Key Hyperparameters:

sigma: 3.0 (Gaussian smoothing standard deviation)
alpha: 0.3 (EMA smoothing factor)
delta: 9 (Uniform smoothing window size)
batch_size: 16 (sequences)
sequence_length: 64
optimizer: Adam

Compute: Single GPU (NVIDIA A5000, V100, RTX Titan, or RTX 2080). Training time ranges from 6 hours (Atari/DMC) to 150 hours (Earthmoving).

Comparison to Prior Work

vs. DreamerV3: Identical architecture, but trains reward model on smoothed targets. DreamerV3 fails on sparse/ambiguous rewards; DreamSmooth succeeds.
vs. Reward Shaping [not cited in paper]: Reward shaping typically involves adding domain-specific potentials based on state. DreamSmooth is domain-agnostic temporal smoothing of the scalar reward signal itself.
vs. Value Function Smoothing [not cited in paper]: Value functions smooth rewards over time via discounting. DreamSmooth smooths the *immediate reward target* itself to aid the supervised learning of the reward predictor.

Limitations

Can degrade performance in environments with many dense reward sources (e.g., Crafter) due to false positive predictions from symmetric smoothing kernels
Smoothing with future rewards (Gaussian/Uniform) theoretically breaks optimality guarantees as rewards become policy-dependent (though works empirically)
Requires selecting a smoothing kernel and hyperparameter (sigma/alpha), though results appear relatively robust to these choices
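The distinction between symmetric and causal kernels in the limitations above can be checked numerically: perturb a reward after step `t` and see whether the smoothed targets at or before `t` change. This is an illustrative sketch; the helper `uses_future_rewards`, the compact smoother definitions, and the edge padding are assumptions made here, not the paper's code.

```python
import numpy as np


def gaussian_smooth(rewards, sigma=3.0):
    """Symmetric Gaussian filter (edge padding is an assumption)."""
    radius = int(4 * sigma + 0.5)
    offsets = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (offsets / sigma) ** 2)
    kernel /= kernel.sum()
    return np.convolve(np.pad(rewards, radius, mode="edge"), kernel, mode="valid")


def ema_smooth(rewards, alpha=0.3):
    """Causal exponential moving average."""
    out, running = np.zeros(len(rewards)), 0.0
    for t, r in enumerate(rewards):
        running = alpha * r + (1 - alpha) * running
        out[t] = running
    return out


def uses_future_rewards(smooth_fn, horizon=64, t=30):
    """Return True if targets up to step t change when only a later reward changes."""
    base = np.zeros(horizon)
    perturbed = base.copy()
    perturbed[t + 1] = 1.0  # modify a reward strictly after step t
    return not np.allclose(smooth_fn(base)[: t + 1], smooth_fn(perturbed)[: t + 1])


print("gaussian depends on future:", uses_future_rewards(gaussian_smooth))
print("ema depends on future:", uses_future_rewards(ema_smooth))
```

A target that depends on future rewards also depends on what the agent does next, which is the sense in which symmetric smoothing makes the reward policy-dependent and can produce predictions of reward before an event has actually happened.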

Reproducibility

Code: https://github.com/vintlee/dreamsmooth

The code is publicly available at the repository above. The implementation is simple (one line of code added to buffer storage), and hyperparameters are provided for all tasks.

📊 Experiments & Results

Evaluation Setup

Sparse reward tasks requiring sequential manipulation or navigation, plus standard dense reward benchmarks

Benchmarks:

RoboDesk: multi-stage manipulation (sparse reward)
Shadow Hand: dexterous manipulation (sparse reward)
Earthmoving: navigation and manipulation with granular media (sparse reward)
Crafter: open-world survival (sparse reward)
DeepMind Control Suite (DMC): continuous control (dense reward)
Atari 100k: arcade games (dense/sparse mix)

Metrics:

Tasks Completed
Success Rate
Episode Return (Score)
Statistical methodology: Not explicitly reported in the paper

Key Results

DreamSmooth enables learning on sparse-reward tasks where the baseline DreamerV3 completely fails.

Benchmark	Metric	Baseline (DreamerV3)	DreamSmooth	Δ
RoboDesk	Tasks Completed	0	3	+3
Shadow Hand	Tasks Completed	0	3	+3
Earthmoving	Number of Rocks Dumped	0	2	+2

DreamSmooth maintains performance on dense-reward benchmarks, showing it is a safe default.

Benchmark	Metric	Baseline (DreamerV3)	DreamSmooth	Δ
DeepMind Control Suite	Score (x1000)	0.75	0.75	0.00
Atari 100k	Human Normalized Score	0.35	0.35	0.00

Experiment Figures

Comparison of Ground Truth rewards vs. DreamerV3 predicted rewards on sparse tasks

Comparison of Ground Truth, Smoothed Rewards, and DreamSmooth Predicted Rewards

Main Takeaways

Reward smoothing fixes the 'bottleneck' of reward prediction in MBRL: simple smoothing allows models to detect sparse rewards that were previously ignored.
The method is robust to hyperparameter choices (sigma, alpha) within a reasonable range.
Gaussian (symmetric) smoothing generally works best for manipulation, but EMA (causal) smoothing is safer for environments like Crafter where predicting future rewards might cause false positives.
Increasing model size or oversampling sparse rewards helps slightly but is far less effective than simple reward smoothing.

📚 Prerequisite Knowledge

Prerequisites

Model-Based Reinforcement Learning (MBRL)
World Models (Latent Dynamics Models)
Reward Functions and Sparse Rewards

Key Terms

MBRL: Model-Based Reinforcement Learning—RL agents that learn a simulation of the environment (world model) to plan actions in 'imagination'

World Model: A neural network that predicts future states and rewards given current states and actions

DreamerV3: State-of-the-art MBRL algorithm that learns latent dynamics and rewards to train an actor-critic policy purely from imagined trajectories

Reward Smoothing: Applying a filter (like a Gaussian blur or moving average) to the sequence of scalar rewards in the replay buffer before training the reward model

Sparse Rewards: Reward signals that are zero for most timesteps and non-zero only upon completing specific events, making them hard to learn

EMA: Exponential Moving Average—a smoothing technique where the current value is a weighted average of the current observation and previous history

POMDP: Partially Observable Markov Decision Process—an environment where the agent cannot see the full state (e.g., seeing only camera pixels, not object coordinates)

TD-MPC: Temporal Difference Learning for Model Predictive Control—an MBRL algorithm that learns a value function and plans actions using a learned latent model

MBPO: Model-Based Policy Optimization—an algorithm that uses short model-generated rollouts to augment real data for policy training