MBRL: Model-Based Reinforcement Learning—a family of RL methods in which the agent learns a simulation of the environment (a world model) and plans actions in "imagination"
World Model: A neural network that predicts future states and rewards given current states and actions
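In practice a world model is a learned neural network, but its interface can be illustrated with a minimal tabular version for discrete states and actions. The sketch below is illustrative only (the class and method names are invented for this example): it stores observed transitions and chains predictions to "imagine" a trajectory without touching the real environment.

```python
class TabularWorldModel:
    """Toy world model for discrete states/actions: remembers the most
    recently observed (next_state, reward) for each (state, action) pair."""

    def __init__(self):
        self.table = {}

    def update(self, state, action, next_state, reward):
        # Learn from one real transition.
        self.table[(state, action)] = (next_state, reward)

    def predict(self, state, action):
        # Returns (next_state, reward), or None for unseen pairs.
        return self.table.get((state, action))

    def rollout(self, state, actions):
        """Imagine a trajectory by chaining predictions, stopping at the
        first transition the model has never seen."""
        trajectory = []
        for a in actions:
            pred = self.predict(state, a)
            if pred is None:
                break
            state, reward = pred
            trajectory.append((state, reward))
        return trajectory
```

A real MBRL world model replaces the lookup table with a neural network that generalizes to unseen states, but the plan-by-prediction loop is the same idea.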
DreamerV3: State-of-the-art MBRL algorithm that learns latent dynamics and rewards to train an actor-critic policy purely from imagined trajectories
Reward Smoothing: Applying a smoothing filter (such as a Gaussian kernel or a moving average) to the sequence of scalar rewards in the replay buffer before training the reward model
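A minimal sketch of the moving-average variant, assuming rewards are stored as a plain list (the function name and window default are invented for illustration). A centered window spreads an isolated reward spike onto its neighbors, which can give a sparse-reward signal some temporal support:

```python
def smooth_rewards(rewards, window=3):
    """Centered moving average over a reward sequence.

    Near the boundaries the window is truncated, so the average is taken
    over however many timesteps actually fall inside the sequence.
    """
    n = len(rewards)
    half = window // 2
    smoothed = []
    for t in range(n):
        lo, hi = max(0, t - half), min(n, t + half + 1)
        smoothed.append(sum(rewards[lo:hi]) / (hi - lo))
    return smoothed
```

For example, a single spike `[0, 0, 3, 0, 0]` with `window=3` becomes `[0, 1, 1, 1, 0]`: the total reward mass is redistributed across adjacent timesteps rather than concentrated on one.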
Sparse Rewards: Reward signals that are zero for most timesteps and non-zero only upon completing specific events, making them hard to learn from
EMA: Exponential Moving Average—a smoothing technique where the smoothed value at each step is a weighted average of the current observation and the previous smoothed value
POMDP: Partially Observable Markov Decision Process—an environment where the agent cannot see the full state (e.g., seeing only camera pixels, not object coordinates)
TD-MPC: Temporal Difference Learning for Model Predictive Control—an MBRL algorithm that learns a value function and plans actions using a learned latent model
MBPO: Model-Based Policy Optimization—an algorithm that uses short model-generated rollouts to augment real data for policy training
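The core MBPO trick, branching short model rollouts from real replay-buffer states, can be sketched as follows. This is a simplified illustration under stated assumptions: `model` maps (state, action) to (next_state, reward) or None, `policy` maps a state to an action, and the function name and parameters are invented for this example (the actual algorithm uses an ensemble of probabilistic networks and SAC as the policy optimizer).

```python
import random

def augment_with_model_rollouts(real_buffer, model, policy,
                                k=3, n_starts=2, seed=0):
    """Generate synthetic transitions via short k-step model rollouts
    branched from states sampled out of the real replay buffer.

    real_buffer: list of (state, action, reward, next_state) tuples.
    Returns a list of synthetic transitions in the same format.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_starts):
        s = rng.choice(real_buffer)[0]  # branch from a real state
        for _ in range(k):
            a = policy(s)
            pred = model(s, a)
            if pred is None:  # model cannot predict this pair; stop early
                break
            s_next, r = pred
            synthetic.append((s, a, r, s_next))
            s = s_next
    return synthetic
```

Keeping k small is the key design choice: model error compounds over a rollout, so short branches keep the synthetic data close to the real distribution while still multiplying the effective amount of training data.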