Model-Based RL: RL methods that learn a model of the environment's dynamics (transitions and rewards) and often use it for planning
Model-Free RL: RL methods that learn a policy or value function directly from experience, without explicitly modeling the environment's dynamics
TD3: Twin Delayed DDPG—a standard model-free algorithm for continuous control that uses two critics to reduce overestimation bias
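The twin-critic idea can be sketched as a clipped double-Q target: bootstrap from the minimum of the two critics' next-state estimates so that overestimation in either critic is damped. The function and argument names below are illustrative, not TD3's actual API:

```python
def twin_target(reward, gamma, q1_next, q2_next, done):
    """Clipped double-Q target: bootstrap from the smaller of two
    critic estimates to reduce overestimation bias (as in TD3)."""
    return reward + gamma * (1.0 - done) * min(q1_next, q2_next)
```

For example, with reward 1.0, discount 0.99, and critic estimates 2.0 and 3.0, the target uses 2.0, giving 1.0 + 0.99 * 2.0 = 2.98.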
DreamerV3: A state-of-the-art general-purpose model-based RL algorithm that learns a world model from pixels
TD-MPC2: A general-purpose model-based algorithm that uses temporal difference learning for model predictive control
Bisimulation: A mathematical concept where two states are considered equivalent if they have the same immediate reward and transition to equivalent next states
Two-hot encoding: A categorical representation of a scalar value (like reward) using probability mass distributed between the two nearest discrete bins
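A minimal sketch of two-hot encoding, assuming a fixed, sorted array of bin centers: the scalar's probability mass is split between its two neighboring bins in proportion to its distance from each, so the expected bin value recovers the scalar exactly.

```python
import numpy as np

def two_hot(value, bins):
    """Encode a scalar as probability mass on the two nearest bin
    centers; `bins` is a sorted 1-D array of bin centers (assumed)."""
    value = float(np.clip(value, bins[0], bins[-1]))
    idx = int(np.searchsorted(bins, value, side="right")) - 1
    idx = min(idx, len(bins) - 2)          # keep a valid upper neighbor
    lo, hi = bins[idx], bins[idx + 1]
    weight_hi = (value - lo) / (hi - lo)   # linear interpolation weight
    encoding = np.zeros(len(bins))
    encoding[idx] = 1.0 - weight_hi
    encoding[idx + 1] = weight_hi
    return encoding
```

With bins [0, 1, 2, 3], the value 1.25 encodes as [0, 0.75, 0.25, 0], and the dot product with the bin centers gives back 1.25.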
Huber loss: A loss function that is quadratic for small errors and linear for large errors, less sensitive to outliers than Mean Squared Error
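The piecewise behavior of the Huber loss can be written compactly by splitting the absolute error at the threshold delta; this is the standard formulation, with delta as a tunable parameter:

```python
import numpy as np

def huber(error, delta=1.0):
    """Quadratic for |error| <= delta, linear beyond: the two pieces
    meet smoothly at |error| = delta."""
    abs_err = np.abs(error)
    quadratic = np.minimum(abs_err, delta)   # capped at delta
    linear = abs_err - quadratic             # overflow past delta
    return 0.5 * quadratic ** 2 + delta * linear
```

A small error of 0.5 gives the MSE-like value 0.125, while a large error of 3.0 grows only linearly, giving 2.5 instead of MSE's 4.5.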
LAP: Loss-Adjusted Prioritized Experience Replay—a method that samples training data with probability proportional to the magnitude of the TD error, adjusted to pair with the Huber loss
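A sketch of TD-error-proportional sampling in the spirit of LAP, assuming its convention of clipping priorities below 1 (so transitions with small errors are sampled uniformly rather than starved); the function name and exponent value are illustrative:

```python
import numpy as np

def sample_indices(td_errors, batch_size, alpha=0.4, rng=None):
    """Sample transition indices with probability proportional to a
    clipped power of the absolute TD error (min priority 1, an
    assumption matching LAP's convention)."""
    rng = rng or np.random.default_rng()
    priorities = np.maximum(np.abs(td_errors), 1.0) ** alpha
    probs = priorities / priorities.sum()
    return rng.choice(len(td_errors), size=batch_size, p=probs)
```

Transitions with large TD errors are drawn more often, focusing updates on poorly fit regions of the replay buffer.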
EMA: Exponential Moving Average—a technique to update target network parameters slowly over time to stabilize training
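The EMA target update is a one-line interpolation between the target and online parameters; tau is the update rate, with small values (e.g. 0.005) giving a slowly moving target:

```python
def ema_update(target_params, online_params, tau=0.005):
    """Soft target update: theta_target <- (1 - tau) * theta_target
    + tau * theta_online, applied parameter-wise."""
    return [(1.0 - tau) * t + tau * o
            for t, o in zip(target_params, online_params)]
```

With tau = 0.005, the target network tracks the online network with an effective averaging window of roughly 1/tau = 200 updates, which smooths out the noisy bootstrap targets.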