PbRL: Preference-based Reinforcement Learning—learning a policy from preference feedback (typically comparisons between pairs of trajectory segments) rather than a pre-defined reward signal.
Credit Assignment Problem: The challenge of determining which specific actions or states in a sequence are responsible for the final outcome (reward or preference).
TWM: Transformer-based World Model—a specific architecture used to model environment dynamics using attention mechanisms.
Hindsight: Analyzing events after they have occurred; here, determining which states were important for a trajectory after observing the full sequence.
Bradley-Terry model: A statistical model used to predict the probability that one item is preferred over another based on their underlying values (returns).
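Under the Bradley-Terry model, the probability that segment A is preferred over segment B is a softmax over their returns, which reduces to a sigmoid of the return difference. A minimal sketch (the function name and numerically stable form are illustrative, not from the source):

```python
import numpy as np

def bradley_terry_prob(return_a: float, return_b: float) -> float:
    """P(A preferred over B) = exp(R_A) / (exp(R_A) + exp(R_B)).

    Written as a sigmoid of the return difference for numerical stability.
    """
    return 1.0 / (1.0 + np.exp(-(return_a - return_b)))

# Equal returns give a 50/50 preference; a higher return raises the probability.
print(bradley_terry_prob(1.0, 1.0))  # 0.5
```

In PbRL, this probability is typically used as the likelihood in a cross-entropy loss that fits a reward model to human preference labels.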
Return Redistribution: A technique to re-allocate the total return of a trajectory to its constituent steps based on some criteria (here, attention weights).
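A sketch of return redistribution using attention weights as the criterion: normalize the per-step weights into a distribution over timesteps, then allocate the trajectory's total return proportionally. The function and weight source are assumptions for illustration, not the paper's exact procedure.

```python
import numpy as np

def redistribute_return(total_return: float, attention_weights) -> np.ndarray:
    """Split a trajectory's total return across its steps.

    Steps that received more attention (e.g. from a transformer world
    model) are credited with a larger share of the return.
    """
    w = np.asarray(attention_weights, dtype=float)
    w = w / w.sum()          # normalize to a distribution over timesteps
    return total_return * w  # per-step redistributed rewards

# A step with twice the attention receives twice the credit;
# the per-step rewards always sum back to the original return.
per_step = redistribute_return(10.0, [1.0, 1.0, 2.0])
print(per_step)  # [2.5 2.5 5. ]
```

This turns a single sparse trajectory-level signal into a dense per-step reward, which is the point of contact with the credit assignment problem defined above.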
PEBBLE: A baseline PbRL algorithm that uses unsupervised pre-training and off-policy learning.
MetaWorld: A benchmark suite of robotic manipulation tasks.
DMC: DeepMind Control Suite—a set of physics-based simulation tasks for RL.