Offline RL: Reinforcement learning that learns a policy exclusively from a fixed dataset without interacting with the environment during training
Policy Extraction: The process of deriving an actionable policy (actor) from a learned value function (critic)
Value Function: A function estimating the expected future rewards from a given state or state-action pair
AWR: Advantage-Weighted Regression—a policy extraction method that casts RL as supervised learning on dataset actions, with each sample weighted by its exponentiated advantage
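A minimal numpy sketch of the AWR objective, assuming precomputed critic outputs; the function names, the temperature `beta`, and the weight clipping are illustrative choices, not part of the definition above:

```python
import numpy as np

def awr_weights(q_values, v_values, beta=1.0, max_weight=20.0):
    """Per-sample regression weights w = exp(A / beta), A = Q - V.

    Clipping the weights (a common practical trick) keeps rare
    high-advantage samples from dominating the update.
    """
    advantages = q_values - v_values
    return np.minimum(np.exp(advantages / beta), max_weight)

def awr_loss(log_probs, q_values, v_values, beta=1.0):
    """Advantage-weighted negative log-likelihood: behavioral cloning
    of dataset actions, reweighted toward high-advantage samples."""
    w = awr_weights(q_values, v_values, beta)
    return -np.mean(w * log_probs)
```

Because the loss is just a weighted log-likelihood, the policy update never evaluates actions outside the dataset.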
DDPG+BC: Deep Deterministic Policy Gradient with Behavioral Cloning—a policy extraction method that improves the policy via gradients from the value function while a behavioral cloning term keeps it close to the data distribution
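A sketch of the DDPG+BC actor objective under the usual formulation (maximize the critic's value of the policy's actions, plus an L2 behavioral cloning penalty); argument names and the coefficient `alpha` are assumptions for illustration:

```python
import numpy as np

def ddpg_bc_loss(q_pi, policy_actions, dataset_actions, alpha=1.0):
    """Actor loss: -Q(s, pi(s)) + alpha * ||pi(s) - a_data||^2.

    q_pi:            critic values Q(s, pi(s)), shape (N,)
    policy_actions:  actions produced by the policy, shape (N, dim)
    dataset_actions: actions from the offline dataset, shape (N, dim)
    alpha:           trades off value improvement vs. staying in-distribution
    """
    bc_term = np.mean(np.sum((policy_actions - dataset_actions) ** 2, axis=-1))
    return -np.mean(q_pi) + alpha * bc_term
```

Unlike AWR, this objective differentiates through the critic, so the policy can move toward actions the critic prefers even if they are rare in the data; the BC term bounds how far it strays.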
IQL: Implicit Q-Learning—a method that learns value functions using expectile regression to avoid querying out-of-distribution actions
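The expectile regression at the core of IQL can be sketched as an asymmetric squared loss; the function name is illustrative:

```python
import numpy as np

def expectile_loss(diff, tau=0.9):
    """Asymmetric squared loss L_tau(u) = |tau - 1[u < 0]| * u^2.

    diff = Q(s, a) - V(s) over dataset actions. With tau > 0.5,
    positive errors are penalized more, so V is pushed toward an
    upper expectile of Q -- approximating a max over actions without
    ever querying actions outside the dataset.
    """
    weight = np.where(diff < 0, 1.0 - tau, tau)
    return np.mean(weight * diff ** 2)
```

At tau = 0.5 this reduces to ordinary mean squared error; as tau approaches 1, V approaches the maximum of Q over in-distribution actions.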
Data-scaling matrices: Visualizations showing how performance changes as the amount of data used for value learning vs. policy learning is varied independently
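How such a matrix is assembled can be sketched as a grid sweep; `run_experiment` is a hypothetical stand-in for a full offline RL training-and-evaluation run:

```python
import numpy as np

def data_scaling_matrix(fractions, run_experiment):
    """Entry (i, j) holds the final performance when the value function
    is trained on fractions[i] of the data and the policy is extracted
    using fractions[j] of the data, varied independently.

    run_experiment(f_value, f_policy) -> scalar performance (hypothetical).
    """
    n = len(fractions)
    mat = np.zeros((n, n))
    for i, f_value in enumerate(fractions):
        for j, f_policy in enumerate(fractions):
            mat[i, j] = run_experiment(f_value, f_policy)
    return mat
```

Reading the matrix along rows vs. columns shows whether performance is bottlenecked by value-learning data or by policy-extraction data.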
Generalization gap: The difference in performance between states seen in the training dataset (in-distribution) and novel states encountered during evaluation (out-of-distribution)
Test-time training: Updating the model parameters during the evaluation phase (deployment) based on the specific states encountered