Offline RL: Reinforcement learning that learns a policy from a fixed dataset without interacting with the environment
TD learning: Temporal Difference learning—a method to estimate value functions by bootstrapping from current estimates
Horizon: The number of steps required to reach a goal or the length of the decision-making sequence
n-step returns: Calculating returns by summing rewards over n steps before bootstrapping, reducing the number of bootstrapping steps (and thus bias accumulation)
Goal-conditioned RL: RL where the agent must learn to reach various goal states specified as input
Hierarchical RL: Decomposing a complex task into high-level subgoals and low-level actions to simplify learning
SHARSA: The proposed method; combines n-step SARSA (for value learning) with hierarchical behavioral cloning (for policy learning)
Bias accumulation: The compounding of small errors in value estimation at each step of TD learning, which grows with the horizon length
IQL: Implicit Q-Learning—an offline RL method that avoids querying out-of-sample actions by using expectile regression
SAC+BC: Soft Actor-Critic with Behavioral Cloning regularization—a standard offline RL baseline
CRL: Contrastive RL—a method that learns goal-conditioned value estimates via contrastive representation learning
Flow BC: Flow Behavioral Cloning—behavioral cloning that models the policy as a conditional flow-based generative model
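The connection between n-step returns, bootstrapping, and bias accumulation can be made concrete with a small sketch. This is an illustrative example, not code from the paper: `n_step_sarsa_target` is a hypothetical helper computing the target used in n-step SARSA-style value learning. The key point is that the value estimate is only bootstrapped once per n environment steps, so over a horizon of H steps the estimation error compounds roughly H/n times instead of H times.

```python
# Illustrative sketch (not from the paper): the target for an n-step
# SARSA-style update of Q(s_t, a_t).

def n_step_sarsa_target(rewards, q_boot, gamma, n):
    """n-step return: sum of n discounted rewards, then a single
    discounted bootstrap from the current value estimate.

    rewards: the n observed rewards r_t, ..., r_{t+n-1}
    q_boot:  current estimate Q(s_{t+n}, a_{t+n}) (the bootstrap term)
    gamma:   discount factor
    n:       number of real-reward steps before bootstrapping
    """
    g = sum((gamma ** k) * r for k, r in enumerate(rewards[:n]))
    return g + (gamma ** n) * q_boot


# With n = 1 this reduces to the ordinary one-step TD (SARSA) target
# r_t + gamma * Q(s_{t+1}, a_{t+1}); larger n trades added return
# variance for fewer bootstrap steps, i.e. less bias accumulation.
print(n_step_sarsa_target([1.0, 0.0], 4.0, 0.5, 2))  # 1 + 0 + 0.25*4 = 2.0
```

With gamma = 1 and n equal to the remaining horizon, the bootstrap term's weight stays but the number of bootstrapped estimates in the chain shrinks to one, which is the horizon-reduction effect the glossary's "bias accumulation" entry refers to.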