Teacher Forcing (TF): A training method where the model predicts the next token using the ground truth history rather than its own previous predictions
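A minimal sketch of teacher forcing on a toy bigram model (the table and function names here are illustrative, not from the source): the loss at each step conditions on the ground-truth previous token, never on the model's own sampled output.

```python
import math

# Hypothetical toy model: fixed bigram log-probabilities standing in for a
# learned next-token distribution.
bigram_logprob = {
    ("<s>", "the"): math.log(0.9), ("<s>", "a"): math.log(0.1),
    ("the", "cat"): math.log(0.6), ("the", "dog"): math.log(0.4),
    ("cat", "sat"): math.log(0.8), ("cat", "ran"): math.log(0.2),
}

def teacher_forcing_nll(tokens):
    """Negative log-likelihood where every prediction is conditioned on the
    ground-truth previous token (teacher forcing), not a model sample."""
    nll = 0.0
    for prev, nxt in zip(tokens, tokens[1:]):
        nll -= bigram_logprob[(prev, nxt)]  # history comes from the data
    return nll

loss = teacher_forcing_nll(["<s>", "the", "cat", "sat"])
```

At inference time there is no ground truth to feed back, so the model consumes its own outputs instead; the mismatch between the two regimes is the well-known exposure-bias issue.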
Offline RL: Reinforcement learning that learns a policy exclusively from a static dataset of previously collected experiences without interacting with the environment
Decision Transformer (DT): An offline RL method that treats RL as a sequence modeling problem, conditioning generation on a desired return-to-go token so that actions are predicted to achieve the specified return
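A sketch of how DT-style training sequences can be laid out (the tuple format and function name are illustrative assumptions, not the paper's implementation): return-to-go, state, and action tokens are interleaved so a sequence model learns to predict actions given a target return.

```python
def dt_training_sequence(trajectory):
    """Build an interleaved (return-to-go, state, action) token sequence from a
    trajectory of (state, action, reward) steps, Decision-Transformer style."""
    rtg = sum(reward for _, _, reward in trajectory)  # return-to-go at t=0
    seq = []
    for state, action, reward in trajectory:
        seq += [("R", rtg), ("s", state), ("a", action)]
        rtg -= reward  # remaining return after taking this step
    return seq

seq = dt_training_sequence([(0, 1, 1.0), (1, 0, 2.0)])
```

At test time, the first return token is simply set to a high desired value, and the model generates actions conditioned on it.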
ILQL: Implicit Language Q-Learning—an offline RL algorithm that learns value functions from a static dataset and defines a policy implicitly from them, without training an explicit actor
TF Top: A simple baseline that fine-tunes a model using teacher forcing only on the subset of data trajectories that achieved high rewards
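The filtering step of such a baseline can be sketched as follows (the quantile threshold and names are assumptions for illustration); the surviving trajectories are then used for ordinary teacher-forced fine-tuning.

```python
def tf_top_subset(trajectories, quantile=0.9):
    """Keep only the trajectories whose reward falls at or above the given
    empirical quantile; these form the fine-tuning set for TF Top."""
    rewards = sorted(t["reward"] for t in trajectories)
    cutoff = rewards[int(quantile * (len(rewards) - 1))]
    return [t for t in trajectories if t["reward"] >= cutoff]

data = [{"text": f"traj-{i}", "reward": float(i)} for i in range(10)]
top = tf_top_subset(data, quantile=0.9)
```

The appeal of this baseline is that it needs no value function or policy gradient: reward only enters through data selection.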
BERTScore: An automatic evaluation metric that computes semantic similarity between generated text and reference text using contextual embeddings
PPO: Proximal Policy Optimization—an online policy gradient method that updates policies by interacting with the environment
Quark: A Decision-Transformer-style method with an iterative outer loop that collects new data by sampling from the current model, making it an online variant of reward-conditioned training
Expectile Regression: A generalized form of regression used in ILQL to estimate upper expectiles of the value distribution, which approximate the maximum value as the expectile parameter approaches 1