SFT: Supervised Fine-Tuning—training a model to maximize the likelihood of the next token in a provided dataset
iw-SFT: Importance Weighted SFT—the proposed method, which weights each SFT example by the ratio of the current policy's probability to the reference policy's probability
SFT(Q): SFT from quality-sampled data—a variant in which training data are sampled in proportion to quality scores (e.g., star ratings)
RL: Reinforcement Learning—training an agent to maximize cumulative rewards through trial and error
BC: Behavior Cloning—a form of imitation learning in which a policy is trained to mimic an expert's actions (equivalent to SFT)
RWR: Reward Weighted Regression—an RL algorithm that weights training examples by their rewards
KL divergence: Kullback-Leibler divergence—a measure of how one probability distribution differs from a second, reference probability distribution
sparse reward: A setting in which the agent receives non-zero feedback only rarely, often just a binary success/failure signal at the end of a task
PPO: Proximal Policy Optimization—a popular RL algorithm that uses a clipped objective to ensure stable updates
DPO: Direct Preference Optimization—a method to align language models to preferences without explicit reward modeling
IQL: Implicit Q-Learning—an offline RL algorithm
AWAC: Advantage Weighted Actor Critic—an offline RL algorithm
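The iw-SFT entry above can be illustrated with a minimal sketch of a per-example importance-weighted loss. This is not the authors' implementation; the function name, the use of log-probabilities as inputs, and the clipping safeguard are all illustrative assumptions. Setting the weight to 1 recovers plain SFT (the standard negative log-likelihood).

```python
import math

def iw_sft_loss(logp_current, logp_reference, clip=5.0):
    """Importance-weighted SFT loss for one example (illustrative sketch).

    logp_current:   log-probability of the example under the policy being trained
    logp_reference: log-probability of the example under the frozen reference policy
    clip:           assumed safeguard capping large ratios for stability
    """
    # Importance weight: ratio of current to reference probability,
    # computed in log space for numerical stability.
    w = math.exp(logp_current - logp_reference)
    w = min(w, clip)
    # Weighted negative log-likelihood; w == 1 reduces to plain SFT.
    return -w * logp_current
```

When current and reference policies agree on an example, the weight is exactly 1 and the loss matches ordinary SFT; examples the current policy already prefers over the reference receive proportionally larger weight.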
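The KL divergence entry can likewise be made concrete. For two discrete distributions P and Q over the same support, D_KL(P || Q) = Σ_x p(x) log(p(x)/q(x)); it is zero exactly when the distributions coincide. A short sketch (function name is mine, not from the source):

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) for discrete distributions given as probability lists.

    Terms with p(x) == 0 contribute 0 by the usual convention.
    Assumes q(x) > 0 wherever p(x) > 0.
    """
    return sum(pi * math.log(pi / qi)
               for pi, qi in zip(p, q) if pi > 0)
```

Note the asymmetry: D_KL(P || Q) generally differs from D_KL(Q || P), which is why the second argument is singled out as the reference distribution.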