PHF: Pretraining with Human Feedback—incorporating human preferences (via reward models) directly into the pretraining objective, rather than applying feedback only during finetuning.
MLE: Maximum Likelihood Estimation—the standard pretraining objective where the model maximizes the probability of the next token in the training data.
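A minimal sketch of the MLE objective as a per-token negative log-likelihood; the function name and plain-list interface are illustrative stand-ins for a real softmax-over-vocabulary computation:

```python
import math

def mle_loss(token_probs):
    """Average negative log-likelihood of the observed next tokens.

    token_probs: the probability the model assigned to each actual
    next token in a training segment (a toy stand-in for softmax output).
    Minimizing this pushes each probability toward 1.
    """
    return -sum(math.log(p) for p in token_probs) / len(token_probs)
```

For example, a model that assigns probability 0.5 to every observed token incurs a loss of ln 2 per token.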
Conditional Training: A technique where control tokens (e.g., <|good|>, <|bad|>) are prepended to text segments based on their reward score, teaching the model to distinguish quality.
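A sketch of the data-annotation step for conditional training, assuming a simple reward threshold (the threshold value and helper name are illustrative; the control tokens mirror the example above):

```python
def tag_segment(segment, reward, threshold=0.0):
    # Prepend a control token chosen by the segment's reward score.
    # Thresholding at 0.0 is an assumed scheme for illustration.
    token = "<|good|>" if reward >= threshold else "<|bad|>"
    return token + segment

# At inference time, prompting with "<|good|>" steers generation
# toward the high-reward distribution the model learned to associate
# with that control token.
```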
Unlikelihood Training: An objective that minimizes the probability of tokens in low-reward segments, effectively teaching the model what *not* to generate.
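A toy version of the unlikelihood term, applied to tokens from a low-reward segment (the interface matches the MLE sketch above and is an assumption, not the paper's exact formulation):

```python
import math

def unlikelihood_loss(token_probs, eps=1e-8):
    # Penalize probability mass on undesired tokens: minimizing
    # -log(1 - p) drives each token probability p toward zero.
    # eps guards against log(0) when p is close to 1.
    return -sum(math.log(max(1.0 - p, eps)) for p in token_probs) / len(token_probs)
```

The loss is near zero when the model already avoids the bad tokens, and grows sharply as it assigns them high probability.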
RWR: Reward-Weighted Regression—an offline RL objective that weights the standard language modeling loss by the exponentiated reward of the segment.
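A sketch of the RWR objective under the definition above: the standard language modeling loss scaled by the exponentiated segment reward (`beta` is an assumed temperature hyperparameter):

```python
import math

def rwr_loss(token_probs, reward, beta=1.0):
    # Exponentiated reward acts as a per-segment weight: high-reward
    # segments contribute more strongly to the gradient.
    weight = math.exp(reward / beta)
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return weight * nll
```

With reward 0 the objective reduces to plain MLE; positive rewards upweight the segment and negative rewards downweight it.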
AWR: Advantage-Weighted Regression—an offline RL objective that weights updates by the 'advantage' (reward minus a learned value baseline).
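A sketch of AWR following the definition above; it differs from RWR only in weighting by the advantage, reward minus a value baseline (here passed in as a number for illustration, in place of a learned value head):

```python
import math

def awr_loss(token_probs, reward, value_baseline, beta=1.0):
    # Weight by the exponentiated advantage rather than raw reward:
    # segments that beat the baseline are upweighted, segments that
    # fall short are downweighted.
    advantage = reward - value_baseline
    weight = math.exp(advantage / beta)
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return weight * nll
```

A segment whose reward exactly matches the baseline gets weight 1, i.e. the plain MLE loss.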
PII: Personally Identifiable Information—sensitive data like phone numbers or email addresses that models should not memorize or generate.
Pareto frontier: The set of optimal trade-offs where no metric can be improved without degrading another (here, alignment vs. capabilities).
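The trade-off above can be made concrete with a small helper that filters (alignment, capability) score pairs down to the non-dominated set (the function and sample data are illustrative):

```python
def pareto_frontier(points):
    # points: (alignment, capability) pairs, higher is better on both.
    # A point is on the frontier if no other point is at least as good
    # on both metrics and strictly better on at least one.
    frontier = []
    for i, (a, c) in enumerate(points):
        dominated = any(
            (a2 >= a and c2 >= c) and (a2 > a or c2 > c)
            for j, (a2, c2) in enumerate(points) if j != i
        )
        if not dominated:
            frontier.append((a, c))
    return frontier
```

For instance, a run scoring (1.0, 1.0) is dominated by one scoring (1.5, 1.5) and drops off the frontier, while runs trading one metric for the other, such as (2.0, 0.5) and (0.5, 2.0), remain.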
PEP8: The standard style guide for Python code, used here as a proxy for code quality preferences.