SFT: Supervised Fine-Tuning—training a pre-trained model on a smaller, labeled dataset to adapt it to a specific task
RLHF: Reinforcement Learning from Human Feedback—an alignment technique using human preferences to train a reward model and optimize the LLM
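The reward-model step of RLHF is typically fit with a pairwise (Bradley-Terry) preference loss; a minimal numeric sketch, assuming scalar reward scores for a chosen and a rejected response:

```python
import numpy as np

def preference_loss(r_chosen, r_rejected):
    # Bradley-Terry pairwise loss for reward-model training:
    # -log sigmoid(r(chosen) - r(rejected)); minimized when the
    # chosen response scores well above the rejected one.
    margin = r_chosen - r_rejected
    return -np.log(1.0 / (1.0 + np.exp(-margin)))

# Equal scores give -log(0.5); a larger margin gives a smaller loss.
print(preference_loss(0.0, 0.0))   # ~0.6931
print(preference_loss(2.0, 0.0))   # ~0.1269
```

The trained reward model then scores LLM outputs during the RL (e.g., PPO) optimization phase.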
Self-Instruct: A method where a language model generates its own instruction-following training data from a small set of seed tasks
LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique that freezes pre-trained weights and trains small rank-decomposition matrices
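The LoRA forward pass can be sketched in a few lines; dimensions below are illustrative, and only the two low-rank matrices would receive gradients:

```python
import numpy as np

d_out, d_in, r = 64, 64, 4        # rank r << min(d_out, d_in)
alpha = 8                          # LoRA scaling hyperparameter

rng = np.random.default_rng(0)
W = rng.normal(size=(d_out, d_in))        # frozen pre-trained weight
A = rng.normal(size=(r, d_in)) * 0.01     # trainable, small random init
B = np.zeros((d_out, r))                  # trainable, zero init: delta starts at 0

def lora_forward(x):
    # y = W x + (alpha / r) * B A x ; W stays frozen, A and B are trained
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=(d_in,))
# With B zero-initialized, the adapted layer initially matches the frozen one.
assert np.allclose(lora_forward(x), W @ x)
```

The efficiency win is that the trainable parameter count is `r * (d_in + d_out)` instead of `d_in * d_out`.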
Context Distillation: Transferring the capabilities of a model prompted with a long context (e.g., rules/instructions) into the model's weights via fine-tuning on the outputs, so the context isn't needed at inference
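Data construction for context distillation can be sketched as follows; `generate` is a hypothetical stand-in for sampling from the context-prompted model:

```python
def generate(prompt):
    # Hypothetical stub for an LLM call; a real pipeline would sample
    # from the model here.
    return f"response to: {prompt}"

RULES = "Follow these rules: be concise; cite sources.\n\n"
queries = ["What is LoRA?", "Explain RLHF."]

# 1) Sample outputs with the long context (the rules) prepended.
# 2) Fine-tune on (query alone -> output), so the behavior the rules
#    induce is baked into the weights and the context can be dropped
#    at inference time.
distill_data = [(q, generate(RULES + q)) for q in queries]
```

The key property is that the training inputs on the left of each pair no longer contain the rules text.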
Principle Engraving: In this paper, fine-tuning the base model on its own principle-compliant responses (with the principles and intermediate thoughts stripped out) so that the alignment is internalized in the weights
Verbose Cloning: A post-processing step using context distillation to make the aligned model generate more detailed/comprehensive answers
Red-Teaming: Testing AI systems with adversarial inputs (e.g., questions about illegal acts) to find failures; here used to generate diverse training topics