LoRA: Low-Rank Adaptation—a PEFT method that injects trainable low-rank matrices into frozen model layers.
Importance Sampling: A statistical technique where samples are drawn from a distribution different from the one of interest to estimate properties more efficiently; here, sampling layers to update based on their expected gradient norm.
Weight Norm: The magnitude (L2 norm) of the weight matrix or its update; used here as a proxy for how 'important' a layer's update is.
Activation Memory: GPU memory used to store intermediate outputs of neural network layers needed for gradient computation during backpropagation.
Gradient Checkpointing: A technique to reduce memory by not saving all intermediate activations, instead recomputing them during the backward pass.
MT-Bench: A benchmark for evaluating the conversation and instruction-following ability of LLMs using GPT-4 as a judge.
GSM8K: Grade School Math 8K—a dataset of high quality linguistically diverse grade school math word problems.
PubMedQA: A biomedical question answering dataset used to test domain-specific knowledge.
CoT: Chain-of-Thought—a prompting strategy where the model generates intermediate reasoning steps before the final answer.
Full-Parameter Training: Fine-tuning where all model parameters are updated.