LoRA: Low-Rank Adaptation—a method that freezes the pre-trained weights and trains small rank-decomposition matrices to approximate the weight updates during fine-tuning
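A minimal sketch of the idea above, assuming a single linear layer (illustrative names and dimensions, not the official implementation): the frozen weight `W0` is augmented with a trainable low-rank product `B @ A`, scaled by `alpha / r`.

```python
import numpy as np

# Illustrative sketch of a LoRA-augmented linear layer.
d_in, d_out, r, alpha = 8, 8, 2, 4

rng = np.random.default_rng(0)
W0 = rng.standard_normal((d_out, d_in))    # frozen pre-trained weight
A = rng.standard_normal((r, d_in)) * 0.01  # trainable, small random init
B = np.zeros((d_out, r))                   # trainable, zero init: no change at start

def lora_forward(x):
    # h = W0 x + (alpha / r) * B A x
    return W0 @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d_in)
# With B initialized to zero, the adapted layer matches the frozen model exactly.
assert np.allclose(lora_forward(x), W0 @ x)
```

Note that only `A` and `B` (r·(d_in + d_out) values) would be trained, instead of the full d_out·d_in weight matrix.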
intrinsic rank: The hypothesis that learned over-parametrized models reside on a low intrinsic dimension, so that the effective weight updates during adaptation can be represented by low-rank matrices
rank-deficiency: A property of a matrix whose rank is lower than the maximum possible given its dimensions, indicating redundant (linearly dependent) rows or columns
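As a small illustration of the definition above (example matrix is made up): a 3×3 matrix with linearly dependent rows has rank less than 3.

```python
import numpy as np

# A 3x3 matrix whose second row is twice the first is rank-deficient.
M = np.array([[1., 2., 3.],
              [2., 4., 6.],   # 2 x row 0: linearly dependent
              [0., 1., 1.]])
print(np.linalg.matrix_rank(M))  # 2, not the full rank of 3
```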
catastrophic forgetting: The tendency of a neural network to completely and abruptly forget previously learned information upon learning new information
inference latency: The time delay between sending a request to a model and receiving the response
adapter layers: Small neural network modules inserted between layers of a pre-trained model to allow efficient fine-tuning
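A sketch of one such module, assuming a Houlsby-style bottleneck design (down-projection, nonlinearity, up-projection, residual connection); the names and sizes are illustrative.

```python
import numpy as np

# Illustrative bottleneck adapter: project down, apply ReLU, project up,
# and add the result back to the input (residual connection).
d_model, d_bottleneck = 16, 4
rng = np.random.default_rng(1)
W_down = rng.standard_normal((d_bottleneck, d_model)) * 0.01  # trainable
W_up = np.zeros((d_model, d_bottleneck))  # zero init: adapter starts as identity

def adapter(h):
    return h + W_up @ np.maximum(W_down @ h, 0.0)

h = rng.standard_normal(d_model)
# At initialization the adapter passes its input through unchanged.
assert np.allclose(adapter(h), h)
```

The bottleneck keeps the adapter's parameter count small (2·d_model·d_bottleneck per module), but, unlike LoRA, these modules add depth and therefore extra inference latency.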
prefix tuning: A method that optimizes a sequence of continuous task-specific vectors (prefixes) prepended to the input, keeping the model frozen
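The mechanics can be sketched as follows (dimensions are illustrative): trainable continuous vectors are concatenated in front of the frozen input embeddings, and only those vectors receive gradient updates.

```python
import numpy as np

# Prefix-tuning sketch: prepend trainable vectors to frozen token embeddings.
seq_len, prefix_len, d_model = 5, 3, 8
rng = np.random.default_rng(2)
embeddings = rng.standard_normal((seq_len, d_model))        # frozen token embeddings
prefix = rng.standard_normal((prefix_len, d_model)) * 0.02  # trainable prefix vectors

# The frozen model attends over the prefix as if it were extra context tokens.
augmented = np.concatenate([prefix, embeddings], axis=0)
assert augmented.shape == (prefix_len + seq_len, d_model)
```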
VRAM: Video Random Access Memory—memory on the GPU used to store model parameters, gradients, and optimizer states during training
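A back-of-the-envelope estimate of that training footprint, assuming mixed-precision training with Adam (fp16 weights and gradients, fp32 master weights plus two fp32 moment buffers; activations excluded):

```python
# Rough VRAM estimate for full fine-tuning with Adam in mixed precision.
# Per parameter: 2 B weight + 2 B gradient + 4 B fp32 master copy
#              + 4 B first moment + 4 B second moment = 16 bytes.
def training_vram_gb(n_params):
    bytes_per_param = 2 + 2 + 4 + 4 + 4
    return n_params * bytes_per_param / 1e9

print(f"{training_vram_gb(7e9):.0f} GB")  # ~112 GB for a 7B-parameter model
```

This is why parameter-efficient methods like LoRA help: gradients and optimizer states are kept only for the small set of trainable parameters.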