Fisher Information Matrix: A matrix measuring the amount of information that observable random variables carry about an unknown parameter; under standard regularity conditions, it equals the expected Hessian of the negative log-likelihood
Pre-logit layer: The final hidden layer of a neural network, just before the output projection that produces logits (and the subsequent softmax); its activations are the embeddings used to predict the next token
Hessian: A square matrix of second-order partial derivatives of a scalar-valued function (here, the loss function), describing the local curvature
Log-determinant: The natural logarithm of the determinant of a matrix; maximizing this (D-optimality) corresponds to minimizing the volume of the confidence ellipsoid of parameter estimates
Submodularity: A property of set functions where the marginal gain of adding an element decreases as the set grows (diminishing returns); for monotone submodular functions, a greedy algorithm achieves a (1 − 1/e) approximation to the optimal subset
Optimal Design: A field of statistics concerned with selecting data points (experiments) to minimize the variance of parameter estimates
SFT: Supervised Fine-Tuning—adapting a pre-trained model to a specific domain using labeled examples
LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique that freezes pre-trained weights and injects trainable low-rank decomposition matrices
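Several of the entries above (Fisher information, log-determinant, submodularity, optimal design) combine into a single procedure: greedily selecting data points to maximize the log-determinant of a (regularized) Fisher matrix. The sketch below illustrates this with a random candidate pool; the matrix sizes and the identity regularizer are illustrative assumptions, not taken from the source.

```python
# Sketch (assumptions noted above): greedy D-optimal subset selection.
# f(S) = log det(I + X_S^T X_S) is monotone submodular in S, so the
# greedy rule of repeatedly adding the highest-marginal-gain point
# carries the (1 - 1/e) approximation guarantee.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))  # 50 candidate points, 4 parameters

def logdet_fisher(idx):
    """log det of a regularized Fisher-style matrix I + X_S^T X_S."""
    Xs = X[list(idx)]
    _, val = np.linalg.slogdet(np.eye(X.shape[1]) + Xs.T @ Xs)
    return val

selected, remaining = [], set(range(len(X)))
for _ in range(8):  # pick a budget of 8 points greedily
    best = max(remaining, key=lambda i: logdet_fisher(selected + [i]))
    selected.append(best)
    remaining.remove(best)

print(selected)
```

Maximizing the log-determinant here is exactly the D-optimality criterion from the Optimal Design entry: each greedy step shrinks the confidence ellipsoid of the parameter estimates as much as possible.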
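The LoRA entry can likewise be made concrete. A minimal sketch of the reparameterization, with illustrative (assumed) layer sizes: the frozen weight W is augmented by a trainable low-rank update B @ A, and B is initialized to zero so fine-tuning starts exactly from the pre-trained model.

```python
# Sketch (illustrative sizes, not from the source): the LoRA
# reparameterization. Only A and B would be trained; W stays frozen.
import numpy as np

d_out, d_in, r, alpha = 64, 32, 4, 8   # assumed dimensions and rank
rng = np.random.default_rng(0)

W = rng.normal(size=(d_out, d_in))      # frozen pre-trained weight
A = rng.normal(size=(r, d_in)) * 0.01   # trainable, small random init
B = np.zeros((d_out, r))                # trainable, zero init

def lora_forward(x):
    # Adapted layer: frozen path plus scaled low-rank correction.
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d_in)
# With B = 0, the adapted layer matches the frozen layer exactly.
assert np.allclose(lora_forward(x), W @ x)
```

The parameter saving comes from training r * (d_in + d_out) entries instead of d_in * d_out, which is what makes LoRA attractive for SFT of large models.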