early exit: A technique where model predictions are generated from intermediate layers rather than the final layer
contrastive decoding: A decoding strategy that subtracts the log-probabilities of an 'amateur' (weaker) model from an 'expert' (stronger) model to penalize common errors
logits: The raw, unnormalized scores output by the final layer of a neural network before the softmax function converts them to probabilities
DoLa: Decoding by Contrasting Layers—a baseline method that uses early exit layers as the amateur model for contrastive decoding
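Jensen-Shannon divergence: A symmetric, bounded measure of how much two probability distributions differ, computed as the average KL divergence of each distribution from their equal-weight mixture; a value of 0 means the distributions are identical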
MC score: Multiple Choice score—used here to measure accuracy on QA benchmarks like TruthfulQA-MC
SLEB: Streamlining LLMs through Redundancy Verification and Elimination of Transformer Blocks, a pruning method used here to filter candidate layers based on perplexity impact
batched inference: Processing multiple inputs (or model configurations) simultaneously in one GPU operation to save time
informativeness: A metric defined in the paper measuring the overlap between the top-k tokens of the amateur and expert models
flatness: A property of probability distributions measured by entropy; high flatness (high entropy) means the distribution is close to uniform, i.e. the model is uncertain
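
Several of the terms above (logits, contrastive decoding, Jensen-Shannon divergence, flatness) reduce to short formulas. The following is a minimal sketch in plain Python, not the paper's implementation: the function names and the toy three-token logits are hypothetical, and the plausibility filtering that published contrastive decoding methods typically add is left out.

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(p):
    """Shannon entropy in nats; higher means a flatter distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)

def kl_divergence(p, q):
    """KL(p || q); assumes q > 0 wherever p > 0."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: mean KL of p and q from their mixture."""
    mix = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl_divergence(p, mix) + 0.5 * kl_divergence(q, mix)

def contrastive_scores(expert_logits, amateur_logits):
    """Expert log-probabilities minus amateur log-probabilities, per token."""
    expert = softmax(expert_logits)
    amateur = softmax(amateur_logits)
    return [math.log(e) - math.log(a) for e, a in zip(expert, amateur)]

# Toy example (hypothetical values): both models rank token 0 highest,
# but only the expert gives token 1 noticeably more weight than token 2.
expert_logits = [2.0, 1.0, 0.1]
amateur_logits = [2.0, 0.1, 1.0]

scores = contrastive_scores(expert_logits, amateur_logits)
best = max(range(len(scores)), key=scores.__getitem__)
print("contrastive scores:", scores)
print("token favored after contrast:", best)
print("JSD(expert, amateur):",
      js_divergence(softmax(expert_logits), softmax(amateur_logits)))
```

In this toy case the contrast promotes the token that the expert prefers more strongly than the amateur does, even though it is not the expert's top token. That is one reason practical methods restrict the contrast to tokens the expert already considers plausible.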