Coverage Profile: The probability that the ratio of the data-distribution probability to the model probability of a response is less than a threshold N (equivalently, the CDF of the log density ratio evaluated at log N)
Best-of-N: A sampling strategy where N responses are generated and the one with the highest reward is selected
Pass@N: The probability that at least one correct response is generated within N attempts
Autoregressive Linear Model: A simplified theoretical model where the unnormalized log-probability (logit) of each token is linear in a fixed feature map of the history
Sequence-level Cross-Entropy: The total cross-entropy summed over all tokens in a sequence; typically scales linearly with sequence length H
Missing Mass: The phenomenon where a model assigns zero or near-zero probability to valid responses, potentially causing infinite KL divergence
Test-Time Training (TTT): Updating model parameters on-the-fly during inference using the prompt or generated tokens
Inherent Variance: A variance term capturing the number of 'pivotal' tokens in a sequence that have high entropy, acting as an effective sequence length
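
Several of the definitions above reduce to one-line computations, and a minimal Python sketch may make them concrete. All function names here are hypothetical. The coverage estimate assumes the log-probabilities come from responses sampled from the data distribution, and the Pass@N formula assumes independent attempts with a fixed per-attempt success probability; neither assumption is stated in the glossary itself.

```python
import math

def coverage_profile(log_p_data, log_p_model, n):
    """Empirical fraction of samples whose density ratio p_data / p_model is below n.

    Assumption: each pair of log-probabilities belongs to a response drawn
    from the data distribution.
    """
    log_n = math.log(n)
    hits = sum(1 for ld, lm in zip(log_p_data, log_p_model) if ld - lm < log_n)
    return hits / len(log_p_data)

def best_of_n(sample_fn, reward_fn, n):
    """Draw n responses and return the one with the highest reward."""
    candidates = [sample_fn() for _ in range(n)]
    return max(candidates, key=reward_fn)

def pass_at_n(p_correct, n):
    """Probability of at least one correct response in n attempts.

    Assumption: attempts are independent, each correct with probability p_correct.
    """
    return 1.0 - (1.0 - p_correct) ** n

def sequence_cross_entropy(token_log_probs):
    """Sequence-level cross-entropy: negative sum of per-token log-probabilities."""
    return -sum(token_log_probs)

def linear_next_token_probs(theta, token_features):
    """Softmax over logits that are linear in a fixed feature map.

    token_features holds one feature vector per candidate next token
    (hypothetical layout, chosen for illustration).
    """
    logits = [sum(t * f for t, f in zip(theta, phi)) for phi in token_features]
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

Because the sequence-level cross-entropy is a sum over tokens, a sequence of H tokens with similar per-token loss gives a total that grows roughly in proportion to H, which matches the linear scaling noted in the definition.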