Parity function: A function that outputs 1 if the sum of selected binary inputs is odd, and 0 otherwise (equivalently, the XOR of the selected bits)
CoT: Chain-of-Thought—a prompting technique where the model generates intermediate reasoning steps before the final answer
Sample complexity: The number of training examples required for a model to learn a target function to a specific accuracy
Sparse dependence: A property where the next token in a sequence depends only on a small subset of previous tokens
Attention sparsity: A state where attention weights are concentrated on a few specific tokens (one-hot or near one-hot) rather than distributed uniformly
Secret set: The subset of input indices that actually determine the output of the parity function; finding these is the core learning challenge
Hinge loss: A loss function used for classification (often in SVMs), defined as max(0, 1 - y*y_pred) for labels y in {-1, +1}
DenseNet-style structure: A modification to residual connections in which layer outputs are concatenated rather than added, preserving representational power while simplifying analysis
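The first three concepts above can be made concrete in a few lines. The sketch below is illustrative only: the secret set shown is a made-up example, not one from the text, and the label mapping to {-1, +1} is assumed so the hinge loss applies.

```python
SECRET_SET = [1, 3, 4]  # hypothetical secret indices that determine the output

def parity(x, secret=SECRET_SET):
    """Return 1 if the number of 1-bits at the secret indices is odd, else 0."""
    return sum(x[i] for i in secret) % 2

def hinge_loss(y, y_pred):
    """Hinge loss max(0, 1 - y * y_pred), assuming labels y in {-1, +1}."""
    return max(0.0, 1.0 - y * y_pred)

x = [0, 1, 0, 1, 1]        # bits at indices 1, 3, 4 sum to 3 (odd)
out = parity(x)            # -> 1
y = 2 * out - 1            # map {0, 1} -> {-1, +1} for the hinge loss
loss = hinge_loss(y, 0.5)  # correctly signed prediction inside the margin -> 0.5
```

Only the indices in the secret set affect the output; flipping any bit outside it leaves the parity unchanged, which is exactly what makes identifying the secret set the core of the sample-complexity question.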