SA: Self Attention—the standard mechanism in Transformers where tokens aggregate information from other tokens in the sequence
FFN: Feed Forward Network—the position-wise block in Transformers that processes each token independently
XSA: Exclusive Self Attention—the proposed method that removes the self-value component from the attention output
Attention Similarity Bias: The tendency of standard attention outputs to have high cosine similarity with the current token's input value vector
RoPE: Rotary Positional Embeddings—a method for encoding position information by rotating the query and key vectors
Attention Sink: The phenomenon where attention heads dump massive weight on specific tokens (like the start token or current token) to discard unnecessary information
NanoGPT: A simple, clean repository for training GPT-style models, used here as the codebase
AdamW: A variation of the Adam optimizer with decoupled weight decay
Value Vector: The vector (v) in attention mechanisms that represents the content information to be aggregated
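To make the XSA definition above concrete, here is a minimal single-head NumPy sketch (not code from the NanoGPT codebase): on top of the usual causal mask, the self-logit (position j == i) is also masked out, so each token's output aggregates only other tokens' value vectors and the self-value component is removed. The zero output for the first token is an assumption of this sketch, since that token has no other positions to attend to.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_attention(q, k, v, exclude_self=False):
    """Single-head causal attention over (T, d) query/key/value arrays.

    With exclude_self=True, the self-logit (j == i) is masked in addition
    to the causal mask, sketching the XSA idea: each token's output is a
    weighted sum of *other* tokens' value vectors only. The first token,
    which then has nothing to attend to, gets a zero output here (an
    assumption of this sketch, not necessarily the paper's choice).
    """
    T, d = q.shape
    logits = q @ k.T / np.sqrt(d)
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)  # future positions
    if exclude_self:
        mask |= np.eye(T, dtype=bool)                 # also drop j == i
    logits = np.where(mask, -np.inf, logits)
    w = np.nan_to_num(softmax(logits, axis=-1))       # all-masked row -> zeros
    return w @ v

T, d = 4, 8
rng = np.random.default_rng(0)
q, k, v = rng.normal(size=(3, T, d))

sa_out = causal_attention(q, k, v)                     # standard SA
xsa_out = causal_attention(q, k, v, exclude_self=True) # XSA sketch
```

In standard causal attention the first token can only attend to itself, so its output is exactly its own value vector—the extreme case of the attention similarity bias that XSA removes.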