Dual-Stream Decomposition: Splitting the residual stream into $\mathbf{x}_t$ (token info, updated by attention) and $\mathbf{x}_e$ (context info, updated by FFNs)
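A minimal numerical sketch of the dual-stream update, with hypothetical stand-ins for the attention and FFN sublayers (the sublayer bodies and the exact read pattern of the FFN are assumptions, not the paper's implementation):

```python
import numpy as np

d = 16
rng = np.random.default_rng(0)

# Hypothetical sublayer stand-ins: real versions would be learned
# modules; these just produce small bounded updates.
def attn(x_t, x_e):
    return 0.1 * np.tanh(x_t + x_e)

def ffn(x_t, x_e):
    return 0.1 * np.tanh(x_t - x_e)

# Token stream starts at the embeddings; context stream starts empty.
x_t = rng.standard_normal(d)
x_e = np.zeros(d)
x_t0 = x_t.copy()

# Dual-stream residual update over four layers: attention writes to
# the token stream x_t, the FFN writes to the context stream x_e.
for _ in range(4):
    x_t = x_t + attn(x_t, x_e)
    x_e = x_e + ffn(x_t, x_e)
```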
Channelized Mixing: Restricting the linear projections in attention heads to structured forms that control information flow between heads (e.g., block-diagonal or Kronecker)
Kronecker Mixing: A projection strategy using $W_{heads} \otimes I$, allowing scalar communication between heads while preserving within-head vector structure
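A sketch of the Kronecker variant (the sizes `H` and `d_head` are illustrative assumptions). The point is that $W_{heads} \otimes I$ mixes heads with scalar weights only, so each output head is a scalar-weighted sum of input head vectors:

```python
import numpy as np

H, d_head = 4, 8                # number of heads, dims per head
rng = np.random.default_rng(0)

# Scalar head-to-head mixing weights, lifted to the full model
# dimension by a Kronecker product with the identity.
W_heads = rng.standard_normal((H, H))
W = np.kron(W_heads, np.eye(d_head))      # (H*d_head, H*d_head)

# A residual-stream vector, viewed as H stacked head vectors.
x = rng.standard_normal(H * d_head)
y = W @ x

# Equivalent per-head view: output head i is sum_j W_heads[i, j] *
# (input head j), so within-head vector structure is preserved.
x_heads = x.reshape(H, d_head)
y_heads = np.einsum("ij,jd->id", W_heads, x_heads)
assert np.allclose(y, y_heads.reshape(-1))
```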
Attention Amplification: A diagnostic technique where attention logits are scaled by a factor $\alpha > 1$ before softmax to test whether the model relies on discrete selection or soft mixing
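The mechanics can be seen on a single attention row (the logit values and $\alpha = 10$ are illustrative assumptions): amplification pushes the softmax toward a one-hot argmax, so comparing model behavior with and without it probes whether soft mixing actually matters.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

logits = np.array([2.0, 1.0, 0.5, 0.1])    # illustrative attention logits

soft = softmax(logits)            # ordinary soft mixing
amplified = softmax(10.0 * logits)  # alpha = 10: near one-hot selection
```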
Frozen-Token-Stream (FTS): A configuration where the token stream $\mathbf{x}_t$ is fixed to the initial embeddings and never updated, forcing all processing into the context stream
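One plausible reading of the FTS configuration as code (the combined sublayer body is a hypothetical stand-in): the token stream is read at every layer but never written, so all accumulated state lives in the context stream.

```python
import numpy as np

d = 16
rng = np.random.default_rng(0)

def sublayer(x_t, x_e):
    # Hypothetical combined update: reads the frozen token stream,
    # writes only the context stream.
    return 0.1 * np.tanh(x_t - x_e)

x_t = rng.standard_normal(d)    # fixed to the initial embeddings
x_t0 = x_t.copy()
x_e = np.zeros(d)

for _ in range(4):
    x_e = x_e + sublayer(x_t, x_e)   # x_t is read but never updated
```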
CLN (Channel-Aware Layer Normalization): Normalizes each head's dimensions independently to preserve head isolation
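A minimal CLN sketch, assuming the model dimension splits evenly into `H` head channels; learned gain/bias terms are omitted for clarity. Each head's slice is standardized using only its own statistics, so no cross-head leakage occurs through normalization:

```python
import numpy as np

def channel_aware_layernorm(x, H, eps=1e-5):
    """Normalize each head's d_head-dimensional slice independently,
    so one head's statistics never affect another's activations."""
    d_head = x.shape[-1] // H
    xs = x.reshape(*x.shape[:-1], H, d_head)
    mu = xs.mean(axis=-1, keepdims=True)
    var = xs.var(axis=-1, keepdims=True)
    out = (xs - mu) / np.sqrt(var + eps)
    return out.reshape(x.shape)

x = np.random.default_rng(0).standard_normal((2, 3, 32))
out = channel_aware_layernorm(x, H=4)
```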