marginal prediction: Predicting the target based only on the base input B, ignoring the selector z, resulting in a probability distribution over all K possible targets
conditional prediction: Predicting the specific target A using both the base B and the selector z, yielding a deterministic, zero-entropy prediction
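The entropy gap between the two prediction modes can be made concrete with a minimal sketch. The value of K and the index of A below are hypothetical, chosen only for illustration:

```python
import math

# Hypothetical setup: K possible targets; the base input B alone leaves
# the target ambiguous, while the selector z pins it down exactly.
K = 16

# Marginal prediction: without z, the model can at best spread probability
# over all K targets, here uniformly.
marginal = [1.0 / K] * K
H_marginal = -sum(p * math.log(p) for p in marginal)  # = log K ≈ 2.77 nats

# Conditional prediction: with z, the model puts all mass on the true
# target A, so the entropy of its output distribution is zero.
conditional = [0.0] * K
conditional[3] = 1.0  # index of A is arbitrary in this sketch
H_conditional = -sum(p * math.log(p) for p in conditional if p > 0)  # = 0.0

print(H_marginal, H_conditional)
```

The marginal solution is thus a log K entropy floor that the model must descend from to reach the conditional solution.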
entropic force: A statistical tendency for a system to remain in high-entropy states (broad basins) because there are more microstates there, effectively opposing movement to sharper, lower-entropy minima
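A toy sketch of why broad basins win by microstate counting alone. The basin sizes below are made-up numbers, and "uniform sampling over microstates" stands in for the actual training dynamics:

```python
import random

random.seed(0)

# Two loss basins of equal depth, but the broad basin contains many more
# low-loss configurations (microstates) than the sharp one. Sampling
# configurations uniformly at random, the system is found in the broad
# basin almost always -- an entropic force with no loss difference
# driving it.
broad_states = 1000   # microstates in the broad basin (assumed count)
sharp_states = 10     # microstates in the sharp basin (assumed count)

samples = [random.randrange(broad_states + sharp_states) for _ in range(10_000)]
in_broad = sum(s < broad_states for s in samples) / len(samples)
print(in_broad)  # ≈ broad_states / (broad_states + sharp_states) ≈ 0.99
```

Escaping to a sharper, lower-entropy minimum means fighting this counting bias, which is why the force effectively opposes the move.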
grokking: A phenomenon where a model suddenly learns to generalize long after achieving near-zero training error (or, in this setting, long after the training loss plateaus)
saddle point: A point in the loss landscape with zero gradient that is a minimum in some directions but a maximum in others
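The textbook example of a saddle, here as a quick numerical check. The function f(x, y) = x² − y² is not from the source; it is the standard illustration:

```python
# f(x, y) = x**2 - y**2 has zero gradient at the origin, curves up along
# x (a minimum direction) and down along y (a maximum direction).
def f(x, y):
    return x**2 - y**2

def grad(x, y):
    return (2 * x, -2 * y)

print(grad(0.0, 0.0))            # gradient vanishes at the saddle
print(f(0.1, 0.0), f(0.0, 0.1))  # positive along x, negative along y
```

Gradient descent stalls near such points because the gradient signal vanishes even though the point is not a minimum.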
directional asymmetry: The phenomenon where a model can learn a mapping X->Y easily but struggles to learn Y->X (or in this context, learns the marginal faster than the conditional)
reversal curse: A specific failure mode where LLMs trained on 'A is B' cannot answer 'Who is B?', i.e., they fail to infer the reverse relation 'B is A'