MFU: Model FLOPs Utilization—the ratio of the floating-point operations per second the model actually achieves to the hardware's theoretical peak floating-point operations per second
DLRM: Deep Learning Recommendation Model—a standard architecture for recommendation combining embedding layers for categorical features and dense layers for numerical features
Token Mixing: A mechanism to mix information across different tokens (features) without using pair-wise attention scores, often using simple projection or shuffling
MoE: Mixture-of-Experts—a neural network architecture where different subsets of the network (experts) are activated for different inputs to increase capacity without a proportional increase in per-input inference cost
QPS: Queries Per Second—a measure of the throughput of a serving system
PFFN: Per-Token Feed-Forward Network—a design where each token position has its own independent set of FFN parameters, rather than sharing weights across all tokens
ROI: Return on Investment—in this context, the performance gain achieved per unit of additional computational cost or latency
Quantization: The process of mapping values from a large set (such as 32-bit floats) to a smaller set (such as 8-bit integers) to reduce model size and speed up computation
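
To make the Quantization entry concrete, here is a minimal sketch of one common scheme, symmetric per-tensor int8 quantization. This scheme is an assumption chosen for illustration and the glossary does not say which one is used. The names quantize_int8 and dequantize_int8 are hypothetical helpers, not part of any library.

```python
def quantize_int8(values):
    """Map floats to integer codes in [-127, 127] using one shared scale.

    Assumption: symmetric per-tensor quantization, with the scale chosen so
    that the largest magnitude in `values` maps to 127.
    """
    max_abs = max((abs(v) for v in values), default=0.0)
    # Fall back to a scale of 1.0 for an all-zero (or empty) input.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer code and clamp to the int8 range.
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize_int8(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


weights = [0.5, -1.27, 0.0, 1.0]          # illustrative values only
codes, scale = quantize_int8(weights)
restored = dequantize_int8(codes, scale)
# Largest absolute rounding error; it is at most half a quantization step.
max_error = max(abs(w, ) if False else abs(w - r) for w, r in zip(weights, restored))
```

Each value is stored as a one-byte code plus one shared scale instead of a four-byte float, and the reconstruction error per value is bounded by half of the scale.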