CoT: Chain-of-Thought—a reasoning process where the model generates intermediate steps before the final answer
KV Cache: Key-Value Cache—stores intermediate attention computations to speed up autoregressive generation
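To make the cache's role concrete, here is a minimal NumPy sketch of autoregressive attention with a KV cache: at each decoding step only the newest token's key/value rows are computed, while all earlier rows are reused from the cache. The projection matrices and dimensions are illustrative, not taken from any particular model.

```python
import numpy as np

def attention(q, K, V):
    # Scaled dot-product attention for a single query vector.
    scores = q @ K.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

rng = np.random.default_rng(0)
d = 8
Wk = rng.normal(size=(d, d))   # illustrative key projection
Wv = rng.normal(size=(d, d))   # illustrative value projection

K_cache = np.empty((0, d))     # grows by one row per generated token
V_cache = np.empty((0, d))

for step in range(5):
    x = rng.normal(size=d)     # hidden state of the newest token
    # Only the new token's key/value are computed; past rows are reused,
    # turning O(n^2) recomputation per step into O(n) cache lookups.
    K_cache = np.vstack([K_cache, x @ Wk])
    V_cache = np.vstack([V_cache, x @ Wv])
    out = attention(x, K_cache, V_cache)
```

After five steps the cache holds five key rows and five value rows, one per generated token; its memory footprint grows linearly with sequence length, which is why quantizing the KV cache matters for long reasoning chains.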
W8A8: Quantization configuration with 8-bit Weights and 8-bit Activations
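As a sketch of what W8A8 means in practice, the following NumPy snippet applies symmetric per-tensor int8 quantization to both a weight matrix and an activation matrix, runs the matmul in integer arithmetic with int32 accumulation, and dequantizes with the two scales. Real W8A8 kernels use per-channel/per-token scales and fused kernels; this is a minimal illustration only.

```python
import numpy as np

def quantize_sym(x, bits=8):
    # Symmetric per-tensor quantization: map [-max|x|, max|x|] to int8 range.
    qmax = 2 ** (bits - 1) - 1                  # 127 for 8 bits
    scale = np.abs(x).max() / qmax
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 16)).astype(np.float32)   # weights (8-bit)
X = rng.normal(size=(4, 16)).astype(np.float32)    # activations (8-bit)

Wq, sw = quantize_sym(W)
Xq, sx = quantize_sym(X)

# Integer matmul accumulated in int32, then dequantized with both scales.
Y_q = (Xq.astype(np.int32) @ Wq.T.astype(np.int32)) * (sx * sw)
Y_fp = X @ W.T
err = np.abs(Y_q - Y_fp).max()
```

The int32 accumulator is essential: summing many int8 products overflows int8/int16, so hardware int8 GEMMs accumulate in 32 bits before dequantization.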
AWQ: Activation-aware Weight Quantization—a method that protects salient weights based on activation magnitude
GPTQ: Generative Pre-trained Transformer Quantization—a one-shot, layer-wise post-training quantization method that uses approximate second-order (Hessian) information to compensate for rounding error
SmoothQuant: A method that migrates quantization difficulty from activations to weights by smoothing activation outliers
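The "difficulty migration" can be shown directly: SmoothQuant divides each activation channel by a per-channel scale and multiplies the corresponding weight row by the same scale, leaving the layer's output mathematically unchanged while shrinking activation outliers. The sketch below uses the paper's scale formula s_j = max|X_j|^α / max|W_j|^(1−α) with α = 0.5 on synthetic data.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 8))
X[:, 3] *= 50.0                      # channel 3 is an activation outlier
W = rng.normal(size=(8, 8))

alpha = 0.5                          # migration strength hyperparameter
# Per-input-channel smoothing scale: large for outlier activation channels.
s = (np.abs(X).max(axis=0) ** alpha) / (np.abs(W).max(axis=1) ** (1 - alpha))

X_s = X / s                          # smoothed activations, easier to quantize
W_s = W * s[:, None]                 # weights absorb the quantization difficulty

# The linear layer's output is mathematically unchanged: (X/s)(sW) = XW.
assert np.allclose(X @ W, X_s @ W_s)
# The activation dynamic range shrinks, so int8 activation quantization loses less.
assert np.abs(X_s).max() < np.abs(X).max()
```

Because weights are static and roughly uniform while activations have runtime outliers, shifting part of the range into the weights makes both tensors easier to quantize than the activations alone were.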
FlatQuant: A recent quantization method that learns fast affine transformations to flatten weight and activation distributions, targeting low-bit weight-activation settings
QuaRot: A quantization method using rotation matrices to suppress outliers in weights and activations
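The rotation idea can be illustrated with a normalized Hadamard matrix: because the rotation is orthogonal, folding it into the activations and its transpose into the weights leaves the layer's output exactly unchanged, while spreading an outlier channel's energy across all channels. This is a simplified sketch on synthetic data, not QuaRot's full pipeline (which also rotates around residual streams and the KV cache).

```python
import numpy as np

def hadamard(n):
    # Sylvester construction of an orthonormal Hadamard matrix (n a power of 2).
    H = np.ones((1, 1))
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

rng = np.random.default_rng(0)
d = 16
X = rng.normal(size=(64, d))
X[:, 5] *= 40.0                      # one outlier activation channel
W = rng.normal(size=(d, d))

H = hadamard(d)                      # orthogonal: H @ H.T == I
X_rot = X @ H                        # rotate activations
W_rot = H.T @ W                      # fold the inverse rotation into weights

# Computational invariance: X H H^T W = X W, so accuracy is preserved exactly.
assert np.allclose(X @ W, X_rot @ W_rot)
# The outlier's energy is spread across channels, lowering the max magnitude.
assert np.abs(X_rot).max() < np.abs(X).max()
```

With the maximum magnitude reduced, a per-tensor quantizer wastes far fewer levels on a single outlier channel, which is what makes rotation-based methods effective at 4-bit weight-activation settings.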
AIME-120: A difficult math benchmark consisting of 120 problems from the American Invitational Mathematics Examination
RL-based reasoning: Models that acquire reasoning ability via Reinforcement Learning (e.g., QwQ) rather than supervised distillation alone