NAS: Neural Architecture Search—automated techniques to find optimal neural network structures (e.g., removing layers) under constraints
FFN Fusion: A technique that merges runs of consecutive Feed-Forward Network layers (left adjacent after intervening attention layers are removed) into fewer, wider layers whose work can be parallelized
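A minimal NumPy sketch of the idea behind fusion, with hypothetical weights and a single-token input: two FFNs evaluated in parallel on the same input (the approximation made once the attention layers between them are gone) are exactly equivalent to one wider FFN built by stacking their weight matrices.

```python
import numpy as np

rng = np.random.default_rng(0)
d, h = 8, 16  # model dim, per-FFN hidden dim (illustrative sizes)

# Two FFN blocks left adjacent after attention removal.
W1a, W2a = rng.normal(size=(h, d)), rng.normal(size=(d, h))
W1b, W2b = rng.normal(size=(h, d)), rng.normal(size=(d, h))

relu = lambda z: np.maximum(z, 0)
x = rng.normal(size=(d,))

# Parallel evaluation: both FFNs read the same input; outputs are summed.
parallel = W2a @ relu(W1a @ x) + W2b @ relu(W1b @ x)

# Fused layer: stack hidden projections, concatenate output projections.
W1f = np.vstack([W1a, W1b])   # (2h, d)
W2f = np.hstack([W2a, W2b])   # (d, 2h)
fused = W2f @ relu(W1f @ x)

assert np.allclose(parallel, fused)
```

The equality is exact for the parallel form; the accuracy question in practice is how well parallel evaluation approximates the original sequential FFNs.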
TP: Tensor Parallelism—splitting a model's weight tensors across multiple GPUs so that models too large for a single device fit in memory, with each GPU computing on its shard
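A toy illustration of the column-parallel pattern, simulating the GPU shards as NumPy array slices (all names hypothetical): each "device" stores only its rows of the weight matrix, computes a local matmul, and the full output is reassembled by concatenation (an all-gather in a real system).

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_out, n_gpus = 6, 8, 2

W = rng.normal(size=(d_out, d_in))  # full weight; never held on one "device"
x = rng.normal(size=(d_in,))

shards = np.array_split(W, n_gpus, axis=0)  # each device keeps d_out/n_gpus rows
partials = [Ws @ x for Ws in shards]        # purely local matmuls
y = np.concatenate(partials)                # all-gather reassembles the output

assert np.allclose(y, W @ x)
```

Splitting along the other axis (row parallelism) is the complementary pattern: each device gets a slice of the input dimension and the partial outputs are summed via an all-reduce instead.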
CoT: Chain of Thought—intermediate reasoning steps a model generates before producing a final answer
SFT: Supervised Fine-Tuning—training a model on labeled examples
FP8: Floating Point 8—an 8-bit data format used here to accelerate text generation during the reinforcement learning phase
KV-cache: Key-Value cache—storing already-computed attention keys and values so each newly generated token attends over past tokens without recomputing their projections, speeding up autoregressive decoding
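A single-head sketch of cached decoding, with hypothetical projection matrices: per step, only the newest token is projected into a key and value, which are appended to the cache; attention for the new query then runs over the accumulated keys/values.

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(2)
d = 4
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

K_cache, V_cache = [], []

def decode_step(x):
    # Project only the newest token; past keys/values come from the cache.
    q = Wq @ x
    K_cache.append(Wk @ x)
    V_cache.append(Wv @ x)
    K, V = np.stack(K_cache), np.stack(V_cache)
    attn = softmax(K @ q / np.sqrt(d))
    return attn @ V

tokens = rng.normal(size=(5, d))
outs = [decode_step(t) for t in tokens]
```

Without the cache, every step would re-project all previous tokens, making per-step cost grow with sequence length instead of staying constant for the projections.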
CPT: Continued Pretraining—additional training on a base model before fine-tuning
Puzzle: The specific NAS framework used to compress the Llama 3 models by creating a library of alternative efficient blocks