Encoder-only: A transformer architecture (like BERT) that processes text bidirectionally to build contextual representations, used primarily for classification and retrieval rather than text generation
RAG: Retrieval-Augmented Generation—systems that retrieve relevant documents to help an LLM answer questions
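The retrieval half of RAG can be sketched with a toy lexical scorer — a minimal illustration, not a real system (production pipelines score with dense embeddings, e.g. from an encoder-only model; the word-overlap scoring and prompt template here are illustrative assumptions):

```python
from collections import Counter

def retrieve(query, docs, k=2):
    """Toy retrieval step of RAG: score each document by word overlap
    with the query and return the top-k. Real systems replace this
    with embedding similarity search."""
    q = Counter(query.lower().split())
    def score(doc):
        # Count query words that also appear in the document.
        return sum((q & Counter(doc.lower().split())).values())
    return sorted(docs, key=score, reverse=True)[:k]

def build_prompt(query, docs, k=2):
    """Augmentation step: prepend retrieved context to the question
    before handing the prompt to the LLM."""
    context = "\n".join(retrieve(query, docs, k))
    return f"Context:\n{context}\n\nQuestion: {query}"
```

The generation step is unchanged: the LLM simply answers the augmented prompt, grounded by the retrieved passages.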
RoPE: Rotary Position Embeddings—a method for encoding token order by rotating query and key vectors as a function of position, which generalizes better to long sequences than fixed absolute position embeddings
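The rotation can be shown in a few lines — a minimal sketch of the standard RoPE formulation (pairing adjacent dimensions and using the conventional base of 10000; the function name is ours):

```python
import math

def rope(x, pos, base=10000.0):
    """Apply rotary position embedding to vector x at position pos.
    Dimensions are paired (0,1), (2,3), ... and each pair is rotated
    by the angle pos * base^(-i/d). Because rotations compose, the
    dot product of two rotated vectors depends only on their
    relative offset, not their absolute positions."""
    d = len(x)
    out = [0.0] * d
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        out[i]     = x[i] * c - x[i + 1] * s
        out[i + 1] = x[i] * s + x[i + 1] * c
    return out
```

The relative-position property is what makes RoPE extrapolate: attention scores between positions 5 and 9 equal those between 100 and 104.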
GeGLU: GELU-gated Linear Unit—a feed-forward variant in which one linear projection, passed through GELU, elementwise gates a second projection; it typically outperforms a plain GELU feed-forward layer at similar cost
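The gating itself is simple once the input has been projected — a sketch operating on an already-projected hidden vector (the split-in-half layout is one common convention; in a real layer the two halves come from two separate weight matrices):

```python
import math

def gelu(x):
    """Exact GELU via the Gaussian CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def geglu(hidden):
    """GeGLU: split the projected vector into two halves; GELU of the
    first half gates (elementwise multiplies) the second half.
    Output is half the width of the input."""
    h = len(hidden) // 2
    u, v = hidden[:h], hidden[h:]
    return [gelu(ui) * vi for ui, vi in zip(u, v)]
```

The gate is "learnable" in the sense that the projection producing the gating half is trained like any other weight matrix.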
Unpadding: An efficiency technique that removes padding tokens from batches, processing all valid tokens as a single concatenated sequence to avoid wasted compute
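A minimal sketch of the bookkeeping involved, assuming pad token id 0: real tokens are concatenated and cumulative boundaries are recorded (the `cu_seqlens` layout mirrors what variable-length attention kernels expect, so attention never crosses sequence boundaries):

```python
def unpad(batch, pad_id=0):
    """Remove padding from a batch of token-id sequences: concatenate
    all real tokens into one flat list and record cumulative sequence
    boundaries so each sequence can still be attended to separately."""
    tokens, cu_seqlens = [], [0]
    for seq in batch:
        valid = [t for t in seq if t != pad_id]
        tokens.extend(valid)
        cu_seqlens.append(len(tokens))
    return tokens, cu_seqlens
```

With padding, the example batch below would spend 3 of 8 positions on pad tokens; unpadded, only the 5 real tokens are processed.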
Flash Attention: A memory-efficient algorithm that computes exact attention in tiles, minimizing reads/writes to slow GPU memory and speeding up training and inference
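The core trick is the online softmax: scores are processed block by block while a running max and normalizer are maintained, so the full score vector is never materialized. A single-query, scalar-feature sketch of that idea (not the real fused GPU kernel):

```python
import math

def streaming_attention(q, ks, vs, block=2):
    """Attention for one query, computed over key/value blocks with
    the online softmax: keep a running max m, normalizer l, and
    weighted-value accumulator, rescaling them whenever a new block
    raises the max. This is the numerical core of Flash Attention."""
    m = float("-inf")  # running max of scores
    l = 0.0            # running softmax denominator
    acc = 0.0          # running numerator (weighted sum of values)
    for start in range(0, len(ks), block):
        for k, v in zip(ks[start:start + block], vs[start:start + block]):
            s = q * k
            m_new = max(m, s)
            scale = math.exp(m - m_new)  # 0.0 on the very first score
            l = l * scale + math.exp(s - m_new)
            acc = acc * scale + math.exp(s - m_new) * v
            m = m_new
    return acc / l
```

The result is bit-for-bit the same softmax-weighted average as the naive computation; only the memory-access pattern changes.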
MLM: Masked Language Modeling—a training objective where the model predicts randomly masked tokens from their surrounding context
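The data side of the objective can be sketched as follows — a simplified version that always substitutes `[MASK]` (the full BERT recipe also sometimes keeps the token or swaps in a random one; masking rate and seed here are illustrative):

```python
import random

def mask_tokens(tokens, mask_token="[MASK]", p=0.15, seed=0):
    """Randomly replace a fraction p of tokens with the mask token.
    Labels hold the original token at masked positions and None
    elsewhere (positions ignored by the training loss)."""
    rng = random.Random(seed)
    masked, labels = [], []
    for t in tokens:
        if rng.random() < p:
            masked.append(mask_token)
            labels.append(t)      # model must predict this token
        else:
            masked.append(t)
            labels.append(None)   # not scored
    return masked, labels
```

The model sees `masked` and is trained to reproduce the originals only where `labels` is set.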
BPE: Byte Pair Encoding—a tokenization method that builds a subword vocabulary by iteratively merging the most frequent adjacent pairs of symbols in a corpus
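The merge loop at the heart of BPE fits in a short function — a toy version that starts from characters and ignores word frequencies and byte-level details that real tokenizers handle:

```python
from collections import Counter

def bpe_merge(words, num_merges):
    """Toy BPE training: repeatedly find the most frequent adjacent
    symbol pair across all words and merge it into one new symbol.
    Returns the segmented words and the learned merge list."""
    seqs = [list(w) for w in words]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for s in seqs:
            for a, b in zip(s, s[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merges.append(a + b)
        for s in seqs:
            i = 0
            while i < len(s) - 1:
                if s[i] == a and s[i + 1] == b:
                    s[i:i + 2] = [a + b]  # collapse the pair in place
                else:
                    i += 1
    return seqs, merges
```

At tokenization time, the learned merges are replayed in order on new text, so frequent words collapse into single tokens while rare words stay split into subwords.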
Deep & Narrow: A design choice preferring more layers with smaller hidden dimensions over fewer, wider layers, which can improve performance-per-parameter
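The trade-off can be made concrete with back-of-the-envelope parameter counts — a rough formula that counts only attention and feed-forward weights per block (embeddings, biases, and norms ignored; the example shapes are illustrative):

```python
def approx_params(num_layers, d_model, ffn_mult=4):
    """Rough per-layer transformer parameter count: 4*d^2 for the
    Q, K, V, and output projections plus 2*ffn_mult*d^2 for the
    two feed-forward matrices, times the number of layers."""
    per_layer = 4 * d_model ** 2 + 2 * ffn_mult * d_model ** 2
    return num_layers * per_layer

deep_narrow  = approx_params(22, 768)   # deep & narrow shape (assumed example)
shallow_wide = approx_params(12, 1024)  # shallower, wider alternative
```

The two configurations land in the same parameter ballpark, so the choice between them is about how those parameters are spent, not how many there are.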