SLM: Small Language Model—defined in this paper as general-purpose language models with 1 to 8 billion parameters
LLM: Large Language Model—typically transformer-based models with >10 billion parameters
SFT: Supervised Fine-Tuning—training a model on labeled datasets to adapt it to specific instructions or tasks
RLHF: Reinforcement Learning from Human Feedback—aligning a model's outputs with human preferences using reward models
SSM: State Space Model—a sequence modeling architecture (like Mamba) whose computational cost scales linearly with sequence length, serving as an alternative to attention mechanisms
MoE: Mixture of Experts—an architecture where only a subset of parameters (experts) are activated for each token, improving efficiency
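A minimal sketch of top-k MoE routing (illustrative names and shapes, not any specific model's implementation): a router scores all experts per token, only the top-k expert MLPs run, and their outputs are combined with softmax weights.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, top_k = 8, 4, 2

router_w = rng.normal(size=(n_experts, d))          # router projection
expert_w = rng.normal(size=(n_experts, d, d)) * 0.1 # one weight matrix per expert

def moe_forward(x):
    logits = router_w @ x                     # (n_experts,) router scores
    top = np.argsort(logits)[-top_k:]         # indices of the k best experts
    w = np.exp(logits[top] - logits[top].max())
    w /= w.sum()                              # softmax over the selected experts
    # Only the selected experts compute; the rest stay idle (the efficiency win).
    return sum(wi * (expert_w[e] @ x) for wi, e in zip(w, top))

y = moe_forward(rng.normal(size=d))           # output has the same shape as the input
```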
RoPE: Rotary Positional Embedding—a method for encoding positional information in transformers by rotating query and key vectors by position-dependent angles, so that attention scores depend on relative position
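A sketch of RoPE for a single vector (assuming even head dimension; the frequency schedule follows the standard base-10000 formulation): each 2D pair of coordinates is rotated by an angle proportional to the token position, which makes query-key dot products a function of relative offset only.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    # x: (d,) with even d; rotate each coordinate pair (x[2i], x[2i+1])
    # by angle pos * theta_i, with per-pair frequency theta_i.
    d = x.shape[0]
    theta = base ** (-np.arange(0, d, 2) / d)
    ang = pos * theta
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out
```

The key property: the dot product of two rotated vectors depends only on the difference of their positions, so attention sees relative position without any learned position embeddings.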
GQA: Grouped-Query Attention—an efficiency technique in which query heads are grouped to share a smaller set of key-value heads, shrinking the key-value cache and reducing memory usage
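A toy GQA forward pass (shapes and names are illustrative): 8 query heads share 2 key-value heads, so each group of 4 query heads attends against the same K/V, and the KV cache is 4x smaller than in full multi-head attention.

```python
import numpy as np

n_q_heads, n_kv_heads, d_head, seq = 8, 2, 16, 10
rng = np.random.default_rng(0)
q = rng.normal(size=(n_q_heads, seq, d_head))
k = rng.normal(size=(n_kv_heads, seq, d_head))   # only 2 KV heads are cached
v = rng.normal(size=(n_kv_heads, seq, d_head))

group = n_q_heads // n_kv_heads                  # query heads per KV head
outs = []
for h in range(n_q_heads):
    kv = h // group                              # query head h maps to KV head kv
    scores = q[h] @ k[kv].T / np.sqrt(d_head)
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)     # row-wise softmax
    outs.append(attn @ v[kv])
out = np.stack(outs)                             # (n_q_heads, seq, d_head)
```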
LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique that freezes pre-trained weights and injects trainable rank decomposition matrices
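A minimal LoRA sketch (not a library API): the frozen pre-trained weight W is augmented with a trainable low-rank update B @ A. With B zero-initialized, as in the original formulation, the adapter starts as an exact no-op.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank = 16, 16, 4

W = rng.normal(size=(d_out, d_in))         # frozen pre-trained weight
A = rng.normal(size=(rank, d_in)) * 0.01   # trainable down-projection
B = np.zeros((d_out, rank))                # trainable up-projection, zero-init

def lora_forward(x, scale=1.0):
    # Output = frozen path + low-rank adapter path; only A and B are trained.
    return W @ x + scale * (B @ (A @ x))

x = rng.normal(size=d_in)
```

Only `rank * (d_in + d_out)` parameters are trained instead of `d_in * d_out`, which is why LoRA fits on modest hardware.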
CoT: Chain-of-Thought—a prompting strategy that encourages models to generate intermediate reasoning steps
Quantization: Reducing the precision of model weights (e.g., from 16-bit to 4-bit) to decrease memory footprint and increase speed
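A sketch of symmetric per-tensor 4-bit quantization (one of the simplest schemes; production methods add per-group scales and outlier handling): weights are scaled into the signed 4-bit integer range [-8, 7], rounded, and reconstructed by multiplying back by the scale.

```python
import numpy as np

def quantize_4bit(w):
    # Symmetric quantization: map the largest magnitude to +/-7.
    scale = np.abs(w).max() / 7.0
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).normal(size=256).astype(np.float32)
q, s = quantize_4bit(w)
w_hat = dequantize(q, s)
err = np.abs(w - w_hat).max()   # rounding error is bounded by scale / 2
```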