SLM: Small Language Model—a model with relatively few parameters (e.g., 1.5B or 7B) deployed on client devices
LLM: Large Language Model—a model with many parameters (e.g., 70B) deployed on the server
Learnability Gap: The disparity between a teacher model's knowledge complexity and a student model's capacity to absorb it
DPO: Direct Preference Optimization—a method for aligning language models to preferences using a contrastive loss without a separate reward model
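The standard DPO objective makes the contrastive structure concrete: for a prompt x with a preferred response y_w and a dispreferred response y_l, the policy π_θ is pushed to widen its log-ratio margin over a frozen reference policy π_ref (σ is the logistic function, β a temperature):

```latex
\mathcal{L}_{\mathrm{DPO}} =
-\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}
\left[\log \sigma\!\left(
\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
-\beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
\right)\right]
```

Because the reward is expressed implicitly through these log-ratios, no separately trained reward model is needed.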
UCB: Upper Confidence Bound—an algorithm used in multi-armed bandit problems to balance exploration and exploitation
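A minimal sketch of UCB arm selection (the classic UCB1 variant; the function name and constant `c` are illustrative, not from the source): each arm's score is its empirical mean reward plus a confidence bonus that shrinks as the arm is pulled more often, so rarely tried arms still get explored.

```python
import math

def ucb_select(counts, rewards, c=2.0):
    """Return the index of the arm with the highest UCB1 score.

    counts[i]  -- number of times arm i has been pulled
    rewards[i] -- total reward accumulated by arm i
    """
    # Pull every arm at least once before applying the formula
    # (avoids division by zero and seeds the estimates).
    for i, n in enumerate(counts):
        if n == 0:
            return i
    total = sum(counts)
    # Score = empirical mean + exploration bonus.
    scores = [
        rewards[i] / counts[i] + math.sqrt(c * math.log(total) / counts[i])
        for i in range(len(counts))
    ]
    return max(range(len(counts)), key=lambda i: scores[i])
```

With `counts=[100, 1]` the second arm's bonus dominates even if its mean is lower, which is the exploration half of the trade-off; as its count grows, the bonus decays and exploitation takes over.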
ExploitNet: A neural network component of the filter that predicts sample rewards based on historical data
ExploreNet: A neural network component of the filter that estimates uncertainty to encourage exploration of new samples
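The two components above combine naturally in a UCB-style selection rule. The following is a hypothetical sketch, not the paper's implementation: the "networks" are stand-in linear scorers, and the names `exploit_score`, `explore_bonus`, `select_sample`, and the weight `alpha` are all illustrative assumptions.

```python
def exploit_score(features, w_exploit):
    # Stand-in for ExploitNet: predicted reward from historical data.
    return sum(f * w for f, w in zip(features, w_exploit))

def explore_bonus(features, w_explore):
    # Stand-in for ExploreNet: estimated uncertainty, kept non-negative.
    return abs(sum(f * w for f, w in zip(features, w_explore)))

def select_sample(samples, w_exploit, w_explore, alpha=1.0):
    """Pick the sample maximizing predicted reward + alpha * uncertainty,
    mirroring the UCB exploration/exploitation trade-off."""
    def score(s):
        return exploit_score(s, w_exploit) + alpha * explore_bonus(s, w_explore)
    return max(range(len(samples)), key=lambda i: score(samples[i]))
```

Raising `alpha` shifts selection toward uncertain (under-explored) samples; lowering it favors samples the filter already predicts to be rewarding.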
CoT: Chain-of-Thought—a prompting technique that encourages models to generate intermediate reasoning steps
SFT: Supervised Fine-Tuning—training a model on labeled input–output examples