CoT: Chain-of-Thought—a prompting technique where the model generates intermediate reasoning steps before the final answer
LLM-as-judge: Using a Large Language Model to evaluate the quality or correctness of outputs from another model
Gumbel-Softmax: A method to approximate sampling from a categorical distribution in a differentiable way, allowing backpropagation through discrete choices
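A minimal sketch of the Gumbel-Softmax relaxation in numpy (function name and logits are illustrative, not from the source): Gumbel(0, 1) noise is added to the logits, and a temperature-scaled softmax turns the perturbed logits into a "soft" sample that approaches a one-hot vector as the temperature goes to zero.

```python
import numpy as np

def gumbel_softmax(logits, tau=1.0, rng=None):
    """Differentiable 'soft' sample from the categorical distribution
    defined by `logits` (Gumbel-Softmax / Concrete relaxation)."""
    if rng is None:
        rng = np.random.default_rng()
    # Gumbel(0, 1) noise: g = -log(-log(U)), U ~ Uniform(0, 1).
    u = rng.uniform(low=1e-12, high=1.0, size=np.shape(logits))
    g = -np.log(-np.log(u))
    # Perturb the logits, then apply a temperature-scaled softmax.
    y = (np.asarray(logits) + g) / tau
    y = np.exp(y - y.max())
    return y / y.sum()

logits = np.array([2.0, 0.5, 0.1])     # hypothetical routing logits
sample = gumbel_softmax(logits, tau=0.5, rng=np.random.default_rng(0))
# `sample` is a valid probability vector; lowering `tau` pushes it
# toward one-hot, so the discrete choice stays differentiable.
```

Because the softmax is smooth in the logits, gradients flow through `sample` during backpropagation, which is exactly what a hard `argmax` would prevent.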
QLoRA: Quantized Low-Rank Adaptation—a memory-efficient fine-tuning technique for large language models
SFT: Supervised Fine-Tuning—training a model on labeled examples
ToT: Tree-of-Thought—a prompting method that explores multiple reasoning paths as a tree, evaluating and backtracking among branches before committing to an answer
RoBERTa: A robustly optimized BERT pretraining approach, used here as an encoder for routing
SKD: Symbolic Knowledge Distillation—a baseline method training on teacher-generated CoTs
Entropy regularization: Adding a term to the loss function that encourages the probability distribution (here, routing decisions) to be more spread out, preventing collapse to a single option
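A minimal sketch of the entropy term (names and coefficient are illustrative, not from the source): the Shannon entropy of the routing distribution is subtracted from the task loss, so spread-out distributions are rewarded and collapse onto a single option is penalized.

```python
import numpy as np

def entropy(probs, eps=1e-12):
    """Shannon entropy of a probability vector (e.g. routing weights)."""
    p = np.clip(probs, eps, 1.0)
    return -np.sum(p * np.log(p))

uniform = np.ones(4) / 4                      # maximally spread routing
peaked = np.array([0.97, 0.01, 0.01, 0.01])   # near-collapsed routing

# Regularized objective: loss = task_loss - coeff * entropy(routing).
# Uniform routing has entropy log(4); the peaked distribution's entropy
# is much lower, so it incurs a larger regularized loss.
h_uniform = entropy(uniform)
h_peaked = entropy(peaked)
```

The coefficient on the entropy term trades off task performance against routing diversity; too large a value forces near-uniform routing, too small a value lets the router collapse anyway.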