SCM: Structural Causal Model—a mathematical framework used here to generate synthetic datasets with known causal relationships between variables
ICL: In-Context Learning—the ability of a model to learn a task from examples provided in the prompt without updating its weights
LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique that freezes the pretrained weights and trains only small low-rank matrices injected into selected layers
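The core of LoRA can be illustrated in a few lines of NumPy. This is a minimal sketch of the idea only (the frozen weight plus a rank-r update B·A, with B initialized to zero so training starts from the pretrained behavior); the dimensions and initialization scale here are illustrative, not those of any particular model.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2  # hidden dimension and LoRA rank (illustrative values)

W = rng.normal(size=(d, d))          # frozen pretrained weight
A = rng.normal(size=(r, d)) * 0.01   # trainable down-projection
B = np.zeros((d, r))                 # trainable up-projection, zero-initialized

def forward(x):
    # Effective weight is the frozen W plus the low-rank update B @ A;
    # only A and B (2*d*r parameters instead of d*d) are trained.
    return x @ (W + B @ A).T

x = rng.normal(size=(d,))
y = forward(x)
# With B zero-initialized, the LoRA branch is a no-op before training,
# so the output matches the frozen model exactly.
```

Because B starts at zero, the adapted model is identical to the base model at step 0, which is what makes LoRA training stable.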
Random Forest: An ensemble learning method that operates by constructing a multitude of decision trees; used here as a teacher model during warm-up
CoT: Chain-of-Thought—a prompting strategy that encourages the model to generate intermediate reasoning steps
MMLU: Massive Multitask Language Understanding—a benchmark measuring general world knowledge and reasoning capabilities
cl100k_base: The tokenizer vocabulary used by OpenAI models such as GPT-3.5 and GPT-4 (via tiktoken); LLaMA-3's tokenizer builds on this vocabulary
z-norm: Z-score normalization—scaling data so it has a mean of 0 and standard deviation of 1
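Concretely, z-score normalization subtracts the mean and divides by the standard deviation; a one-function sketch:

```python
import numpy as np

def z_norm(x):
    # Scale x to mean 0 and standard deviation 1.
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

vals = z_norm([2.0, 4.0, 6.0])
```

Note this uses the population standard deviation (NumPy's default, ddof=0); some libraries default to the sample standard deviation instead, which changes the scale slightly for small inputs.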
self-consistency: An inference strategy where the model generates multiple reasoning paths or predictions (here via shuffled demonstrations) and takes a majority vote
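The shuffled-demonstration voting scheme can be sketched as follows. The `predict` callable here is a hypothetical stand-in for a model call (the real setting would prompt an LLM with the reordered demonstrations); only the shuffle-and-vote logic is what the entry describes.

```python
import random
from collections import Counter

def self_consistency(predict, demos, query, n_runs=5, seed=0):
    # Reorder the in-context demonstrations n_runs times, collect one
    # prediction per ordering, and return the majority-vote answer.
    rng = random.Random(seed)
    votes = []
    for _ in range(n_runs):
        order = demos[:]
        rng.shuffle(order)
        votes.append(predict(order, query))
    return Counter(votes).most_common(1)[0][0]

# Toy stand-in predictor: ignores the demos and thresholds the query,
# so every shuffled run votes the same way.
demos = [(1, 1), (2, 1), (-1, -1)]
pred = self_consistency(lambda d, x: 1 if x > 0 else -1, demos, query=3)
```

With a real model, different demonstration orderings can flip individual predictions, and the majority vote smooths out that order sensitivity.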
SFT: Supervised Fine-Tuning—training a model on labeled data; here applied as continued pretraining on synthetic tasks
many-shot: A setting where the model is provided with a large number of examples (e.g., hundreds or thousands) in the context window