self-edit: Natural language instructions or synthetic data generated by the model to update its own weights
SFT: Supervised Fine-Tuning—updating model weights by minimizing loss on labeled examples
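A minimal sketch of the SFT idea—one gradient step on a labeled example. For illustration it uses a one-parameter model `y = w * x` with squared loss as a stand-in for cross-entropy; all numbers and names are invented.

```python
# SFT sketch: one gradient-descent step on a single labeled example (x, y)
# for the toy model y = w * x, minimizing (w*x - y)^2.
def sft_step(w, x, y, lr=0.1):
    pred = w * x
    grad = 2 * (pred - y) * x   # d/dw of (w*x - y)^2
    return w - lr * grad        # weight update from the labeled example

w = sft_step(w=0.0, x=1.0, y=1.0)   # grad = -2, so w moves to 0.2
```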
LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique that updates only a small subset of parameters (adapters)
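The low-rank update in LoRA can be sketched in a few lines: the effective weight is the frozen matrix plus a product of two small adapter matrices, and only the adapters are trained. This toy example uses plain Python lists and made-up values.

```python
# LoRA sketch: W_eff = W + B @ A, where B (d x r) and A (r x k) are
# low-rank adapters with r << min(d, k); W stays frozen.
def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

W = [[1.0, 0.0], [0.0, 1.0]]   # frozen 2x2 base weight
B = [[0.5], [0.0]]             # 2x1 adapter (trained)
A = [[1.0, 1.0]]               # 1x2 adapter (trained)
delta = matmul(B, A)           # rank-1 update
W_eff = [[w + d for w, d in zip(wr, dr)] for wr, dr in zip(W, delta)]
# W_eff == [[1.5, 0.5], [0.0, 1.0]]
```

Because only A and B are updated, the trainable parameter count scales with the rank r rather than with the full weight matrix.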
ReSTEM: Reinforced Self-Training with Expectation-Maximization—an RL method that samples candidate outputs, filters them by reward (rejection sampling), and fine-tunes on the successful ones
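The filtering step at the heart of ReSTEM can be sketched as rejection sampling followed by fine-tuning on the survivors. The reward function and samples below are invented placeholders; the fine-tuning itself is elided.

```python
# ReSTEM-style E-step sketch: sample candidates, keep only those whose
# reward clears a threshold; the M-step would then fine-tune on `kept`.
def restem_filter(samples, reward_fn, threshold=1.0):
    return [s for s in samples if reward_fn(s) >= threshold]

samples = ["a", "bb", "ccc"]                      # hypothetical generations
kept = restem_filter(samples,
                     reward_fn=lambda s: float(len(s) >= 2))
# kept == ["bb", "ccc"]
```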
TTT: Test-Time Training—temporarily updating model weights on the specific input instance before making a prediction
SQuAD: Stanford Question Answering Dataset—a reading comprehension benchmark used here for knowledge incorporation
ARC: Abstraction and Reasoning Corpus—a benchmark for measuring abstract reasoning and generalization in AI
ICL: In-Context Learning—providing examples in the prompt to guide the model without updating weights
GRPO: Group Relative Policy Optimization—an RL algorithm that normalizes rewards within a group of samples
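The group-relative normalization in GRPO amounts to standardizing each reward against the mean and standard deviation of its own sample group. A minimal sketch with invented rewards:

```python
# GRPO-style advantages: normalize rewards within one group of samples
# drawn for the same prompt (no learned value function needed).
def group_advantages(rewards, eps=1e-8):
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

advs = group_advantages([1.0, 0.0, 1.0, 0.0])
# successes get positive advantage, failures negative (approx. +1 / -1)
```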
PPO: Proximal Policy Optimization—an RL algorithm that constrains policy updates to ensure stability
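The constraint PPO places on policy updates is its clipped surrogate objective: the probability ratio between new and old policies is clipped to a small interval so a single update cannot move the policy too far. A per-action sketch with invented numbers:

```python
# PPO clipped surrogate term for one action: clip the new/old probability
# ratio to [1 - eps, 1 + eps] and take the pessimistic (min) value.
def ppo_term(ratio, advantage, eps=0.2):
    clipped = max(1 - eps, min(1 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)

# A large ratio with positive advantage is capped near (1 + eps):
val = ppo_term(ratio=2.0, advantage=1.0)   # capped near 1.2, not 2.0
```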