Fast Thinking: A situation in which the agent generates the correct action directly, without internal monologue or external help
Slow Thinking: A situation in which the agent's initial prediction is wrong but it can correct itself through self-reflection
Knowledgeable Thinking: A situation in which the agent still fails after reflection and needs external knowledge to proceed
RPO: Relative Preference Optimization, a loss function that combines DPO with a negative log-likelihood (NLL) term to stabilize training in narrow action spaces
DPO: Direct Preference Optimization, an algorithm that aligns language models with preference data without a separate reward model
SFT: Supervised Fine-Tuning, training a model on labeled examples
POMDP: Partially Observable Markov Decision Process, a mathematical framework for modeling decision-making where the agent cannot observe the full state of the environment
ReAct: Reasoning and Acting, a paradigm in which agents interleave reasoning traces with actions, generating a reasoning step before each action they execute
ExpeL: Experiential Learning, a baseline method in which agents learn from their past experiences (trajectories)
Pattern Collapse: A failure mode where a model blindly follows learned sequences (patterns) rather than reasoning about the specific context
Scaling Law: The observation that model performance typically improves as model size, data size, or compute increases
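
As a rough illustration of the RPO and DPO entries above, the following minimal Python sketch adds a length-normalized NLL term on the chosen response to the standard DPO loss. The glossary only states that RPO combines DPO with an NLL term; the weight `alpha`, the length normalization, the default `beta`, and all function and parameter names are assumptions made for this example, not the source's implementation. Inputs are summed sequence log-probabilities under the policy and a frozen reference model.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(pol_w: float, pol_l: float, ref_w: float, ref_l: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one (chosen, rejected) pair.

    pol_* / ref_* are summed log-probabilities of the chosen (w) and
    rejected (l) responses under the policy and the reference model.
    """
    margin = beta * ((pol_w - ref_w) - (pol_l - ref_l))
    return -log_sigmoid(margin)


def rpo_loss(pol_w: float, pol_l: float, ref_w: float, ref_l: float,
             chosen_len: int, beta: float = 0.1, alpha: float = 1.0) -> float:
    """DPO loss plus an NLL term on the chosen response.

    Assumption: the NLL term is divided by the chosen response's token
    count and weighted by `alpha`; the glossary does not specify either.
    """
    nll = -pol_w / chosen_len
    return dpo_loss(pol_w, pol_l, ref_w, ref_l, beta) + alpha * nll


if __name__ == "__main__":
    # Policy identical to the reference: the DPO term is log(2).
    print(dpo_loss(-4.0, -6.0, -4.0, -6.0))
    print(rpo_loss(-4.0, -6.0, -4.0, -6.0, chosen_len=4))
```

In this sketch, setting `alpha=0` recovers plain DPO, while a positive `alpha` keeps pushing up the likelihood of the chosen response even when the preference margin is already satisfied.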