SFT: Supervised Fine-Tuning—training a model on labeled examples of inputs and desired outputs
DPO: Direct Preference Optimization—a method to align language models to preferences by optimizing directly on ranked pairs of outputs without a separate reward model
CoT: Chain-of-Thought—a prompting technique where models generate intermediate reasoning steps before the final answer
HR@K: Hit Rate at K—the proportion of test cases where the target item appears in the top K recommendations
MDP: Markov Decision Process—a mathematical framework for modeling decision-making where outcomes are partly random and partly under the control of a decision maker
Tool Agent Library (TAL): A standardized set of reusable, named functions (tools) mined from expert reasoning traces that the agent can call
Cold-start: A scenario where the system has little to no prior data about a user or item
ReAct: Reason+Act—a paradigm where LLMs interleave reasoning traces with executable actions (tool calls)
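
As a worked illustration of the HR@K entry above, here is a minimal Python sketch. It assumes each test case has exactly one target item and a ranked list of recommendations, best first. The function name `hit_rate_at_k` and the sample data are hypothetical and do not come from the source.

```python
from typing import Hashable, Sequence


def hit_rate_at_k(
    ranked_lists: Sequence[Sequence[Hashable]],
    targets: Sequence[Hashable],
    k: int,
) -> float:
    """Fraction of test cases whose target appears in the top-k ranked items."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    if len(ranked_lists) != len(targets):
        raise ValueError("need exactly one target per ranked list")
    if not targets:
        return 0.0  # assumption: an empty test set is scored as 0.0
    # A test case counts as a hit when its target is among the first k items.
    hits = sum(
        1 for ranked, target in zip(ranked_lists, targets) if target in ranked[:k]
    )
    return hits / len(targets)


# Hypothetical data: four test cases, each with three ranked recommendations.
recommendations = [
    ["a", "b", "c"],
    ["d", "e", "f"],
    ["g", "h", "i"],
    ["j", "k", "l"],
]
targets = ["b", "f", "x", "j"]

hr_at_1 = hit_rate_at_k(recommendations, targets, 1)  # 0.25: only "j" is ranked first
hr_at_2 = hit_rate_at_k(recommendations, targets, 2)  # 0.5: "b" and "j" are in the top 2
hr_at_3 = hit_rate_at_k(recommendations, targets, 3)  # 0.75: "x" is never recommended
```

HR@K can only stay the same or rise as K grows, since a larger cutoff never removes a hit.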