GRPO: Group Relative Policy Optimization—an RL algorithm that estimates advantages by comparing a group of outputs generated from the same input, avoiding the need for a separate value network
ReAct: Reason+Act—a paradigm where agents generate reasoning traces and task-specific actions in an interleaved manner
RAG: Retrieval-Augmented Generation—enhancing model responses by retrieving relevant documents from an external source
F1 score: A metric measuring the overlap (typically at the token level) between the predicted answer and the ground truth, computed as the harmonic mean of precision and recall
EM: Exact Match—a strict metric requiring the predicted answer to be identical to the ground truth
Conditional Probability: The probability of an event occurring given that another event has occurred; here, the probability of the correct answer given the memory content
Credit Assignment: The problem in RL of determining which past actions are responsible for a reward that is observed later
Trajectory: The sequence of states and actions taken by an agent from the start of a task to its completion
OOD: Out-of-Distribution—data that is different from what the model saw during training
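
To make three of the terms above concrete (EM, F1 score, and the group-relative advantage in GRPO), here is a minimal Python sketch. It is an illustration under stated assumptions, not the implementation used in any particular paper. The normalization is a simplified stand-in for the usual QA answer normalization, the advantage uses the population standard deviation with a small epsilon, and all function names are hypothetical.

```python
import re
import statistics
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation, split on whitespace (assumed, simplified normalization)."""
    return re.sub(r"[^a-z0-9\s]", "", text.lower()).split()


def exact_match(prediction, truth):
    """EM: 1.0 only if the normalized prediction equals the normalized ground truth."""
    return float(normalize(prediction) == normalize(truth))


def f1_score(prediction, truth):
    """Token-level F1: harmonic mean of precision and recall over shared tokens."""
    pred, gold = normalize(prediction), normalize(truth)
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: each reward standardized against its own group.

    `rewards` holds one scalar reward per output sampled from the same input.
    Assumption: population standard deviation plus `eps`; implementations differ here.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

For example, f1_score("the city of Paris", "Paris") gives precision 1/4 and recall 1, so F1 is 0.4, while exact_match on the same pair is 0.0. In group_relative_advantages, outputs that score above the group mean receive positive advantages and those below receive negative ones, which is how the group itself replaces a learned value network as the baseline.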