Meta-RL: Meta-Reinforcement Learning—learning a policy that can adapt to new tasks or correct itself rapidly by leveraging interaction history within the context
RLOO: REINFORCE Leave-One-Out—a policy-gradient estimator that reduces variance by using, for each sample, the average reward of the other samples drawn for the same input as its baseline
ReAct: Reasoning + Acting—a prompting paradigm where LLMs interleave reasoning traces with tool actions
Sparse rewards: Feedback signals provided only at the very end of a task (e.g., correct/incorrect), lacking intermediate guidance
PPO: Proximal Policy Optimization—an RL algorithm that constrains the size of each policy update (typically by clipping the probability ratio between the new and old policy) to prevent instability
Meta-episode: A sequence of standard episodes (interaction trajectories) where each subsequent episode is conditioned on the history and reflections of the previous ones
GRPO: Group Relative Policy Optimization—an RL algorithm that normalizes advantages within a group of sampled outputs for the same input
Self-Reflection: The process where an agent analyzes its previous output to identify errors before attempting the task again
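
Worked example (RLOO vs. GRPO advantages): the two baseline-style definitions above differ only in how a sample's reward is compared with the rest of its group. The following is a minimal sketch, not a reference implementation. The function names, the use of the population standard deviation, and the `eps` stabilizer are assumptions made for illustration; actual implementations may differ in these details.

```python
from statistics import mean, pstdev


def rloo_advantages(rewards):
    """Leave-one-out advantages: reward minus the mean reward of the OTHER samples."""
    n = len(rewards)
    if n < 2:
        raise ValueError("a leave-one-out baseline needs at least two samples")
    total = sum(rewards)
    # (total - r) / (n - 1) is the average reward of all samples except this one.
    return [r - (total - r) / (n - 1) for r in rewards]


def grpo_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (reward - group mean) / group std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; eps avoids division by zero
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled outputs for one input, scored with a sparse 0/1 reward.
rewards = [1.0, 0.0, 0.0, 1.0]
print(rloo_advantages(rewards))
print(grpo_advantages(rewards))
```

In both cases, outputs that beat the rest of their group receive positive advantages and the others negative ones. RLOO keeps the advantage on the scale of the raw reward, while the group normalization rescales it by the spread of rewards within the group.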