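GRPO: Group Relative Policy Optimization, an RL algorithm that computes each trajectory's advantage by normalizing its reward against the mean and standard deviation of a group of trajectories sampled for the same input, which stabilizes training without a learned value function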
Intrinsic Feedback: Reward signals generated internally by the agent (e.g., curiosity, self-assessment) rather than provided by the environment
Hindsight Reflection: The process of analyzing a completed trajectory to derive lessons or evaluate performance after the fact
UCB: Upper Confidence Bound—an algorithm used to balance exploration (trying new things) and exploitation (using known good things) by adding an uncertainty bonus to the estimated value
SimUtil-UCB: Similarity & Utility-Aware Upper Confidence Bound—the paper's proposed retrieval strategy balancing semantic relevance, historical usefulness, and exploration
REINFORCE: A fundamental policy gradient algorithm in reinforcement learning that updates policy parameters along the gradient of the log-probability of the sampled actions, scaled by the return
Extrinsic Reward: The standard reward signal provided by the environment (e.g., +1 for success, 0 for failure)
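
To make two of the definitions above concrete, here is a minimal sketch of group-relative advantage normalization (as in GRPO) and of the classic UCB1 score. These are the standard textbook forms, not the paper's implementation; SimUtil-UCB itself is not reproduced here. The function names, the `eps` term, the use of the population standard deviation, and the default exploration constant `c` are all assumptions made for illustration.

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps).

    Assumption: population standard deviation; eps avoids division by zero
    when every trajectory in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def ucb1_score(mean_value, pulls, total_pulls, c=math.sqrt(2)):
    """UCB1: estimated value plus an uncertainty bonus that shrinks with pulls.

    Untried options get an infinite score so each is tried at least once.
    """
    if pulls == 0:
        return math.inf
    return mean_value + c * math.sqrt(math.log(total_pulls) / pulls)


if __name__ == "__main__":
    # Extrinsic rewards for four trajectories sampled from the same input.
    print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
    # Same estimated value, but the less-tried option gets the larger bonus.
    print(ucb1_score(0.5, pulls=1, total_pulls=10))
    print(ucb1_score(0.5, pulls=5, total_pulls=10))
```

In the first function, successful trajectories receive positive advantages and failed ones negative, relative to their own group. In the second, the bonus term is what drives exploration: an option with few trials can outrank one with a higher estimated value.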