Pareto frontier: The set of optimal solutions where no objective can be improved without degrading another (here, task score vs. user effort)
MOO: Multi-Objective Optimization, i.e., optimizing for multiple conflicting goals simultaneously
Contextual MDP: A Markov Decision Process where transition and reward functions depend on a hidden context (e.g., user intent)
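GRPO: Group Relative Policy Optimization, an RL algorithm that computes each sampled output's advantage by normalizing its reward against the other outputs sampled for the same prompt, which avoids a learned value function and stabilizes training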
SFT: Supervised Fine-Tuning, i.e., training a model on labeled examples before applying reinforcement learning
warm start: Initializing the policy with SFT before RL training so that it starts with basic competency and the desired behavioral patterns
UserRL: A benchmark suite for evaluating agent-user interaction, focusing on feedback and policy adaptation
retrospective reasoning: Looking back at interaction history to refine hypotheses and manage memory
prospective planning: Looking forward to schedule actions based on remaining budget and uncertainty
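
Two of the entries above, the Pareto frontier and GRPO's group-relative advantage, are easier to see in a few lines of code. The sketch below is illustrative only and is not taken from the source. It assumes each candidate is a hypothetical (task_score, user_effort) pair in which a higher score and a lower effort are better, and that advantages are obtained by subtracting the group mean and dividing by the group's population standard deviation. Implementations differ on details such as the epsilon term and the choice of standard deviation. The function names and sample numbers are invented for the example.

```python
from statistics import mean, pstdev


def pareto_front(points):
    """Return the points that no other point dominates.

    Each point is a (task_score, user_effort) pair: higher score is
    better, lower effort is better (an assumption of this sketch).
    """
    def dominates(a, b):
        # a dominates b if it is at least as good on both objectives
        # and strictly better on at least one of them.
        no_worse = a[0] >= b[0] and a[1] <= b[1]
        strictly_better = a[0] > b[0] or a[1] < b[1]
        return no_worse and strictly_better

    return [p for p in points if not any(dominates(q, p) for q in points)]


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of sampled outputs.

    Each advantage is (reward - group mean) / (group std + eps).
    The eps term only guards against a zero standard deviation.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical candidates: (task_score, user_effort).
candidates = [(0.9, 5), (0.8, 3), (0.7, 4), (0.6, 1), (0.9, 6)]
front = pareto_front(candidates)

# Hypothetical rewards for four outputs sampled from the same prompt.
advantages = group_relative_advantages([1.0, 0.0, 0.5, 0.5])
```

In this example (0.7, 4) drops out because (0.8, 3) scores higher with less effort, and (0.9, 6) drops out because (0.9, 5) matches its score with less effort. The advantages are centered on zero, so outputs that beat the group average are reinforced and those below it are penalized.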