TARL: Turn-level Adjudicated Reinforcement Learning—the proposed method using an LLM judge to score individual conversation turns
MCP: Model Context Protocol—a standard for connecting AI assistants to systems and data (used here for tool integration)
ReACT: Reasoning and Acting—a paradigm where agents generate reasoning traces before executing actions
PPO: Proximal Policy Optimization—an RL algorithm that limits how much a policy can change in one step to ensure stability
GRPO: Group Relative Policy Optimization—an RL algorithm that normalizes rewards within a group of sampled trajectories to reduce variance
RLOO: REINFORCE Leave-One-Out—an RL algorithm using peer samples as a baseline to reduce gradient variance
CoT: Chain-of-Thought—a prompting technique encouraging models to show step-by-step reasoning
GAE: Generalized Advantage Estimation—a method to estimate the 'advantage' of an action (how much better it is than average) by balancing bias and variance
SFT: Supervised Fine-Tuning—training on labeled data before RL
SeedTTS: A speech generation model used here to convert simulated user text responses into audio
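
To make the GRPO, RLOO, and GAE entries above concrete, the following is a minimal sketch of how each one turns raw rewards into advantages. It is illustrative only and not the implementation used in this work. The function names, the use of the population standard deviation in the GRPO variant, the `eps` constant, and the default `gamma` and `lam` values are all assumptions.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: standardize rewards within one group
    of trajectories sampled for the same prompt.
    Assumption: population std and a small eps for numerical safety."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def rloo_advantages(rewards):
    """Leave-one-out advantages: each sample's baseline is the mean
    reward of the other samples in the group (needs at least 2 samples)."""
    n = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]


def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation over one trajectory.
    `values` must hold len(rewards) + 1 entries; the last one is the
    bootstrap value of the state after the final step."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error at time t.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of current and future TD errors.
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```

In this sketch `lam` controls the bias-variance trade-off mentioned in the GAE entry: `lam=0` reduces to the one-step TD error (lower variance, more bias), while `lam=1` sums all future TD errors (less bias, higher variance). The GRPO and RLOO variants need no value function, since the baseline comes from the other samples in the group.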