MDP: Markov Decision Process—a mathematical framework for modeling decision-making where outcomes are partly random and partly under the control of a decision maker
Semantic Variable: A key variable in a program (such as an LLM input or output) that represents critical intent or state, as opposed to auxiliary variables such as loop counters
AIR: Automatic Intermediate Rewarding, a mechanism that assigns partial rewards to intermediate steps (e.g., a successful API call) based on system signals
Credit Assignment: The problem of determining which past actions contributed to a final reward; handled here by decomposing trajectories into individual transitions
Observability Framework: Tools (like OpenTelemetry) used to monitor software performance; here repurposed to collect training data from agent execution traces
RAG: Retrieval-Augmented Generation, a technique in which a model fetches external data to ground its answers; used here for agents that answer queries this way
PPO: Proximal Policy Optimization—an RL algorithm used here to update the policy LLM
MCP: Model Context Protocol, an open standard for connecting AI assistants to external systems and tools
POMDP: Partially Observable Markov Decision Process—an extension of MDP where the agent cannot directly observe the full underlying state
TA Disaggregation: Training-Agent Disaggregation—an architectural pattern separating the agent runtime from the model training service
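
As a rough illustration of how the Credit Assignment and AIR entries fit together, the sketch below splits one agent run into individual transitions and attaches a reward to each. The `Step` and `Transition` types, the `decompose` function, and the reward rule (a small bonus for a successful tool call, with the final reward added to the last step) are all hypothetical choices made for this example, not the scheme defined by the source.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Step:
    """One policy-LLM call inside an agent run (hypothetical structure)."""
    prompt: str                     # LLM input, treated as the state
    response: str                   # LLM output, treated as the action
    tool_ok: Optional[bool] = None  # system signal, e.g. whether a tool call succeeded


@dataclass
class Transition:
    """A single (state, action, reward) training sample."""
    state: str
    action: str
    reward: float


def decompose(steps: List[Step], final_reward: float,
              tool_bonus: float = 0.1) -> List[Transition]:
    """Turn a trajectory into per-step transitions.

    Assumed reward rule (illustrative only): a step whose tool call
    succeeded earns `tool_bonus`; the last step also receives the
    final task reward.
    """
    transitions = []
    for i, step in enumerate(steps):
        reward = tool_bonus if step.tool_ok else 0.0
        if i == len(steps) - 1:
            reward += final_reward
        transitions.append(Transition(step.prompt, step.response, reward))
    return transitions
```

Each resulting transition can then be treated as an independent training sample by an RL algorithm such as PPO, which is what lets the training service remain unaware of how the agent orchestrates its calls.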