pass^k: Pass hat k—a metric measuring consistency, defined as the probability that an agent succeeds in ALL k independent trials of the same task.
pass@k: Pass at k—a standard metric measuring the probability that an agent succeeds in AT LEAST ONE of k trials.
POMDP: Partially Observable Markov Decision Process—a mathematical framework for decision-making where the agent cannot directly see the full state of the world (here, the user's hidden intent).
User Simulator: An LLM (GPT-4) prompted to act as a human user. It holds a specific goal or instruction that is hidden from the agent and generates dynamic responses over the course of the conversation.
SOTA: State-of-the-Art—the current best performing models or methods.
Function Calling: A capability of LLMs to generate structured outputs (like JSON) that specify calls to external tools or APIs.
ReAct: Reasoning and Acting—a prompting method where the model generates a thought trace before taking an action.
Domain Policy: A textual document provided to the agent describing rules, constraints, and procedures (e.g., 'Basic economy cannot be modified').
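
To make the difference between pass^k and pass@k concrete, here is a minimal sketch of how both could be estimated for a single task from n recorded trials, of which c succeeded. The combinatorial estimators used here are an assumption (a common choice when n >= k trials are available), and the function names pass_at_k and pass_hat_k are hypothetical; the definitions above only state the underlying probabilities.

```python
from math import comb


def _check(n: int, c: int, k: int) -> None:
    # n = trials run, c = successful trials, k = subset size being scored.
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k trials succeeds).

    Computed as 1 minus the chance that a random size-k subset of the
    n trials contains only failures.
    """
    _check(n, c, k)
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate P(all k trials succeed).

    Computed as the chance that a random size-k subset of the n trials
    contains only successes.
    """
    _check(n, c, k)
    return comb(c, k) / comb(n, k)


if __name__ == "__main__":
    # Example: 4 trials on one task, 2 of them successful.
    for k in (1, 2, 4):
        print(k, pass_at_k(4, 2, k), pass_hat_k(4, 2, k))
```

For k = 1 the two estimates coincide with the plain success rate c/n. As k grows, pass@k can only rise and pass^k can only fall, which is why pass^k is the stricter measure of consistency.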