POMDP: Partially Observable Markov Decision Process—a mathematical framework for modeling decision-making where the agent cannot directly observe the full state of the environment
BFCL: Berkeley Function Calling Leaderboard, a benchmark evaluating the ability of large language models (LLMs) to invoke software functions correctly
Tau-bench: A benchmark (also written τ-bench) for evaluating tool-using agents in realistic, stateful multi-turn scenarios (e.g., airline and retail customer service)
Pass@1: A metric measuring the percentage of tasks where the model generates the correct solution on its first attempt
SFT: Supervised Fine-Tuning, which trains a pre-trained model on a smaller, labeled, task-specific dataset to adapt it to that task
Groundtruth Actions: The verifiable sequence of correct API calls and parameters required to solve a user's request
Reverse Task Recombination: A technique where complex tasks are created by combining simpler API capabilities and then generating a user query that would require those specific combinations
xLAM-2-fc-r: The family of models (ranging from 1B to 70B parameters) trained by the authors using the APIGen-MT dataset
Executability Check: Verifying that generated code or API calls actually run without errors in a simulated environment
Latent State: Information in the environment (e.g., database records) that is not immediately visible to the user or agent until accessed via tools