GEM: General-purpose Extraction of Multi-turn trajectories—the proposed pipeline for synthesizing agent data from text.
BFCL: Berkeley Function-Calling Leaderboard—a benchmark for evaluating LLM tool-use capabilities.
Tau-bench: A benchmark that evaluates agents interacting with simulated users in realistic, complex domains such as Airline and Retail.
SFT: Supervised Fine-Tuning—training a model on labeled examples to adapt it to a specific task.
SFT warm-start: An initial supervised training phase run before other optimization techniques are applied (this work relies primarily on SFT).
Trajectory Synthesizer: A specialized model trained to convert text directly into tool-use trajectories, bypassing the multi-step pipeline during inference.
UltraFineWeb: A large-scale, high-quality open web dataset used as the source corpus for text segments.
GLM-4: The Large Language Model used as the 'teacher' to generate the initial synthetic data.
Hallucination: When a model generates content (like tool parameters) not supported by the context or facts.
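
As an illustration of the Hallucination entry, here is a minimal sketch of how a hallucinated tool parameter might be flagged. It assumes a tool call's arguments arrive as a plain dictionary and treats a value as "supported" only if it appears verbatim in the context text. The function name `unsupported_arguments`, the example arguments, and the substring heuristic are all hypothetical and are not taken from the GEM pipeline.

```python
from typing import Any


def unsupported_arguments(arguments: dict[str, Any], context: str) -> list[str]:
    """Return the names of arguments whose values do not appear in the context.

    Assumption: a case-insensitive substring match stands in for "supported
    by the context". A real check would need to handle paraphrases, units,
    and values derived from earlier tool outputs.
    """
    haystack = context.casefold()
    flagged = []
    for name, value in arguments.items():
        # Compare on the string form so numbers and strings are handled alike.
        needle = str(value).casefold().strip()
        if needle and needle not in haystack:
            flagged.append(name)
    return flagged


# Hypothetical example: the user never mentions a seat class.
context = "User: Please book a flight from Boston to Denver on May 3."
call_arguments = {
    "origin": "Boston",
    "destination": "Denver",
    "seat_class": "business",
}
print(unsupported_arguments(call_arguments, context))  # ['seat_class']
```

Here `seat_class` is flagged because "business" appears nowhere in the context, which is the kind of unsupported parameter the definition describes.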