LLM-based Agents: Systems that use an LLM as the core controller to plan, maintain memory, and execute actions in an environment via tools
ReAct: Reasoning and Acting—a paradigm where agents generate reasoning traces and task-specific actions in an interleaved manner
Function Calling: The ability of an LLM to generate structured outputs (like JSON) to invoke external APIs or tools
SFT: Supervised Fine-Tuning—training models on labeled examples to improve specific behaviors
MMLU: Massive Multitask Language Understanding—a standard benchmark for general LLM knowledge, noted here as insufficient for agent evaluation
GSM8K: Grade School Math 8K—a benchmark for multi-step mathematical reasoning
BFCL: Berkeley Function Calling Leaderboard—a benchmark specifically for evaluating tool-use capabilities
Gym-like Environments: Interactive simulation platforms (inspired by OpenAI Gym) where agents take actions and receive observations/rewards, used for dynamic evaluation
Context Window: The maximum amount of text (measured in tokens) an LLM can process at once; a key constraint for agents that must maintain long-term memory
SoTA: State-of-the-Art—the current best performance achieved by any system
Trajectory: The sequence of actions, observations, and reasoning steps an agent takes to solve a problem
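
To make several of these terms concrete, the following is a minimal sketch of how ReAct, Function Calling, and Trajectory fit together in one agent loop. It is an illustration under stated assumptions, not any particular framework's API: `scripted_llm` is a hypothetical stand-in for a real model call, and the `TOOLS` registry, the `Step` record, and the `finish` action name are all invented for this example.

```python
import json
from dataclasses import dataclass


@dataclass
class Step:
    """One trajectory entry: a reasoning trace, the action taken, and what came back."""
    thought: str
    action: dict
    observation: str


# Hypothetical tool registry; a real agent would expose external APIs here.
TOOLS = {
    "add": lambda a, b: a + b,
    "multiply": lambda a, b: a * b,
}


def scripted_llm(task, trajectory):
    """Stand-in for an LLM (assumption): returns a thought plus a JSON function call."""
    if not trajectory:
        call = {"name": "add", "arguments": {"a": 2, "b": 3}}
        return "First add 2 and 3.", json.dumps(call)
    if len(trajectory) == 1:
        prev = int(trajectory[-1].observation)
        call = {"name": "multiply", "arguments": {"a": prev, "b": 4}}
        return "Now multiply the sum by 4.", json.dumps(call)
    call = {"name": "finish", "arguments": {"answer": trajectory[-1].observation}}
    return "I have the answer.", json.dumps(call)


def run_agent(task, max_steps=5):
    """ReAct-style loop: interleave reasoning, a tool call, and an observation."""
    trajectory = []
    for _ in range(max_steps):
        thought, raw_call = scripted_llm(task, trajectory)
        call = json.loads(raw_call)  # function calling: parse the structured output
        if call["name"] == "finish":
            return call["arguments"]["answer"], trajectory
        result = TOOLS[call["name"]](**call["arguments"])
        trajectory.append(Step(thought, call, str(result)))
    return None, trajectory  # step budget exhausted without a final answer


answer, trajectory = run_agent("Compute (2 + 3) * 4")
for step in trajectory:
    print(step.thought, "->", step.action["name"], "->", step.observation)
print("answer:", answer)
```

The recorded `trajectory` is what a dynamic, Gym-like evaluation could inspect in addition to the final answer, for example to check whether the right tools were called in the right order.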