xLAM: A Family of Large Action Models to Empower AI Agent Systems

📝 Paper Summary

Multi-call tool use with fixed plan · Tool-use post-training · Agentic data synthesis

xLAM is a family of open-source Large Action Models optimized for agent tasks via a unified data pipeline that synthesizes verifiable function-calling data and standardizes diverse agent trajectories.

Core Problem

Open-source agent models lag behind proprietary ones because existing agent datasets are scarce, heterogeneous in format, and often contain low-quality or hallucinated actions.

Why it matters:

Proprietary models (like GPT-4) dominate agent tasks, limiting accessibility and transparency for the open-source community
Existing open-source datasets suffer from format inconsistency, making it difficult to unify data or transfer knowledge across different agent environments
Data quality issues, such as invalid function calls and hallucinations, severely hamper the reliability of smaller open-source models in practical applications

Concrete Example: In function calling, a model might hallucinate an argument not present in the user query (e.g., inventing a date) or generate a function name that doesn't exist in the provided tool list. Existing datasets often lack the execution-based verification needed to catch these errors before training.
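A minimal sketch of the structural side of this check, assuming a hypothetical tool schema (the `name`, `parameters`, and `required` keys are illustrative and not taken from the paper). It flags a function name that is absent from the tool list and arguments the tool does not declare. It cannot tell whether an argument value such as a date was invented, so it covers only part of the problem described above.

```python
from typing import Any


def validate_call(call: dict[str, Any], tools: list[dict[str, Any]]) -> list[str]:
    """Return a list of structural problems found in a proposed function call."""
    by_name = {tool["name"]: tool for tool in tools}
    tool = by_name.get(call.get("name"))
    if tool is None:
        # The model produced a function that is not in the provided tool list.
        return [f"unknown function: {call.get('name')!r}"]

    problems: list[str] = []
    params = tool.get("parameters", {})
    args = call.get("arguments", {})
    for arg in args:
        if arg not in params:
            problems.append(f"unexpected argument: {arg!r}")
    for pname, spec in params.items():
        if spec.get("required", False) and pname not in args:
            problems.append(f"missing required argument: {pname!r}")
    return problems


# Hypothetical tool list used only for illustration.
TOOLS = [
    {
        "name": "get_weather",
        "parameters": {"city": {"required": True}, "date": {"required": False}},
    }
]

print(validate_call({"name": "get_forecast", "arguments": {}}, TOOLS))
```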

Key Novelty

Unified Data Pipeline & APIGen Synthesis for Action Models

Standardizes diverse agent datasets into a single unified format (task instruction, tools, few-shot, query, steps) to facilitate effective multi-task training
Employs APIGen, a synthesis framework that generates verifiable function-calling data by executing APIs to filter out hallucinations and invalid arguments
Releases a family of models (1B to 8x22B) specialized for actions, including 'tiny' models (1B) that outperform much larger general-purpose models on function calling

Architecture

The complete data processing, training, and evaluation pipeline for xLAM

Evaluation Highlights

xLAM-8x22b-r achieves #1 rank on Berkeley Function-Calling Leaderboard (87.31% accuracy), outperforming GPT-4-0125-Preview (85.79%)
xLAM-1b-fc-r (1B params) achieves 75.43% on Berkeley Function-Calling Leaderboard, surpassing GPT-3.5-Turbo (75.41%) and Claude-3-Haiku
xLAM-7b-r achieves highest Success Rate (0.414) on Webshop, outperforming GPT-4-0125-preview (0.375) and AgentOhana-8x7b (0.331)

Breakthrough Assessment

8/10

Strong contribution to open-source agent capabilities. The 1B model's function-calling performance is particularly impressive and supports the value of the data synthesis pipeline. The top-1 ranking on BFCL further validates the approach.

⚙️ Technical Details

Problem Definition

Setting: Autonomous agent tasks involving multi-turn interaction, reasoning, and tool use (function calling)

Inputs: User query, list of available tools (APIs), and interaction history

Outputs: Action to execute (e.g., function call with arguments) or final textual response

Pipeline Flow

Data Unification (standardizes diverse formats)
Data Augmentation (shuffling, rephrasing)
Data Synthesis (APIGen for verified function calls)
Model Training (SFT + DPO)
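The shuffling part of the augmentation step can be sketched as follows. This is an illustration under assumptions: the summary does not say exactly what is shuffled, so the sketch permutes the order of the tool list while leaving the query and target steps untouched, and the sample layout is hypothetical.

```python
import random
from typing import Any


def shuffle_tools(sample: dict[str, Any], seed: int) -> dict[str, Any]:
    """Return a copy of the sample with its tool list in a new order."""
    rng = random.Random(seed)  # seeded so the augmentation is reproducible
    tools = list(sample["tools"])  # copy, so the original sample is not mutated
    rng.shuffle(tools)
    return {**sample, "tools": tools}


# Hypothetical sample used only for illustration.
sample = {
    "query": "What is the weather in Paris?",
    "tools": [{"name": "get_weather"}, {"name": "search_web"}, {"name": "get_time"}],
    "steps": [{"tool_calls": [{"name": "get_weather", "arguments": {"city": "Paris"}}]}],
}
augmented = shuffle_tools(sample, seed=0)
```

The intent of such an augmentation is that the model should not learn to rely on where a tool appears in the prompt.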

System Modules

Data Unifier (Data Processing)

Converts diverse agent datasets into a standard JSON structure with task instruction, tools, and steps

Model or implementation: Rule-based processing
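A minimal sketch of what such a rule-based conversion might look like. The top-level fields follow the unified format named in this summary (task instruction, tools, few-shot, query, steps); the sub-fields of a step and every key of the raw record (`system`, `functions`, `question`, `trajectory`) are assumptions made for illustration.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any


@dataclass
class Step:
    # Sub-fields of a step are assumptions, not the paper's schema.
    thought: str
    tool_calls: list[dict[str, Any]]
    observation: str = ""


@dataclass
class UnifiedSample:
    task_instruction: str
    tools: list[dict[str, Any]]
    query: str
    few_shot_examples: list[dict[str, Any]] = field(default_factory=list)
    steps: list[Step] = field(default_factory=list)


def from_raw(raw: dict[str, Any]) -> UnifiedSample:
    """Map one record of a hypothetical source dataset onto the unified layout."""
    steps = [
        Step(
            thought=turn.get("thought", ""),
            tool_calls=turn.get("calls", []),
            observation=turn.get("result", ""),
        )
        for turn in raw.get("trajectory", [])
    ]
    return UnifiedSample(
        task_instruction=raw.get("system", ""),
        tools=raw.get("functions", []),
        query=raw["question"],
        steps=steps,
    )


# Hypothetical raw record used only for illustration.
raw_record = {
    "system": "You are a shopping assistant.",
    "functions": [{"name": "search", "parameters": {"keywords": {"required": True}}}],
    "question": "Find a red mug under $10.",
    "trajectory": [
        {
            "thought": "Search for the item first.",
            "calls": [{"name": "search", "arguments": {"keywords": "red mug"}}],
            "result": "3 results found",
        }
    ],
}
unified_json = json.dumps(asdict(from_raw(raw_record)), indent=2)
```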

Data Synthesizer (Data Processing)

Generates high-quality, verifiable function-calling data

Model or implementation: DeepSeek-V2-Chat / Mixtral-8x22B-Inst (for generation)
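The execution-based filtering idea can be sketched as below. This is not the APIGen implementation: the registry, the `add` function, and the sample layout are hypothetical, and the sketch only keeps generated samples whose calls all run without raising.

```python
from typing import Any, Callable


def add(a: int, b: int) -> int:
    """Toy API standing in for a real executable tool."""
    return a + b


# Hypothetical registry mapping tool names to executable Python callables.
REGISTRY: dict[str, Callable[..., Any]] = {"add": add}


def executes_cleanly(call: dict[str, Any]) -> bool:
    """True if the call names a known function and runs without an exception."""
    fn = REGISTRY.get(call.get("name"))
    if fn is None:
        return False  # function name does not exist
    try:
        fn(**call.get("arguments", {}))
    except Exception:
        return False  # wrong, missing, or ill-typed arguments
    return True


def filter_samples(samples: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Keep only samples in which every generated call executes."""
    return [s for s in samples if all(executes_cleanly(c) for c in s["calls"])]


candidates = [
    {"query": "Add 1 and 2", "calls": [{"name": "add", "arguments": {"a": 1, "b": 2}}]},
    {"query": "Add 1 and 2", "calls": [{"name": "add", "arguments": {"a": 1}}]},
    {"query": "Subtract 2 from 5", "calls": [{"name": "subtract", "arguments": {"a": 5, "b": 2}}]},
]
kept = filter_samples(candidates)
```

Only the first candidate survives here: the second omits a required argument and the third calls a function that is not registered.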

xLAM Agent Model

Predicts actions or responses based on user query and tools

Model or implementation: xLAM-8x22b-r (and other sizes)

Novel Architectural Elements

Unified function-calling style data format applied across diverse agent tasks (reasoning, web navigation, etc.), treating all interactions as potential function calls
Integration of execution-verified synthetic data (APIGen) specifically for 'tiny' models (1B) to achieve high function-calling performance

Modeling

Base Model: Varies by size: DeepSeek-Coder-1b/7b for FC models; Mistral-7b, Mixtral-8x7b, Mixtral-8x22b for general models

Training Method: Supervised Fine-Tuning (SFT) followed by Direct Preference Optimization (DPO)

Objective Functions:

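Purpose: Minimize the difference between predicted and target tokens.
Formally: Standard cross-entropy loss for SFT.

Purpose: Align the model with preferred outputs over rejected ones.
Formally: DPO loss $\mathcal{L}_{\mathrm{DPO}} = -\mathbb{E}\left[\log \sigma\left(\beta\left(\log\frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \log\frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right)\right]$, where $y_w$ is the preferred response, $y_l$ the rejected one, $\pi_{\mathrm{ref}}$ the frozen reference policy, and $\beta$ a scaling coefficient.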
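For a single preference pair, the DPO loss above can be computed as in the sketch below. The inputs are assumed to be summed token log-probabilities of each response under the policy and the reference model; the default `beta=0.1` is a placeholder, since the summary does not report the value used.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value, not reported in the summary
) -> float:
    """Per-example DPO loss: -log(sigmoid(beta * (chosen ratio - rejected ratio)))."""
    chosen_log_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_log_ratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_log_ratio - rejected_log_ratio)
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# When the policy equals the reference, the margin is zero and the loss is log(2).
baseline = dpo_loss(-1.0, -2.0, -1.0, -2.0)
```

The loss falls below log(2) once the policy raises the preferred response's likelihood relative to the reference more than it raises the rejected one's.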

Adaptation: Full fine-tuning for smaller models; LoRA for xLAM-8x22b-r and all DPO stages

Trainable Parameters: Varies (1B to 141B total params)

Training Data:

Mixture of: cleaned/augmented open-source agent datasets
60k synthetic function-calling samples (APIGen)
General instruction-tuning datasets (DialogStudio, Data Provenance)
Instruction data comprises 20-30% of training set

Key Hyperparameters:

scheduler: Cosine learning rate scheduler
warmup_steps: 100
framework: PyTorch FSDP
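A cosine schedule with 100 warmup steps can be sketched as follows. The linear shape of the warmup, the peak learning rate, and the total step count are assumptions, since the summary notes that the learning rate and batch size are not reported.

```python
import math


def lr_at(step: int, peak_lr: float, total_steps: int, warmup_steps: int = 100) -> float:
    """Linear warmup to peak_lr, then cosine decay to zero at total_steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))


# Placeholder values chosen only to illustrate the shape of the schedule.
PEAK_LR = 1e-5
TOTAL_STEPS = 1000
schedule = [lr_at(s, PEAK_LR, TOTAL_STEPS) for s in range(TOTAL_STEPS + 1)]
```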

Compute: Trained on Nvidia H100 GPUs

Comparison to Prior Work

vs. ToolLLM: xLAM uses execution-based verification for synthetic data, reducing hallucination compared to ToolLLM's consensus filtering
vs. AgentOhana: xLAM introduces a more modular unified format and scales to larger MoE models (8x22B)
vs. GPT-4 [not cited in paper]: xLAM achieves better function-calling accuracy on BFCL via specialized fine-tuning, despite being smaller/open-weight
vs. NexusRaven [not cited in paper]: xLAM targets general agent capabilities (web, reasoning) in addition to function calling, whereas NexusRaven focuses primarily on function calling

Limitations

xLAM-8x22b-r SFT uses LoRA instead of full fine-tuning due to compute constraints
Benchmark evaluation limited to 4 suites (Webshop, ToolQuery, ToolBench, BFCL) due to budget/stability
BFCL v2 live data was released after model training, though models still generalized well
Qualitative analysis suggests some data still suffers from low-quality reasoning steps despite filtering

Reproducibility

Code: https://github.com/SalesforceAIResearch/xLAM

Code is publicly available (https://github.com/SalesforceAIResearch/xLAM). Models are released on HuggingFace (huggingface.co/Salesforce/xLAM-models). The synthetic dataset generation pipeline (APIGen) is described and cited. Training code is based on the Transformers and Accelerate libraries. Specific training hyperparameters (learning rate, batch size) are not explicitly detailed in the text.

📊 Experiments & Results

Evaluation Setup

Evaluation across web navigation, tool use, and function calling benchmarks

Benchmarks:

Webshop (Interactive web navigation and shopping)
ToolQuery (Tool-augmented question answering)
ToolBench (Multi-turn reasoning and tool usage)
Berkeley Function-Calling Leaderboard (BFCL) v2 (Function calling: AST accuracy, executable accuracy)

Metrics:

Success Rate (Webshop, ToolQuery)
Pass Rate (ToolBench)
Overall Accuracy (BFCL)
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
BFCL v2 performance shows xLAM models dominating the leaderboard, with the largest model taking 1st place and the 1B model punching significantly above its weight.
Berkeley Function-Calling Leaderboard v2	Overall Accuracy	85.79	87.31	+1.52
Berkeley Function-Calling Leaderboard v2	Overall Accuracy	75.41	75.43	+0.02
Berkeley Function-Calling Leaderboard v2	Overall Accuracy	80.88	83.38	+2.50
Webshop and ToolQuery results demonstrate general agent capabilities beyond just function calling.
Webshop	Success Rate	0.375	0.414	+0.039
ToolQuery	Success Rate	0.750	0.683	-0.067
ToolQuery	Success Rate	0.466	0.550	+0.084
ToolBench results show robust generalization to unseen instructions and tools.
ToolBench (Unseen Tools & Seen Cat)	Pass Rate	0.5050	0.5450	+0.0400

Experiment Figures

Scatter plot comparing Model Performance (Overall Accuracy on BFCL) vs Model Size (Billions of Parameters)

Main Takeaways

Data augmentation and unification significantly improve generalization, as evidenced by xLAM-7b-r's strong performance on Webshop and ToolBench compared to raw data baselines.
Synthetic data with execution verification is highly effective: the 1B-parameter model (xLAM-1b-fc-r) competes with GPT-3.5-Turbo and larger models on function calling (75.43% vs. 75.41% on BFCL), supporting the APIGen approach.
xLAM-8x22b-r sets a new state of the art among open-weight models on function calling, showing that open models can surpass proprietary ones (GPT-4) in specialized agent tasks.
Standardizing agent data formats allows models to maintain performance even when constrained to structured outputs, whereas models like GPT-4o degrade significantly (-42%) when forced into unified formats.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Large Language Models (LLMs) and instruction tuning
Familiarity with function calling / tool use in AI agents
Knowledge of Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO)

Key Terms

SFT: Supervised Fine-Tuning—training a model on labeled examples (input-output pairs) to learn a specific task

DPO: Direct Preference Optimization—a method to align models with human/system preferences by contrasting preferred vs. rejected outputs

Function Calling: The capability of an LLM to generate structured outputs (like JSON) that invoke specific software functions with correct arguments

LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique that freezes pre-trained weights and trains small rank decomposition matrices
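To give a sense of scale: for one weight matrix of shape d_out x d_in, LoRA trains two factors of shapes d_out x r and r x d_in in place of the full matrix. The sketch below compares the two parameter counts; the layer size 4096 and rank 16 are illustrative assumptions, not values from the paper.

```python
def lora_param_counts(d_out: int, d_in: int, rank: int) -> tuple[int, int]:
    """Return (full fine-tuning params, LoRA params) for one weight matrix."""
    full = d_out * d_in
    lora = rank * (d_out + d_in)  # B is d_out x rank, A is rank x d_in
    return full, lora


# Illustrative sizes only.
full_params, lora_params = lora_param_counts(4096, 4096, 16)
trainable_fraction = lora_params / full_params
```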

AST: Abstract Syntax Tree—a tree representation of the abstract syntactic structure of source code, used here to evaluate the correctness of generated function calls
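A simplified illustration of AST-based checking, using Python's standard `ast` module. It parses a generated call written as Python source and compares the function name and keyword arguments to an expected call. The function name and arguments are hypothetical, and the actual BFCL evaluation is more elaborate than this sketch.

```python
import ast
from typing import Any


def parse_call(source: str) -> tuple[str, dict[str, Any]]:
    """Parse 'name(key=literal, ...)' into (name, keyword arguments)."""
    node = ast.parse(source, mode="eval").body
    if not isinstance(node, ast.Call) or not isinstance(node.func, ast.Name):
        raise ValueError("expected a simple function call")
    if node.args:
        raise ValueError("positional arguments are not handled in this sketch")
    # literal_eval accepts only literals, so arbitrary expressions are rejected.
    kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
    return node.func.id, kwargs


def matches_expected(generated: str, name: str, arguments: dict[str, Any]) -> bool:
    """True if the generated call parses and matches the expected name and arguments."""
    try:
        got_name, got_args = parse_call(generated)
    except (SyntaxError, ValueError):
        return False
    return got_name == name and got_args == arguments


expected_args = {"city": "Paris", "days": 3}
ok = matches_expected('get_weather(days=3, city="Paris")', "get_weather", expected_args)
```

Because the comparison is made on the parsed structure, argument order and whitespace in the generated text do not affect the result.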

APIGen: A data synthesis framework used by the authors to generate verifiable function-calling datasets by executing the calls to ensure validity

Mixture-of-Experts (MoE): A model architecture that uses multiple sub-networks ('experts') and a gating mechanism to activate only a subset of them for each input token
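The gating step can be sketched as top-k routing: pick the k experts with the highest router scores for a token and renormalize their scores with a softmax. The scores and the choice of k=2 below are illustrative assumptions; this summary does not describe the routing used by the xLAM base models.

```python
import math


def top_k_gate(router_logits: list[float], k: int = 2) -> dict[int, float]:
    """Return {expert index: mixing weight} for the k highest-scoring experts."""
    ranked = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:k]
    peak = max(router_logits[i] for i in chosen)  # subtract the max for numerical stability
    exps = {i: math.exp(router_logits[i] - peak) for i in chosen}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


# Hypothetical router scores for one token over four experts.
weights = top_k_gate([0.1, 2.0, -1.0, 2.0], k=2)
```

Only the selected experts run for that token, and their outputs are combined using these weights.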

FSDP: Fully Sharded Data Parallel—a memory optimization technique for distributed training that shards model parameters, gradients, and optimizer states across GPUs

Hallucination: When a model generates incorrect or non-existent information, such as inventing function arguments that weren't in the user query