Can a Single Model Master Both Multi-turn Conversations and Tool Use? CoALM: A Unified Conversational Agentic Language Model

📝 Paper Summary

Multi-turn with user interactions, Tool-use post-training

CoALM is a unified model family trained on a novel dataset merging multi-turn dialogue, tool use, and ReAct reasoning to master both conversational and agentic tasks simultaneously.

Core Problem

Existing models specialize in either task-oriented dialogue (TOD) or tool calling (Language Agents, LA) and fail to generalize across the two: TOD systems lack diverse API support, while agents struggle with multi-turn intent tracking.

Why it matters:

Users require agents that can both clarify ambiguous intent through conversation and execute complex actions via diverse APIs
Current approaches require expensive fine-tuning or brittle prompt engineering to adapt to new services
A significant performance gap exists between open-source models and proprietary models like GPT-4 on integrated conversational tasks

Concrete Example: When a user says 'Find me a hotel', a pure tool-use agent might fail to ask clarifying questions (location, price) and just call a generic API, while a traditional dialogue system can ask questions but cannot call new, unseen APIs like 'search_direct_flight' without retraining.

Key Novelty

Unified Conversational Agentic Language Model (CoALM)

Creates CoALM-IT, a dataset interleaving three skills: dialogue state tracking (TOD), complex function calling (LA), and a novel Conversational ReAct (CRA) format
CRA introduces multi-step 'thought' processes into multi-turn dialogue: one thought for deciding API calls and another for formulating user responses
Trains a single model family to handle both state tracking and dynamic API usage without switching models or pipelines

Architecture

The CoALM training pipeline, showing the three data sources (TOD, LA, CRA) merging into the CoALM-IT dataset for multi-task fine-tuning.

Evaluation Highlights

CoALM-70B achieves a success rate 2.2 points higher than GPT-4o on MultiWOZ 2.4 (TOD benchmark) while maintaining strong function calling capabilities
On BFCL V3 (function calling), CoALM-70B outperforms GPT-4o with an accuracy of 80.50% vs 78.43%
CoALM-8B outperforms the specialized ToolAce-8B model by 12.5 points on MultiWOZ Success Rate, showing superior multi-turn management

Breakthrough Assessment

8/10

Successfully bridges the gap between dialogue systems and agentic tool-use models with a unified open-source approach that beats GPT-4o on key benchmarks.

⚙️ Technical Details

Problem Definition

Setting: Multi-turn task-oriented dialogue with dynamic function calling capabilities

Inputs: User utterance and conversation history

Outputs: Next action (API call, dialogue state update, or natural language response)

Pipeline Flow

Instruction Processing (Unified Prompt)
Reasoning/Thought Generation (Step 1: API Decision)
Action Execution (API Call / State Update)
Observation Integration
Reasoning/Thought Generation (Step 2: Response Formulation)
Final Response Generation
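The six steps above can be sketched as a single turn-level loop. This is a minimal illustration, not the paper's implementation: `ToyPolicy` is a hypothetical rule-based stand-in for the fine-tuned model, and `search_hotel`, the `Turn` fields, and the action dictionary shape are all assumed names chosen for the example.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional, Tuple

Action = Dict[str, Any]  # assumed shape: {"name": str, "args": dict}


@dataclass
class Turn:
    thought1: str
    action: Optional[Action]
    observation: Optional[Any]
    thought2: str
    response: str


class ToyPolicy:
    """Rule-based stand-in for the model (hypothetical, for illustration only)."""

    def decide(self, history: List[str], utterance: str) -> Tuple[str, Optional[Action]]:
        # Pad with spaces so " in " also matches at the start of the text.
        text = " " + " ".join(history + [utterance]).lower() + " "
        if " in " not in text:
            return "Location is missing; ask before calling any API.", None
        city = text.rsplit(" in ", 1)[-1].strip(" .?!")
        action = {"name": "search_hotel", "args": {"city": city}}
        return f"User wants a hotel in {city}; call search_hotel.", action

    def respond(self, observation: Optional[Any]) -> Tuple[str, str]:
        if observation is None:
            return "No API result; request the missing slot.", "Which city are you looking in?"
        return "Summarize the API result.", f"I found: {', '.join(observation)}."


def run_turn(policy: ToyPolicy, tools: Dict[str, Callable[..., Any]],
             history: List[str], utterance: str) -> Turn:
    thought1, action = policy.decide(history, utterance)      # thought, step 1: API decision
    observation = None
    if action is not None:
        observation = tools[action["name"]](**action["args"])  # action execution + observation
    thought2, response = policy.respond(observation)           # thought, step 2: response formulation
    return Turn(thought1, action, observation, thought2, response)


# Hypothetical tool registry with one fake API.
TOOLS = {"search_hotel": lambda city: [f"Hotel Central ({city})"]}

first = run_turn(ToyPolicy(), TOOLS, [], "Find me a hotel")
second = run_turn(ToyPolicy(), TOOLS, ["Find me a hotel"], "in Paris")
print(first.response)
print(second.response)
```

In the first turn the policy takes no action and asks a clarifying question; in the second it calls the tool and formulates a response from the observation, which mirrors the clarify-then-act behavior the summary describes.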

System Modules

CoALM Model

Unified model for state tracking, decision making, and response generation

Model or implementation: Llama 3.1 8B / Llama 3.3 70B / Llama 3.1 405B (fine-tuned)

Novel Architectural Elements

Integration of a dual-thought ReAct loop (Thought1 -> Action -> Observation -> Thought2 -> Response) directly into the fine-tuning data of a TOD model
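One way to picture how such a loop might be written into fine-tuning data is to serialize each turn as labeled segments and mark which ones the model is trained to produce. The sketch below is an assumption, not the paper's actual template: the field labels, the JSON encoding of the action, and the choice to treat the observation as context rather than a target are illustrative (the last follows the summary's statement that targets are thoughts, actions, and responses).

```python
import json
from typing import List, Tuple

# Order of the dual-thought loop as described in the summary.
FIELDS = ["Thought1", "Action", "Observation", "Thought2", "Response"]
# Segments written by the model; the observation comes from the tool.
MODEL_WRITTEN = {"Thought1", "Action", "Thought2", "Response"}


def serialize_turn(turn: dict) -> List[Tuple[str, bool]]:
    """Return (text, is_target) segments for one turn (hypothetical format)."""
    segments = []
    for name in FIELDS:
        value = turn[name]
        text = value if isinstance(value, str) else json.dumps(value, sort_keys=True)
        segments.append((f"{name}: {text}\n", name in MODEL_WRITTEN))
    return segments


example_turn = {
    "Thought1": "The user gave a city, so I can search for hotels.",
    "Action": {"name": "search_hotel", "args": {"city": "Paris"}},
    "Observation": {"results": ["Hotel Central"]},
    "Thought2": "One hotel was found; present it to the user.",
    "Response": "I found Hotel Central in Paris. Shall I book it?",
}

segments = serialize_turn(example_turn)
full_text = "".join(text for text, _ in segments)
print(full_text)
```

The boolean flag on each segment is what a data collator could later use to decide where the training loss applies.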

Modeling

Base Model: Llama 3.1 8B, Llama 3.3 70B, Llama 3.1 405B

Training Method: Supervised Fine-Tuning (SFT) on CoALM-IT dataset

Objective Functions:

Purpose: Minimize cross-entropy loss on target tokens (thoughts, actions, responses).

Formally: Standard language modeling loss.
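Written out, a standard token-level objective of this kind could take the following form. The notation is an assumption for illustration: $x$ is the prompt (instruction plus conversation history), $y$ is the target sequence, and $T$ is the set of target positions (thoughts, actions, responses). Whether the loss is averaged over $|T|$ or summed is not stated in the summary.

```latex
\mathcal{L}(\theta) = -\frac{1}{|T|} \sum_{t \in T} \log p_\theta\left(y_t \mid x, y_{<t}\right)
```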

Adaptation: LoRA (rank=16, α=32) for all linear layers

Trainable Parameters: LoRA adapters only. Training time: approx. 8 hours for the 8B model and 60 hours for the 70B model
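To give a sense of scale for rank 16 and α = 32, the sketch below counts the adapter parameters that standard LoRA adds to one linear layer and computes the usual α/r scaling factor. The 4096 x 4096 layer size is an illustrative assumption, not a figure from the paper.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in the two low-rank factors: A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


def lora_scaling(alpha: float, rank: int) -> float:
    """Standard LoRA scaling applied to the low-rank update B @ A."""
    return alpha / rank


RANK, ALPHA = 16, 32  # values reported in the summary
D = 4096              # illustrative layer width (assumption)

full_params = D * D
adapter_params = lora_trainable_params(D, D, RANK)

print(f"full layer:    {full_params:,} parameters")
print(f"LoRA adapter:  {adapter_params:,} parameters")
print(f"fraction:      {adapter_params / full_params:.4%}")
print(f"scaling (a/r): {lora_scaling(ALPHA, RANK)}")
```

For a square layer of this size the adapter is under 1% of the frozen weight matrix, which is why only the adapters need gradients and optimizer state.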

Training Data:

TOD: SNIPS (24,542 samples) converted to instruction format
LA: Hammer and ToolACE datasets (216,319 samples) for function calling
CRA: 82,236 ReAct-style multi-turn samples generated via GPT-4o from SGD dataset
Total: CoALM-IT merged dataset
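As a rough check on the mixture, the sample counts listed above can be summed and turned into per-source shares. The sketch assumes the merged dataset is a plain concatenation with no deduplication or re-weighting, which the summary does not confirm; `merge_and_shuffle` and the `source` tag are hypothetical helpers.

```python
import random
from typing import Dict, List

# Sample counts as listed in the summary.
SOURCE_COUNTS: Dict[str, int] = {"TOD": 24_542, "LA": 216_319, "CRA": 82_236}


def mixture_shares(counts: Dict[str, int]) -> Dict[str, float]:
    """Fraction of the merged dataset contributed by each source."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


def merge_and_shuffle(sources: Dict[str, List[dict]], seed: int = 0) -> List[dict]:
    """Concatenate sources, tag each sample with its origin, and shuffle reproducibly."""
    merged = [
        {**sample, "source": name}
        for name, samples in sources.items()
        for sample in samples
    ]
    random.Random(seed).shuffle(merged)
    return merged


total = sum(SOURCE_COUNTS.values())
shares = mixture_shares(SOURCE_COUNTS)
print(f"total samples: {total:,}")
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

Under that assumption the function-calling (LA) data makes up roughly two thirds of the mixture, CRA about a quarter, and TOD under a tenth.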

Key Hyperparameters:

learning_rate: 1e-4
batch_size: 8
epochs: 3
lora_rank: 16
lora_alpha: 32
optimizer: AdamW (implied)

Compute: 8 NVIDIA H100 GPUs

Comparison to Prior Work

vs. ToolAce/Hammer: CoALM integrates multi-turn dialogue management, whereas specialized agents fail at conversation state tracking
vs. GPT-4o: CoALM is open-weights and achieves better performance on MultiWOZ 2.4 (TOD) and BFCL V3 (function calling)
vs. Traditional TOD: CoALM handles unseen APIs and complex function calling, unlike TOD systems restricted to fixed schemas

Limitations

Relies on GPT-4o for generating the synthetic CRA training data, inheriting potential biases
Zero-shot evaluation setting may not fully reflect few-shot performance capabilities of larger proprietary models
Computational cost of the 405B model is high for deployment despite LoRA efficiency

Reproducibility

Code: https://emrecanacikgoz.github.io/CoALM/

Code, model weights, datasets, and training configs are publicly released. Training used the Oumi framework. Base models are from the Llama 3 series (Llama 3.1 and Llama 3.3). The final model has no proprietary closed-source dependencies, though GPT-4o was used for data generation.

📊 Experiments & Results

Evaluation Setup

Zero-shot evaluation across three benchmarks representing TOD and LA domains

Benchmarks:

MultiWOZ 2.4: Task-Oriented Dialogue (TOD)
Berkeley Function Calling Leaderboard (BFCL) V3: Function Calling / Language Agent (LA)
API-Bank: Function Calling / Language Agent (LA)

Metrics:

Success Rate (MultiWOZ)
Turn Accuracy (MultiWOZ)
Inform Rate (MultiWOZ)
Accuracy (BFCL, API-Bank)
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ (points)
Comparative performance on MultiWOZ 2.4 (TOD) showing CoALM's dominance in dialogue tasks.
MultiWOZ 2.4	Success	78.80 (GPT-4o)	81.00 (CoALM-70B)	+2.20
MultiWOZ 2.4	Success	62.50 (ToolAce-8B)	75.00 (CoALM-8B)	+12.50
Performance on BFCL V3 (Function Calling) demonstrating CoALM competes with or beats top agents.
BFCL V3	Accuracy	78.43 (GPT-4o)	80.50 (CoALM-70B)	+2.07
BFCL V3	Accuracy	63.09	78.11	+15.02
Results on API-Bank showing generalization to other tool-use benchmarks.
API-Bank	Accuracy	57.50	70.21	+12.71
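The Δ column is an absolute difference in percentage points, not a relative percentage. The short check below recomputes it from the table's own numbers and, for contrast, shows the relative gain; the row tuples simply restate the table and the helper names are illustrative.

```python
# (benchmark, metric, baseline, this_paper, reported_delta), copied from the table.
ROWS = [
    ("MultiWOZ 2.4", "Success", 78.80, 81.00, 2.20),
    ("MultiWOZ 2.4", "Success", 62.50, 75.00, 12.50),
    ("BFCL V3", "Accuracy", 78.43, 80.50, 2.07),
    ("BFCL V3", "Accuracy", 63.09, 78.11, 15.02),
    ("API-Bank", "Accuracy", 57.50, 70.21, 12.71),
]


def delta_points(baseline: float, ours: float) -> float:
    """Absolute difference in percentage points, rounded like the table."""
    return round(ours - baseline, 2)


def relative_gain_pct(baseline: float, ours: float) -> float:
    """Relative improvement over the baseline, in percent."""
    return round(100.0 * (ours - baseline) / baseline, 1)


for bench, metric, base, ours, reported in ROWS:
    computed = delta_points(base, ours)
    status = "ok" if computed == reported else "MISMATCH"
    print(f"{bench:13s} {metric:9s} delta={computed:+.2f} pts "
          f"(relative {relative_gain_pct(base, ours):+.1f}%) [{status}]")
```

All five reported deltas match the recomputed differences; the relative gains are larger than the point differences whenever the baseline is below 100.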

Experiment Figures

Radar chart comparing CoALM models against GPT-4o and specialized models across MultiWOZ, BFCL, and API-Bank.

Main Takeaways

Specialized Language Agents (e.g., ToolAce) excel at function calling but fail at multi-turn dialogue (low MultiWOZ success).
Traditional TOD systems or base LLMs struggle with complex, unseen function calling (low BFCL accuracy).
CoALM effectively bridges this gap, achieving SOTA or near-SOTA performance on both TOD and LA benchmarks simultaneously.
Scaling from 8B to 70B/405B provides consistent performance gains across all tasks.

📚 Prerequisite Knowledge

Prerequisites

Task-Oriented Dialogue (TOD) systems
Language Agents (LA) and tool use
ReAct (Reasoning + Acting) prompting
Instruction tuning (SFT)

Key Terms

TOD: Task-Oriented Dialogue—systems designed to help users accomplish specific goals like booking hotels

LA: Language Agents—LLMs capable of using external tools (APIs) to perform actions

ReAct: Reasoning and Acting—a prompting technique where models generate reasoning traces before taking actions

CoALM-IT: The integrated multi-task dataset created in this paper, combining TOD, LA, and CRA data

CRA: Conversational ReAct API—a dataset format introduced here where agents use multiple 'think' steps for API decision and response generation

DST: Dialogue State Tracking—estimating the user's goal/intent at each turn of a conversation

BFCL: Berkeley Function Calling Leaderboard—a benchmark for evaluating LLMs' ability to call functions correctly

MultiWOZ: Multi-Domain Wizard-of-Oz—a standard benchmark for multi-turn task-oriented dialogue

LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique

QLoRA: Quantized Low-Rank Adaptation—LoRA applied to quantized models for memory efficiency