LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents

📝 Paper Summary

Multimodal Agents Tool Use / Tool Learning Visual Instruction Tuning

LLaVA-Plus extends a large multimodal model by training it to actively select and use external vision tools (like detectors and generators) to answer complex visual user requests.

Core Problem

Existing Multimodal Agents either lack broad skills (segmentation, generation) or rely on text-only LLMs to call tools without seeing the image, leading to poor planning and context grounding.

Why it matters:

Standard LMMs (Large Multimodal Models) cannot perform specialized tasks like editing images or precise segmentation without external help.
Tool-chaining methods (like Visual ChatGPT) use text prompts to call tools, but because the planner cannot see the image, it often hallucinates or invokes incorrect tools for the visual context.

Concrete Example: In a tool-chaining system, if a user asks about the location of a 'frisbee' in an image, a text-only planner might fail to invoke a detector if the caption misses the frisbee. LLaVA-Plus sees the image directly, recognizes the need for detection, calls the tool, and accurately reports the coordinates.

Key Novelty

End-to-End Visual Tool Learning (LLaVA-Plus)

Integrates a 'Skill Repository' of vision experts (tools) directly with an LMM that acts as a planner, trained to output structured 'Thought', 'Action', and 'Value' sequences.
Unlike prior tool agents, the planner sees the raw image during the decision-making process, allowing visual signals to guide which tool is selected.
Introduces a new pipeline for curating 'skill-oriented' multimodal instruction data where the model learns to invoke tools and summarize their outputs.

Architecture

The four-step workflow of LLaVA-Plus: Human input, Assistant planning (tool selection), Tool execution, and Assistant response generation.

Evaluation Highlights

Achieves state-of-the-art Elo rating of 1203 on VisIT-Bench, outperforming the base LLaVA model (1095) by over 100 points.
Surpasses commercial systems on the LLaVA-Bench (Tool Use) dataset, scoring 82.3 compared to Bing Chat's 70.2 and Bard's 76.3.
Outperforms MM-REACT (a tool-augmented LLM) on tool-use benchmarks (82.3 vs 76.5), validating the benefit of image-grounded planning.

Breakthrough Assessment

8/10

Significant step in making LMMs general-purpose agents. Bridging the gap between monolithic LMMs and modular tool use via instruction tuning is highly effective and practical.

⚙️ Technical Details

Problem Definition

Setting: Multimodal Agentic Interaction

Inputs: Image I_q and User Instruction X_q

Outputs: Natural language response X_answer (potentially derived from intermediate tool execution results)

Pipeline Flow

User Input (Image + Text)
LMM Planner (Analyzes input -> Predicts Tool)
Tool Execution (External models process image)
LMM Responder (Aggregates Tool Result + Context -> Final Answer)

System Modules

LMM Planner

Analyzes image and text to generate 'Thought' and 'Action' (API call)

Model or implementation: LLaVA-Plus (based on Vicuna)

Skill Repository

Executes specialized vision tasks based on API calls

Model or implementation: Various (Grounding DINO, SAM, BLIP2, Stable Diffusion, etc.)

LMM Responder

Synthesizes final answer from user input and tool results

Model or implementation: LLaVA-Plus (same model as Planner)

Novel Architectural Elements

Unified prediction format: The LMM is trained to output a structured sequence (Thought, Action, Value) that handles both tool invocation and final response generation within the same auto-regressive generation process.

Modeling

Base Model: LLaVA (based on Vicuna LLM + CLIP ViT-L/14)

Training Method: Visual Instruction Tuning (Supervised Fine-Tuning)

Objective Functions:

Purpose: Auto-regressive language modeling on instruction sequences.

Formally: Standard cross-entropy loss on valid tokens (green tokens in examples), ignoring user instructions.

Adaptation: Full fine-tuning (assumed based on LLaVA architecture, though paper implies standard LLaVA tuning)

Trainable Parameters: LLM backbone and Projector (typical LLaVA setup)

Training Data:

LLaVA-158K dataset (for general chat)
New Tool-Use Instruction Data (~81K samples for Understanding skills, plus Generation/Knowledge sets)

Key Hyperparameters:

model_sizes: 7B and 13B
max_length: Not reported in the paper
batch_size: Not reported in the paper

Compute: Serves 7B model + all tools on a single 80G GPU.

Comparison to Prior Work

vs. MM-REACT/Visual ChatGPT: LLaVA-Plus sees the image during planning; MM-REACT relies on text descriptions which may miss visual details required for tool selection.
vs. GPT4Tools: LLaVA-Plus is an LMM (multimodal inputs), whereas GPT4Tools is an LLM that only sees visual signals *after* tools are activated.
vs. LLaVA: LLaVA-Plus adds the Skill Repository and is trained on tool-use data, enabling capabilities (segmentation, generation) LLaVA lacks.

Limitations

Hallucinations: The model may still generate incorrect tool arguments or textual responses.
Tool Use Conflicts: Challenges in selecting the optimal tool when multiple tools might apply.
Cost: Calling external tools increases inference latency and computational cost compared to a standalone model.

Reproducibility

Code: https://llava-vl.github.io/llava-plus/

publicly available (https://llava-vl.github.io/llava-plus/). The codebase, checkpoints, and the generated multimodal instruction data are released. The exact prompt templates for data generation are described in the appendix.

📊 Experiments & Results

Evaluation Setup

Multimodal chat and reasoning tasks, evaluated both on standard benchmarks and a new tool-specific benchmark.

Benchmarks:

VisIT-Bench (Real-world multimodal instruction following)
LLaVA-Bench (Tool Use) (Tool capabilities (Grounding, Tagging, Caption, OCR)) [New]
MM-Vet (Integrated vision-language capabilities)
SEED-Bench (Image/Instance level perception and reasoning)

Metrics:

Elo Rating
Accuracy / Score (relative to GPT-4)
Win Rate
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
VisIT-Bench results showing LLaVA-Plus achieving State-of-the-Art performance against open-source and human-verified references.
VisIT-Bench	Elo Rating	1095	1203	+108
VisIT-Bench	Win Rate (vs Human Ref)	18.53	35.07	+16.54
Performance on the authors' new LLaVA-Bench (Tool Use) measuring specific tool-enabled capabilities.
LLaVA-Bench (Tool Use)	Average Score	58.7	82.3	+23.6
LLaVA-Bench (Tool Use)	Average Score	76.5	82.3	+5.8
General multimodal capabilities evaluated on MM-Vet.
MM-Vet	Total Score	32.5	35.0	+2.5

Experiment Figures

Visual examples of new capabilities enabled by LLaVA-Plus.

Main Takeaways

Integrating tools via visual instruction tuning consistently outperforms 'Tool Chaining' (prompting LLMs) because the planner is grounded in the image.
Tool use significantly boosts performance on tasks requiring precision (OCR, Detection) which are weaknesses of standard LMMs.
The 'Skill-Oriented Dialogue' training format (Thought/Action/Value) effectively enables the model to plan and execute multi-step tool interactions.
LLaVA-Plus enables new capabilities like 'Edit and Post' (Generate image -> Caption it) that neither pure LMMs nor pure Generation models can do alone.

📚 Prerequisite Knowledge

Prerequisites

Large Multimodal Models (LMMs)
Instruction Tuning / Visual Instruction Tuning
Tool Use in Language Models (e.g., Toolformer)
Computer Vision tasks (Segmentation, Detection, OCR)

Key Terms

LMM: Large Multimodal Model—a model capable of processing and generating both text and images (e.g., LLaVA, GPT-4V)

Instruction Tuning: Training a model on dataset of (instruction, output) pairs to improve its ability to follow user commands

SFT: Supervised Fine-Tuning—training a pre-trained model on a smaller, specific dataset

Skill Repository: A collection of specialized pre-trained vision models (tools) that the LMM can call via API

Grounding DINO: An open-set object detection model that finds objects based on text descriptions

SAM: Segment Anything Model—a promptable segmentation system

Elo rating: A comparative ranking system used here to measure relative model performance against human preference

OCR: Optical Character Recognition—converting text in images into machine-encoded text

Hallucination: When a model generates incorrect or nonsensical information not supported by the input