Vision-Flan: Scaling Human-Labeled Tasks in Visual Instruction Tuning

📝 Paper Summary

Visual Instruction Tuning, Dataset Construction, Vision-Language Models (VLMs)

Vision-Flan scales visual instruction tuning with 187 diverse human-labeled tasks to improve VLM capabilities and robustness, demonstrating that only 1,000 synthesized instances are needed for alignment.

Core Problem

Existing Vision-Language Models (VLMs) suffer from poor generalizability, hallucination, and catastrophic forgetting due to low task diversity in pre-training and reliance on biased GPT-4 synthesized instruction data.

Why it matters:

VLMs like LLaVA perform poorly on basic tasks like OCR because pre-training is dominated by image captioning and lacks diversity
Reliance on GPT-4 synthesized data introduces spurious correlations and a bias toward positive answers ('Yes'), causing severe hallucinations where models describe objects not present in the image
Visual instruction tuning often causes catastrophic forgetting, where VLMs lose performance on basic image classification tasks (e.g., MNIST, CIFAR-10) compared to their base vision encoders

Concrete Example: When trained on GPT-4 synthesized data, VLMs frequently answer 'Yes' to questions about object existence even when the object is absent (hallucination) because the training data is biased. Additionally, standard VLMs fail basic OCR tasks absent from caption-heavy pre-training data.

Key Novelty

Two-Stage Tuning with Massive Task Diversity (Vision-Flan)

Constructs the Vision-Flan dataset by aggregating 187 diverse academic tasks (1.6M instances) re-formatted with expert-written instructions, moving beyond simple captioning
Proposes a two-stage tuning framework: first fine-tuning on diverse human-labeled tasks for capability, then tuning on a tiny subset (1,000 instances) of GPT-4 data for human-preference alignment
Shows empirically that visual instruction tuning primarily helps the LLM understand visual features, while the MLP connector is largely learned during pre-training

Architecture

The LLaVA model architecture and the proposed two-stage visual instruction tuning pipeline.

Evaluation Highlights

+3.1 points on MM-Bench and +6.5 points on MME compared to LLaVA-1.5, achieving state-of-the-art results
Maintains 84.0% average accuracy on catastrophic forgetting benchmarks (CF), significantly outperforming LLaVA-1.5's 73.3%
Achieves 78.3 on LLaVA-Bench (alignment metric) using only 1,000 GPT-4 synthesized instances, validating the efficiency of the two-stage approach

Breakthrough Assessment

8/10

Significantly challenges the trend of relying solely on synthesized data, providing a rigorous dataset and methodology that improves robustness and reduces hallucination with minimal synthetic data.

⚙️ Technical Details

Problem Definition

Setting: Visual Instruction Tuning of Multi-Modal Large Language Models

Inputs: Image I and text instruction T

Outputs: Text response R
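
Written out, the setting above corresponds to the usual supervised fine-tuning objective for an autoregressive model. The source names SFT but does not spell out the loss, so the formula below is the standard next-token formulation under that assumption; θ (the trainable parameters) and the token index t are notation introduced here.

```latex
\mathcal{L}(\theta) = -\sum_{t=1}^{|R|} \log p_{\theta}\left(R_t \mid I, T, R_{<t}\right)
```

Here R_t is the t-th token of the response R and R_{<t} are the response tokens before it; the loss is taken over response tokens only, conditioned on the image I and the instruction T.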

Pipeline Flow

Input Image -> Vision Encoder (CLIP-ViT-L-336px)
Visual Features -> MLP Projection Layers
Projected Features + Text Instruction -> Large Language Model (Vicuna-13B v1.5)
LLM -> Generated Response
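
The flow above can be sketched at the level of tensor shapes. This is a toy illustration in plain Python, not the authors' code: the dimensions are deliberately tiny, the random vectors stand in for the frozen vision encoder's output and for the instruction embeddings, the GELU activation between the two linear layers is an assumption (the source only says "two-layer MLP"), and placing the visual tokens in front of the text tokens is a simplification. All function names are hypothetical.

```python
import math
import random


def gelu(v):
    """Exact GELU activation for a single float."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


def linear(rows, weight, bias):
    """Apply y = x @ weight + bias to each row; weight is in_dim x out_dim."""
    cols = list(zip(*weight))
    return [[sum(x * w for x, w in zip(row, col)) + b
             for col, b in zip(cols, bias)] for row in rows]


def init_linear(rng, in_dim, out_dim):
    """Small random weights and zero bias for one linear layer."""
    weight = [[rng.gauss(0.0, 0.02) for _ in range(out_dim)] for _ in range(in_dim)]
    return weight, [0.0] * out_dim


def project(visual_feats, layer1, layer2):
    """Two-layer MLP connector: Linear -> GELU -> Linear (activation assumed)."""
    hidden = [[gelu(v) for v in row] for row in linear(visual_feats, *layer1)]
    return linear(hidden, *layer2)


def build_llm_input(projected, text_embeds):
    """Simplification: visual tokens first, then the instruction tokens."""
    return projected + text_embeds


rng = random.Random(0)
VISION_DIM, LLM_DIM = 8, 16          # toy widths, not the real model sizes
NUM_PATCHES, NUM_TEXT_TOKENS = 4, 3  # toy sequence lengths

# Stand-ins for the frozen vision encoder output and the instruction embeddings.
visual_feats = [[rng.gauss(0, 1) for _ in range(VISION_DIM)] for _ in range(NUM_PATCHES)]
text_embeds = [[rng.gauss(0, 1) for _ in range(LLM_DIM)] for _ in range(NUM_TEXT_TOKENS)]

layer1 = init_linear(rng, VISION_DIM, LLM_DIM)
layer2 = init_linear(rng, LLM_DIM, LLM_DIM)
llm_input = build_llm_input(project(visual_feats, layer1, layer2), text_embeds)
print(len(llm_input), len(llm_input[0]))
```

The point of the sketch is that the connector only changes the feature width: each visual token leaves the MLP with the same dimensionality as a text embedding, so the LLM can consume both in one sequence.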

System Modules

Vision Encoder

Extract visual features from the input image

Model or implementation: CLIP-ViT-L-336px

MLP Projection

Map visual features to the LLM's token embedding space

Model or implementation: Two-layer MLP

Language Model

Generate text response based on visual tokens and text instruction

Model or implementation: Vicuna-13B v1.5

Modeling

Base Model: Vicuna-13B v1.5 (LLM) + CLIP-ViT-L-336px (Vision Encoder)

Training Method: Two-stage Supervised Fine-Tuning (SFT)

Adaptation: Full fine-tuning of MLP and LLM (Vision Encoder frozen)
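
As a rough sketch of what this adaptation setting means in practice, the snippet below partitions parameter names into trainable and frozen groups. The module prefixes (`vision_encoder.`, `mlp_projector.`, `llm.`) and the parameter names are hypothetical and chosen for illustration only; in a PyTorch training script the frozen group would typically have `requires_grad` set to `False`.

```python
def split_trainable(param_names, frozen_prefixes=("vision_encoder.",)):
    """Partition parameter names: anything under a frozen prefix is not updated."""
    trainable, frozen = [], []
    for name in param_names:
        # str.startswith accepts a tuple of prefixes.
        (frozen if name.startswith(frozen_prefixes) else trainable).append(name)
    return trainable, frozen


# Hypothetical parameter names for the three modules.
names = [
    "vision_encoder.layers.0.attn.weight",
    "mlp_projector.0.weight",
    "mlp_projector.2.weight",
    "llm.layers.0.attn.weight",
]
trainable, frozen = split_trainable(names)
print(trainable)
print(frozen)
```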

Training Data:

Stage 1: VISION-FLAN (1,664,261 instances, 187 tasks)
Stage 2: 1,000 instances randomly sampled from LLaVA (GPT-4 synthesized) dataset
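
The Stage 2 subset is described as a random sample of 1,000 instances. A minimal, reproducible way to draw such a subset is sketched below; the function name, the fixed seed, and the placeholder pool of records are assumptions made for illustration, and the source does not say how the authors seeded or implemented the sampling.

```python
import random


def sample_alignment_subset(instances, k=1000, seed=0):
    """Draw k distinct instances uniformly at random, reproducibly."""
    if k > len(instances):
        raise ValueError("pool is smaller than the requested subset")
    return random.Random(seed).sample(instances, k)


# Placeholder pool standing in for the GPT-4 synthesized instruction data.
pool = [{"id": i} for i in range(5000)]
subset = sample_alignment_subset(pool)
print(len(subset))
```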

Key Hyperparameters:

stage_1_learning_rate: 2e-5
stage_1_batch_size: 16 per device (128 total on 8 GPUs)
stage_1_epochs: 1
stage_2_learning_rate: 1e-5
stage_2_batch_size: 8 per device
stage_2_steps: 128

Compute: 8 A100 GPUs
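
The listed hyperparameters imply some useful back-of-the-envelope numbers. The sketch below derives them under three assumptions that the source does not state: no gradient accumulation, Stage 2 running on the same 8 GPUs as Stage 1, and a final partial batch being kept rather than dropped. The helper names are hypothetical.

```python
import math

NUM_GPUS = 8  # from the Compute line


def global_batch(per_device, num_gpus=NUM_GPUS, grad_accum=1):
    """Effective batch size per optimizer step."""
    return per_device * num_gpus * grad_accum


def steps_per_epoch(num_instances, batch):
    """Optimizer steps needed for one pass, keeping the last partial batch."""
    return math.ceil(num_instances / batch)


# Stage 1: 16 per device, one epoch over 1,664,261 instances.
stage1_batch = global_batch(16)
stage1_steps = steps_per_epoch(1_664_261, stage1_batch)

# Stage 2: 8 per device for 128 steps over 1,000 instances.
stage2_batch = global_batch(8)
stage2_samples_seen = stage2_batch * 128
stage2_passes = stage2_samples_seen / 1_000

print(stage1_batch, stage1_steps)
print(stage2_batch, stage2_samples_seen, stage2_passes)
```

Under these assumptions Stage 1 takes roughly 13,000 optimizer steps, while Stage 2 revisits each of the 1,000 alignment instances about eight times. Both figures are derived here, not reported in the source.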

Comparison to Prior Work

vs. LLaVA-1.5: Vision-Flan uses a two-stage approach (Capability then Alignment) and significantly more diverse tasks (187 vs ~12 categories)
vs. InstructBLIP: Vision-Flan targets broader task diversity and uses a simpler MLP architecture compared to Q-Former
vs. ShareGPT4V [not cited in paper]: Vision-Flan focuses on human-labeled academic tasks for correctness, whereas ShareGPT4V relies on scaling high-quality GPT-4V synthetic captions

Limitations

All tasks are in English, limiting multilingual applicability
Restricted to single-image inputs; does not support multi-image or video tasks
Analysis focuses on LLaVA architecture; generalization to other architectures (e.g., Q-former) is not tested

Reproducibility

The paper describes the dataset construction and training process in detail. The Vision-Flan dataset is composed of publicly available academic datasets. However, the code repository URL is not explicitly provided in the paper text.

📊 Experiments & Results

Evaluation Setup

Evaluation on comprehensive multi-modal benchmarks covering perception, reasoning, and hallucination.

Benchmarks:

MM-Bench: comprehensive multi-modal evaluation (multiple choice)
MME: perception and cognition evaluation
LLaVA-Bench: open-ended conversation (human preference proxy)
POPE: object hallucination evaluation
CF Benchmarks: catastrophic forgetting (CIFAR-10, CIFAR-100, MNIST, miniImageNet)

Metrics:

Accuracy
Score (MME total)
Relative Score (LLaVA-Bench)
Statistical methodology: Not explicitly reported in the paper

Key Results

Vision-Flan-Base (Stage 1) achieves SOTA on academic benchmarks but scores low on chat-style alignment. Vision-Flan-Chat (Stage 2) restores alignment with minimal data.
Benchmark	Metric	Baseline	This Paper	Δ
MME	Score	1531.3	1537.8	+6.5
MM-Bench	Accuracy	66.7	69.8	+3.1
LLaVA-Bench	Score	70.7	78.3	+7.6
POPE	Accuracy	83.6	86.1	+2.5
CF (Average)	Accuracy	73.3	84.0	+10.7
LLaVA-Bench	Score	63.9	78.3	+14.4
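
The Δ column is the "This Paper" value minus the baseline. The short check below recomputes it from the rows exactly as listed above, including both LLaVA-Bench rows with their different baselines; the helper name `delta` is introduced here for illustration.

```python
# (benchmark, baseline, this paper, reported delta), copied from the table.
rows = [
    ("MME", 1531.3, 1537.8, 6.5),
    ("MM-Bench", 66.7, 69.8, 3.1),
    ("LLaVA-Bench", 70.7, 78.3, 7.6),
    ("POPE", 83.6, 86.1, 2.5),
    ("CF (Average)", 73.3, 84.0, 10.7),
    ("LLaVA-Bench", 63.9, 78.3, 14.4),
]


def delta(baseline, ours):
    """Improvement over the baseline, rounded to one decimal place."""
    return round(ours - baseline, 1)


# Collect any row whose recomputed delta disagrees with the reported one.
mismatches = [name for name, base, ours, reported in rows
              if abs(delta(base, ours) - reported) > 1e-9]
print(mismatches)
```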

Experiment Figures

Performance on four benchmarks (MM-Bench, MME, POPE, MMMU) as a function of the number of unique tasks used in training.

Effect of the number of GPT-4 synthesized instances on human preference alignment (LLaVA-Bench).

Relationship between the number of GPT-4 training instances, the ratio of 'Yes' answers, and hallucination (POPE accuracy).

Main Takeaways

Increasing the number of human-labeled tasks directly correlates with improved VLM capabilities across benchmarks.
GPT-4 synthesized data mainly modulates response format (style) rather than adding fundamental capability; 1,000 instances are sufficient for this alignment.
Excessive GPT-4 data introduces bias (e.g., saying 'Yes' too often), leading to increased hallucination.
Visual instruction tuning primarily updates the LLM to understand visual features; the MLP connector's weights are largely established during pre-training.
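
The 'Yes'-bias takeaway can be made concrete with the two quantities involved: the share of 'Yes' answers and the accuracy on yes/no object-existence questions. The sketch below uses a tiny invented set of labels and predictions purely to show the computation; it is toy data, not results from the paper, and the function names are hypothetical.

```python
def yes_ratio(predictions):
    """Fraction of answers that are 'yes' (case-insensitive)."""
    return sum(p.strip().lower() == "yes" for p in predictions) / len(predictions)


def accuracy(predictions, labels):
    """Fraction of answers that match the ground-truth yes/no label."""
    pairs = zip(predictions, labels)
    return sum(p.strip().lower() == y.strip().lower() for p, y in pairs) / len(labels)


# Toy example: a balanced question set and a model that always answers 'Yes'.
labels = ["yes", "no", "yes", "no"]
biased_predictions = ["Yes", "Yes", "Yes", "Yes"]

print(yes_ratio(biased_predictions))
print(accuracy(biased_predictions, labels))
```

On a balanced question set, a model that always answers 'Yes' has a 'Yes' ratio of 1.0 but is right only half the time, which is why a rising 'Yes' ratio goes hand in hand with more object hallucination.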

📚 Prerequisite Knowledge

Prerequisites

Understanding of Vision-Language Models (VLMs) like LLaVA
Familiarity with Instruction Tuning concepts
Knowledge of catastrophic forgetting in neural networks

Key Terms

Visual Instruction Tuning: Fine-tuning a pre-trained VLM on pairs of images and instructions to improve its ability to follow user commands

Hallucination: When a model generates incorrect information, such as describing objects that are not present in the image

Catastrophic Forgetting: The tendency of a neural network to abruptly lose previously learned abilities (like basic classification tasks) when it is trained on new data

LLaVA Architecture: A specific VLM design connecting a vision encoder (CLIP) to an LLM (Vicuna) via MLP layers

MLP: Multilayer Perceptron—simple fully connected neural network layers used here to project visual features into the LLM's embedding space

OCR: Optical Character Recognition—the task of detecting and reading text embedded within images