SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models

📝 Paper Summary

Multi-modal Large Language Models (MLLMs), Visual Instruction Tuning

SPHINX enhances multi-modal LLMs by unfreezing the LLM during pre-training, mixing model weights from different data domains, aggregating diverse visual encoders, and processing high-resolution sub-images.

Core Problem

Existing MLLMs struggle with limited visual resolution (typically 224x224), domain conflict between synthetic and real-world training data, and lack of fine-grained visual perception due to frozen LLM weights or single-purpose tuning.

Why it matters:

Low resolution hinders fine-grained tasks like reading text in documents or detecting small objects
Frozen LLMs limit the potential for deep cross-modal alignment during pre-training
Training on purely synthetic data can degrade real-world performance, but mixing datasets naively confuses the model

Concrete Example: When asked to detect small objects or read dense text in a 448x448 image, standard MLLMs downsample it to 224x224, losing detail. SPHINX splits the image into four 224x224 corners plus a downsampled global view, allowing the LLM to 'see' the fine details.

Key Novelty

Three-fold Mixing Strategy (Weights, Tasks, Embeddings) + High-Res Sub-Image Processing

Mixes model weights by linearly combining an LLM tuned on real-world data with one tuned on synthetic data to capture diverse semantics without data conflict
Mixes visual embeddings from multiple encoders (CNN, ViT, Q-Former) to combine local, global, and patch-level features
Processes high-resolution images by cropping them into sub-images (e.g., 4 corners) and feeding them as a sequence of independent visual tokens to the LLM
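The weight-mixing idea in the first bullet can be sketched as a per-parameter linear interpolation between two checkpoints. This is a minimal illustration rather than the authors' code: the function name `mix_weights`, the coefficient name `beta`, and the toy parameter dictionaries are assumptions, and NumPy arrays stand in for real LLM tensors.

```python
import numpy as np

def mix_weights(theta_real, theta_syn, beta):
    """Linearly interpolate two checkpoints that share the same parameter names.

    `beta` is a hypothetical mixing coefficient in [0, 1]:
    beta = 1 keeps the real-world-tuned weights, beta = 0 keeps the synthetic-tuned ones.
    """
    if theta_real.keys() != theta_syn.keys():
        raise ValueError("checkpoints must have identical parameter names")
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    return {
        name: beta * theta_real[name] + (1.0 - beta) * theta_syn[name]
        for name in theta_real
    }

# Toy stand-ins for two fine-tuned LLM checkpoints.
theta_real = {"layer.weight": np.array([1.0, 2.0]), "layer.bias": np.array([0.0])}
theta_syn = {"layer.weight": np.array([3.0, 6.0]), "layer.bias": np.array([1.0])}

mixed = mix_weights(theta_real, theta_syn, beta=0.5)
```

Because the mix happens in weight space, no single training run has to see both data domains at once, which is how the summary describes avoiding data conflict.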

Architecture

The joint mixing paradigm including task mixing, embedding mixing (from CLIP-ViT, ConvNeXt, DINOv2, Q-Former), and weight mixing.

Evaluation Highlights

Achieves 90.8 POPE score (SPHINX-1k), surpassing LLaVA-1.5-13B (85.9) and InstructBLIP-13B (78.9)
Reaches 80.2% accuracy on VQA v2 (SPHINX-1k), outperforming Qwen-VL-7B (79.5%) and LLaVA-1.5-13B (80.0%)
Attains 91.08% accuracy on RefCOCO test-A (SPHINX-1k), outperforming specialist model G-DINO-L (88.95%) and generalist Qwen-VL-7B (88.25%)

Breakthrough Assessment

8/10

Significant engineering breakthrough in handling high-resolution inputs via sub-image sequences without expensive architectural changes. Strong performance across diverse benchmarks confirms the effectiveness of the 'mixing' paradigm.

⚙️ Technical Details

Problem Definition

Setting: Multi-modal instruction tuning and visual question answering/grounding

Inputs: Image I (optionally high-resolution) and natural language instruction T

Outputs: Text response R (answer, description, or bounding box coordinates)

Pipeline Flow

Visual Encoder Mix (processes image into tokens)
Embedding Mixing (concatenates features)
Linear Projection (aligns dimensions)
LLM (generates response)

System Modules

Visual Encoders (Input Processing)

Extract visual features using diverse architectures

Model or implementation: CLIP-ViT + CLIP-ConvNeXt + DINOv2-ViT + Q-Former

Embedding Mixer (Input Processing)

Combine features from different encoders

Model or implementation: Concatenation operations

Linear Projection (Input Processing)

Align visual token dimensions with LLM input space

Model or implementation: Two linear layers
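One way to read the embedding-mixing and projection modules above is sketched below with random arrays in place of real encoder outputs. The grouping is an assumption: patch-level encoders with the same token count are concatenated channel-wise, the query-based encoder is kept as its own group, each group gets one of the two linear layers, and the results are concatenated sequence-wise. All sizes and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only to keep the example small.
num_patch_tokens = 16   # tokens per image from patch-level encoders
num_query_tokens = 4    # tokens from a query-based encoder
llm_dim = 32            # LLM hidden size

# Stand-ins for frozen encoder outputs, each of shape (tokens, channels).
feat_vit = rng.normal(size=(num_patch_tokens, 24))
feat_convnext = rng.normal(size=(num_patch_tokens, 20))
feat_dino = rng.normal(size=(num_patch_tokens, 24))
feat_qformer = rng.normal(size=(num_query_tokens, 12))

# Channel-wise concatenation: token count unchanged, channels stacked.
patch_feats = np.concatenate([feat_vit, feat_convnext, feat_dino], axis=1)

# Two linear layers (weights only, no bias) that map each group to llm_dim.
w_patch = rng.normal(size=(patch_feats.shape[1], llm_dim))
w_query = rng.normal(size=(feat_qformer.shape[1], llm_dim))
patch_tokens = patch_feats @ w_patch
query_tokens = feat_qformer @ w_query

# Sequence-wise concatenation: tokens appended along the sequence axis.
visual_tokens = np.concatenate([patch_tokens, query_tokens], axis=0)
```

Channel-wise concatenation keeps the sequence short but needs matching token counts, while sequence-wise concatenation accepts any token count at the price of a longer LLM input.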

Large Language Model

Process visual tokens and text instructions to generate response

Model or implementation: LLaMA-2 (13B or 7B)

Novel Architectural Elements

Sub-image processing pipeline: Feeds each 448x448 image to the same encoder as 5 independent 224x224 views (1 downsampled global view + 4 local crops), relying on the LLM to learn the spatial relationships between the views
Hybrid Visual Encoder Assembly: Simultaneous use of CLIP-ConvNeXt, CLIP-ViT, DINOv2, and Q-Former features concatenated channel-wise and sequence-wise
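The sub-image split described in the first item can be sketched as follows. This is an illustrative NumPy version, not the released implementation: the function name `split_into_sub_images` is hypothetical, and the global view uses 2x2 average pooling as a simple stand-in for whatever resampling filter a real pipeline applies.

```python
import numpy as np

def split_into_sub_images(image, tile=224):
    """Return [global view, top-left, top-right, bottom-left, bottom-right].

    `image` has shape (2*tile, 2*tile, channels). Every returned view has
    shape (tile, tile, channels), so one fixed-resolution encoder fits all.
    """
    h, w, c = image.shape
    if (h, w) != (2 * tile, 2 * tile):
        raise ValueError(f"expected a {2 * tile}x{2 * tile} image, got {h}x{w}")
    # Downsample by averaging each non-overlapping 2x2 block of pixels.
    global_view = image.reshape(tile, 2, tile, 2, c).mean(axis=(1, 3))
    crops = [
        image[:tile, :tile],   # top-left
        image[:tile, tile:],   # top-right
        image[tile:, :tile],   # bottom-left
        image[tile:, tile:],   # bottom-right
    ]
    return [global_view] + crops

image = np.arange(448 * 448 * 3, dtype=np.float32).reshape(448, 448, 3)
views = split_into_sub_images(image)
```

Each of the five views would then be encoded separately and the resulting token sequences passed to the LLM one after another.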

Modeling

Base Model: LLaMA-2 (7B and 13B variants)

Training Method: Two-stage training: (1) Pre-training with unfrozen LLM, (2) Multi-task supervised fine-tuning

Adaptation: Full fine-tuning (LLM unfrozen in Stage 1)

Trainable Parameters: LLM weights + Linear Projection layers (Visual encoders frozen)

Key Hyperparameters:

learning_rate: 5e-5 (pre-training), 2e-5 (fine-tuning)
batch_size: 640 (pre-training), 128 (fine-tuning)
optimizer: AdamW (betas = (0.9, 0.95))
weight_decay: 0.1
schedule: Cosine annealing
pre_training_steps: 180,000
warmup_steps: 2,000
image_resolution: 224x224 (base), 448x448 (SPHINX-1k), 762x762 (SPHINX-2k)
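The pre-training schedule implied by these hyperparameters can be sketched as a warmup phase followed by cosine annealing. The linear warmup shape and the final rate of zero (`min_lr`) are assumptions; the list above gives only the peak rate, warmup length, total step count, and the cosine schedule. The function name `learning_rate` is hypothetical.

```python
import math

def learning_rate(step, peak_lr=5e-5, warmup_steps=2_000,
                  total_steps=180_000, min_lr=0.0):
    """Linear warmup to `peak_lr`, then cosine annealing down to `min_lr`."""
    if step < warmup_steps:
        # Assumed linear ramp from 0 to the peak rate.
        return peak_lr * step / warmup_steps
    # Fraction of the post-warmup phase completed, capped at 1.
    progress = min((step - warmup_steps) / (total_steps - warmup_steps), 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Sample the schedule at a few points of the 180,000-step pre-training run.
samples = {step: learning_rate(step) for step in (0, 1_000, 2_000, 91_000, 180_000)}
```

Under these assumptions the rate peaks at step 2,000 and falls to half the peak at step 91,000, the midpoint of the annealing phase.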

Compute: Pre-training: ~125 hours on 32 A100 GPUs (7B model). Fine-tuning: ~38 hours on 16 A100 GPUs (13B model).
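For comparison with other training budgets, the wall-clock figures above can be converted to GPU-hours. The helper `gpu_hours` is hypothetical and assumes every GPU is busy for the whole run; the two totals refer to different model sizes, so they are not summed.

```python
def gpu_hours(wall_clock_hours, num_gpus):
    """Aggregate GPU-hours for a job that keeps all GPUs busy for its full duration."""
    return wall_clock_hours * num_gpus

pretrain_7b = gpu_hours(125, 32)   # pre-training, 7B model: about 4,000 A100-hours
finetune_13b = gpu_hours(38, 16)   # fine-tuning, 13B model: about 608 A100-hours
```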

Comparison to Prior Work

vs. LLaVA: Unfreezes LLM during pre-training; mixes multiple visual encoders vs single CLIP encoder
vs. Qwen-VL: Achieves high resolution via sub-image cropping with standard 224x224 encoders rather than training a high-res encoder from scratch
vs. InstructBLIP: Mixes Q-Former with CNN/ViT features rather than relying solely on Q-Former
vs. Shikra: Integrates broader task variety (pose estimation, layout detection) beyond just grounding

Limitations

High computational cost due to processing multiple sub-images (sequence length increases significantly)
Performance on text-oriented VQA still trails models with specialized text-related pre-training (e.g., Qwen-VL)
Depends on the weight-mixing coefficient, a hyperparameter that is chosen heuristically
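The first limitation can be made concrete with a rough token count. The figure of 256 tokens per 224x224 view is a hypothetical value (it corresponds to 14x14-pixel patches), the quadratic cost model covers only self-attention, and text tokens are ignored; none of these numbers come from the paper.

```python
def visual_sequence_length(num_views, tokens_per_view):
    """Total visual tokens when every view is encoded independently."""
    return num_views * tokens_per_view

def relative_attention_cost(seq_len, baseline_len):
    """Self-attention cost ratio, assuming cost grows with the square of sequence length."""
    return (seq_len / baseline_len) ** 2

TOKENS_PER_VIEW = 256  # hypothetical: 224 / 14 = 16 patches per side, 16 * 16 tokens

single = visual_sequence_length(1, TOKENS_PER_VIEW)  # one 224x224 image
five = visual_sequence_length(5, TOKENS_PER_VIEW)    # 1 global view + 4 crops
ratio = relative_attention_cost(five, single)
```

With these assumed values the visual sequence grows from 256 to 1,280 tokens, and the attention cost over visual tokens grows by a factor of 25.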

Reproducibility

Code: https://github.com/Alpha-VLLM/LLaMA2-Accessory

The code is publicly available on GitHub. Pre-training uses LAION-400M, LAION-COCO, and RefinedWeb. Fine-tuning uses a mixture of public datasets (VQA V2, GQA, OCRVQA, etc.). The visual backbones (CLIP, DINOv2) are open-source.

📊 Experiments & Results

Evaluation Setup

Evaluation across diverse MLLM benchmarks covering general VQA, hallucination, math, and grounding.

Benchmarks:

MMBench (Multi-modal reasoning)
POPE (Object hallucination evaluation)
RefCOCO/+/g (Visual grounding: referring expression comprehension, REC)
VQA v2 (General Visual Question Answering)

Metrics:

Accuracy
F1 Score
Exact Match
Statistical methodology: Not explicitly reported in the paper

Key Results

SPHINX variants outperform the listed baselines on hallucination (POPE), general VQA (VQA v2), grounding (RefCOCO), and MME, and trail slightly on MMBench.
Benchmark	Metric	Baseline	This Paper	Δ
POPE	Accuracy	85.9	90.8	+4.9
VQA v2	Accuracy	79.5	80.2	+0.7
RefCOCO test-A	Top-1 Accuracy@0.5	88.25	91.08	+2.83
MMBench	Accuracy	67.7	67.1	-0.6
MME	Score	1531.3	1560.2	+28.9
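The "Accuracy@0.5" metric in the RefCOCO row counts a predicted box as correct when its intersection over union (IoU) with the ground-truth box reaches 0.5. A minimal sketch follows; the function names, the (x1, y1, x2, y2) box format, and the use of `>=` rather than `>` at the threshold are assumptions, and the boxes are made-up examples.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    # Clamp at zero so disjoint boxes contribute no intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def accuracy_at_iou(predictions, targets, threshold=0.5):
    """Fraction of predicted boxes whose IoU with the target is at least `threshold`."""
    hits = sum(iou(p, t) >= threshold for p, t in zip(predictions, targets))
    return hits / len(targets)

preds = [(0, 0, 10, 10), (0, 0, 10, 10)]
gts = [(0, 0, 10, 8), (5, 5, 15, 15)]
score = accuracy_at_iou(preds, gts)  # first pair IoU 0.8 (hit), second 25/175 (miss)
```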

Experiment Figures

The pipeline for handling high-resolution images by mixing scales and sub-images.

Loss curves comparing pre-training with vs. without the RefinedWeb text-only dataset.

Main Takeaways

Mixing visual encoders and unfreezing the LLM provides a strong baseline for general visual understanding.
The 'sub-image' strategy (SPHINX-1k/2k) significantly boosts performance on fine-grained tasks like POPE (hallucination) and RefCOCO (grounding), suggesting that the LLM can stitch together features from separately encoded sub-images effectively.
Weight mixing allows the model to leverage synthetic data (LAION-COCO) without losing the robustness of real-world data (LAION-400M).
SPHINX serves as a 'generalist' model capable of diverse tasks including pose estimation and layout detection, unlike specialist predecessors.

📚 Prerequisite Knowledge

Prerequisites

Transformer architecture (ViT and LLM)
Visual Instruction Tuning concepts (LLaVA, InstructBLIP)
Knowledge of Contrastive Learning (CLIP) and Self-Supervised Learning (DINO)

Key Terms

LLM: Large Language Model—a neural network trained on vast text to generate human-like language

MLLM: Multi-modal Large Language Model—an LLM capable of processing inputs like images in addition to text

ViT: Vision Transformer—a model that processes images as sequences of patches using attention mechanisms

CLIP: Contrastive Language-Image Pre-training—a model trained to match images with their text descriptions

DINOv2: A self-supervised vision transformer trained without labels to learn robust visual features

Q-Former: A module from BLIP-2 that bridges frozen image encoders and LLMs using learnable query vectors

SAM: Segment Anything Model—a foundation model for image segmentation that can cut out objects based on prompts

Stable Diffusion: A generative AI model that creates images from text descriptions

RefinedWeb: A large-scale dataset of high-quality web text used to maintain LLM language capabilities during training

LAION-400M: A massive dataset of image-text pairs from the internet

LAION-COCO: A dataset where images have synthetic captions generated by an AI model

visual embeddings: Numerical vector representations of image content produced by an encoder

sub-images: Smaller cropped sections of a high-resolution image processed independently to preserve detail