SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models

📝 Paper Summary

Multi-modal Large Language Models (MLLMs) Visual Instruction Tuning

SPHINX-X scales multi-modal LLMs from 1B to 8x7B parameters using a streamlined one-stage training pipeline, efficient visual encoding with skip tokens, and extensive multi-domain datasets.

Core Problem

Existing MLLMs are constrained by limited training data domains (mostly natural images) and a narrow range of model sizes (typically 7B/13B), hindering both edge deployment and complex reasoning.

Why it matters:

Narrow data coverage leads to poor performance on out-of-domain tasks like OCR, charts, and mathematical reasoning
Fixed model sizes (7B-13B) are too large for mobile devices yet insufficient for high-end reasoning capabilities
Redundant visual encoders and multi-stage training pipelines in prior work (like SPHINX) increase computational cost and complexity

Concrete Example: When processing high-resolution images with large aspect ratios (e.g., 2:1), standard tiling approaches generate fully-padded sub-images containing only zeros. These waste computation in the vision encoder and LLM, as the model processes useless tokens.

Key Novelty

SPHINX-X Family (Scaling & Simplification)

Eliminates redundant visual encoders from SPHINX, keeping only a complementary 'Mixture of Visual experts' (DINOv2 + CLIP-ConvNeXt)
Introduces learnable 'skip tokens' to replace fully-padded sub-images during tiling, reducing sequence length and computation
Consolidates training into a single-stage 'all-in-one' paradigm using a massive multi-domain dataset, including custom OCR and Set-of-Mark data

Architecture

The training pipeline and architecture of SPHINX-X, detailing the Mixture of Visual experts (MoV), skip token mechanism, and one-stage training.

Evaluation Highlights

SPHINX-Plus (13B) achieves 71.0 on MMBench, surpassing the original SPHINX (67.1) and LLaVA1.5-13B (67.7)
SPHINX-MoE (Mixtral 8x7B) demonstrates strong reasoning, reaching 36.8% on MathVista and 71.3% on SEED-Bench
SPHINX-Tiny (1.1B) achieves 56.6 on MMBench, outperforming larger baselines like InstructBLIP-7B (53.4) despite having far fewer parameters

Breakthrough Assessment

8/10

Offers a comprehensive open-source family of MLLMs covering diverse scales (1B to MoE) with solid performance gains and practical architectural improvements (skip tokens, simplified training).

⚙️ Technical Details

Problem Definition

Setting: Multi-modal instruction following and visual question answering across diverse domains

Inputs: Image I (possibly high-resolution) and natural language instruction T

Outputs: Textual response R generated by the LLM

Pipeline Flow

Image Preprocessing (Tiling & Padding)
Mixture of Visual Experts (MoV) Encoding
Skip Token Replacement
Projection to LLM Space
LLM Generation

System Modules

Image Tiling (Input Processing)

Splits high-resolution images into sub-images and a global downsampled view

Model or implementation: Rule-based cropping

Mixture of Visual Experts (MoV)

Encodes visual content using two complementary backbones

Model or implementation: DINOv2 + CLIP-ConvNeXt

Skip Token Mechanism (Input Processing)

Replaces embeddings of fully-padded sub-images with a learnable token

Model or implementation: Learnable Embedding

LLM Backbone

Generates text response based on visual and text inputs

Model or implementation: Various (TinyLlama-1.1B, InternLM2-7B, LLaMA2-13B, Mixtral-8x7B)

Novel Architectural Elements

Learnable skip tokens to bypass computation for fully-padded sub-images in high-aspect-ratio inputs
Streamlined 'Mixture of Visual experts' (MoV) using only DINOv2 and CLIP-ConvNeXt

Modeling

Base Model: Family of models: TinyLlama-1.1B, InternLM2-7B, LLaMA2-13B, Mixtral-8x7B

Training Method: One-stage all-in-one instruction tuning

Trainable Parameters: All LLM parameters and projection layers (Vision encoders frozen)

Training Data:

Public datasets (converted to multi-turn conversation): Language, Vision, Vision-Language tasks
OCR-intensive dataset (PaperText): 3M text-dense PDF pages
Set-of-Mark (SoM) dataset: Multi-domain images with fine-grained marks and GPT-4V generated captions

Compute: Not reported in the paper

Comparison to Prior Work

vs. SPHINX: Uses fewer vision encoders (2 vs 4), single-stage training (vs 2-stage), and adds skip tokens
vs. LLaVA-1.5: Leverages a richer 'Mixture of Visual experts' (DINOv2 + ConvNeXt) compared to LLaVA's single CLIP encoder
vs. Qwen-VL: Supports a wider range of base model scales (1B to 8x7B MoE)

Limitations

No specific computational cost or training time metrics provided
Performance on MME (Perception) for SPHINX-Plus (1457.7) is lower than original SPHINX (1560.2) despite improvements elsewhere
Reliance on GPT-4/GPT-4V for data generation/annotation introduces dependency on closed-source models

Reproducibility

Code: https://github.com/Alpha-VLLM/LLaMA2-Accessory

Code and models are publicly released at https://github.com/Alpha-VLLM/LLaMA2-Accessory. The paper details data collection sources but does not specify training hours or GPU resources.

📊 Experiments & Results

Evaluation Setup

Comprehensive benchmarking across multi-modal tasks including perception, reasoning, and OCR

Benchmarks:

MMBench (MMB) (Multi-modal reasoning evaluation)
MME (Perception and Cognition evaluation)
MM-Vet (Integrated capability evaluation)
MathVista (Visual mathematical reasoning)
SEED-Bench (Generative multi-modal benchmark)

Metrics:

Accuracy
Score (benchmark specific)
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Comparison of SPHINX-Plus (13B) against the original SPHINX (13B) shows improvements in reasoning-heavy benchmarks.
MMBench (MMB)	Score	67.1	71.0	+3.9
MM-Vet	Score	36.6	47.9	+11.3
MathVista	Score	27.5	36.8	+9.3
Scaling analysis shows performance gains with larger models and MoE architectures.
MME (Cognition)	Score	283.6	367.1	+83.5
MMBench (MMB)	Score	53.4	56.6	+3.2

Experiment Figures

Radar chart comparing SPHINX-X variants against other MLLMs across multiple tasks.

Main Takeaways

Scaling up parameters (to 8x7B MoE) consistently boosts multi-modal understanding and reasoning capabilities.
Small models (1.1B) can achieve competitive performance suitable for edge devices when trained with high-quality data and efficient architectures.
Enriching dataset diversity (OCR, Set-of-Mark) and scale significantly benefits performance on reasoning-intensive benchmarks like MathVista and MM-Vet.
Simplifying the visual encoder to a dual-expert MoV and using skip tokens improves efficiency without compromising (and often improving) accuracy.

📚 Prerequisite Knowledge

Prerequisites

Transformer architecture
Vision Transformers (ViT)
Multi-modal Large Language Models (MLLM)
Mixture of Experts (MoE)
Visual Instruction Tuning

Key Terms

MoV: Mixture of Visual experts—the combination of distinct vision encoders (DINOv2 and CLIP-ConvNeXt) used to capture complementary visual features

Skip token: A learnable token used to replace fully-padded (all-zero) sub-images in the input sequence, saving computation

Set-of-Mark (SoM): A prompting technique where regions of an image are marked with identifiers (boxes, numbers) to facilitate fine-grained referencing and grounding

MoE: Mixture of Experts—a model architecture where different 'expert' sub-networks are sparsely activated for different inputs

OCR: Optical Character Recognition—converting images of text into machine-encoded text

DINOv2: A self-supervised vision transformer model known for learning robust visual features

CLIP-ConvNeXt: A ConvNeXt model trained with CLIP (Contrastive Language-Image Pre-training) objective