Monkey: Image Resolution and Text Label are Important Things for Large Multi-Modal Models

📝 Paper Summary

Large Multimodal Models (LMMs) | Vision-Language Pre-training | High-Resolution Image Processing

Monkey enhances Large Multimodal Models by processing high-resolution images via sliding window patches with individual adapters and training on automatically generated multi-level descriptions.

Core Problem

Existing LMMs struggle with high-resolution inputs due to the high cost of training large vision encoders from scratch, and suffer from poor image-text alignment because standard training datasets have overly simple captions.

Why it matters:

Low resolution limits the detection of small objects and dense text, crucial for tasks like document understanding and detailed scene analysis.
Simple captions (e.g., 'a dog') fail to teach models complex spatial relationships and attributes, leading to hallucinations or missed details.
Retraining vision encoders for higher resolutions is computationally prohibitive for many researchers.

Concrete Example: In an image of a storefront, a standard LMM might see 'a store', but fail to read the small 'Emporio Armani' text or notice a specific person in the background because the input was downscaled to 448x448.

Key Novelty

Resolution Enhancement via Sliding Window Patches & Multi-Level Description Generation

Instead of retraining a vision encoder for large images, the image is split into patches. Each patch is processed by the same frozen encoder (originally trained on smaller images) but with a unique Low-Rank Adapter (LoRA) to handle spatial variations.
To improve training data quality, a multi-stage pipeline combines multiple specialized models (BLIP2, PPOCR, GRIT, SAM, ChatGPT) to generate layered captions covering global context, specific regions, objects, and text.
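The multi-level description idea can be pictured as collecting annotations at several granularities and handing them to a language model for fusion. The sketch below is illustrative only: `ImageAnnotations`, `build_summary_prompt`, the prompt wording, and the mapping of tools to fields are hypothetical and are not taken from the paper or its code.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ImageAnnotations:
    # Which tool fills which field is an assumption made for illustration.
    global_caption: str                                        # e.g. a captioner such as BLIP2
    region_captions: List[str] = field(default_factory=list)   # e.g. a region captioner such as GRIT
    object_notes: List[str] = field(default_factory=list)      # e.g. descriptions of SAM segments
    ocr_text: List[str] = field(default_factory=list)          # e.g. an OCR engine such as PPOCR


def build_summary_prompt(ann: ImageAnnotations) -> str:
    """Assemble one prompt that asks an LLM to merge all levels into a single description."""
    sections = [
        ("Global caption", [ann.global_caption]),
        ("Region captions", ann.region_captions),
        ("Objects", ann.object_notes),
        ("Text in image", ann.ocr_text),
    ]
    lines = ["Combine the following annotations into one detailed description."]
    for title, items in sections:
        if items:  # skip levels that produced nothing
            lines.append(f"{title}:")
            lines.extend(f"- {item}" for item in items)
    return "\n".join(lines)


example = ImageAnnotations(
    global_caption="a storefront on a city street",
    region_captions=["a person walking past the window"],
    ocr_text=["Emporio Armani"],
)
print(build_summary_prompt(example))
```

The point of the structure is that small text and background details reach the final caption even when the global captioner alone would have produced only "a store".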

Architecture

The Monkey architecture showing image patching, shared ViT with LoRA adapters, resampler, and LLM.

Evaluation Highlights

+9.77% average improvement over Qwen-VL on Document-oriented VQA tasks (e.g., DocVQA, ChartQA) due to higher resolution handling.
Achieves a perception score of 1505.3 on the MME benchmark, ranking second among the tested models.
Outperforms GPT-4V in qualitative tests for dense text question answering, successfully identifying small text elements like store names that GPT-4V missed.

Breakthrough Assessment

8/10

Significantly improves resolution handling without expensive pre-training and demonstrates the critical importance of data quality via multi-level descriptions. Strong results on document/text tasks.

⚙️ Technical Details

Problem Definition

Setting: Vision-Language Instruction Tuning

Inputs: High-resolution image I and natural language instruction/question

Outputs: Natural language response (caption or answer)

Pipeline Flow

Image Partitioning (Sliding Window)
Visual Encoding (ViT with LoRA)
Resampling (Shared Resampler)
LLM Generation (Qwen-VL base)

System Modules

Image Partitioner

Divides high-res input image into patches matching the encoder's native size (e.g., 448x448) using a sliding window.

Model or implementation: Sliding Window Mechanism
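As a minimal sketch of the partitioning step, the function below computes the crop boxes for an image whose sides are multiples of the window size. It assumes a stride equal to the window (non-overlapping crops) and that the image has already been resized to a supported resolution; `partition_boxes` is a hypothetical helper name, not a function from the Monkey repository.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def partition_boxes(width: int, height: int, window: int = 448) -> List[Box]:
    """Return window-sized crop boxes in row-major order (stride == window, an assumption)."""
    if width % window or height % window:
        raise ValueError("this sketch expects sides that are multiples of the window size")
    return [
        (left, top, left + window, top + window)
        for top in range(0, height, window)
        for left in range(0, width, window)
    ]


boxes = partition_boxes(1344, 896)
print(len(boxes))            # 6 (3 columns x 2 rows)
print(boxes[0], boxes[-1])   # (0, 0, 448, 448) (896, 448, 1344, 896)
```

Under these assumptions a 1344x896 input yields six local patches, which matches the six-patch ceiling listed under Limitations; the global view would be one extra 448x448 resize of the whole image.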

Visual Encoder

Extracts features from each patch and the global image independently.

Model or implementation: ViT-bigG (from Qwen-VL) with LoRA adapters

Visual Resampler

Summarizes visual information and aligns it with language feature space.

Model or implementation: Perceiver Resampler (inspired by Flamingo)
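The core of a resampler of this kind is cross-attention from a fixed set of learnable queries to the visual features, so the output length does not depend on how many features come in. The NumPy sketch below shows a single-head version with toy dimensions; the sizes, the random initialisation, and the function names are assumptions for illustration, not Monkey's actual configuration.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def resample(features: np.ndarray, queries: np.ndarray,
             w_k: np.ndarray, w_v: np.ndarray) -> np.ndarray:
    """Single-head cross-attention: (n, d) features -> (q, d) tokens, for any n."""
    keys = features @ w_k
    values = features @ w_v
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])
    return softmax(scores, axis=-1) @ values


rng = np.random.default_rng(0)
d, num_queries = 64, 16                       # toy sizes, chosen only for the demo
queries = rng.normal(size=(num_queries, d))   # stands in for the learnable queries
w_k = rng.normal(size=(d, d)) / np.sqrt(d)
w_v = rng.normal(size=(d, d)) / np.sqrt(d)

for n in (100, 400):                          # different numbers of input features
    out = resample(rng.normal(size=(n, d)), queries, w_k, w_v)
    print(n, "->", out.shape)                 # always (16, 64)
```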

Large Language Model

Generates the final text response based on visual embeddings and text instruction.

Model or implementation: Qwen-VL LLM (7.7B parameters)

Novel Architectural Elements

Patch-specific processing where a single frozen ViT processes multiple crops of a large image, potentially with distinct LoRA adapters for each patch position to handle spatial variations without resizing.
Integration of a global resized view alongside local patches to preserve structural information.
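The per-patch adapter idea can be sketched on a single linear layer: one frozen weight matrix is shared by every patch, and each patch slot adds its own low-rank update. This is a simplified illustration under stated assumptions: the class name, the toy layer sizes, and the omission of the usual LoRA scaling factor are choices made here, not details from the paper.

```python
import numpy as np


class PatchLoRALinear:
    """One frozen weight shared by all patches, plus one low-rank update per patch slot."""

    def __init__(self, weight: np.ndarray, rank: int, num_slots: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        d_out, d_in = weight.shape
        self.weight = weight                                            # frozen
        self.A = rng.normal(scale=0.01, size=(num_slots, rank, d_in))   # trainable
        self.B = np.zeros((num_slots, d_out, rank))                     # trainable, zero-initialised

    def __call__(self, x: np.ndarray, slot: int) -> np.ndarray:
        delta = self.B[slot] @ self.A[slot]    # (d_out, d_in) update of rank <= `rank`
        return x @ (self.weight + delta).T


rng = np.random.default_rng(1)
layer = PatchLoRALinear(rng.normal(size=(32, 64)), rank=16, num_slots=6)
x = rng.normal(size=(4, 64))
base = x @ layer.weight.T

print(np.allclose(layer(x, 0), base))      # True: zero-initialised B leaves the frozen layer unchanged
layer.B[1] = rng.normal(size=(32, 16))     # pretend training updated slot 1 only
print(np.allclose(layer(x, 1), base))      # False: slot 1 now behaves differently
print(np.allclose(layer(x, 0), base))      # True: slot 0 is unaffected
```

Because only `A` and `B` are trained, each additional patch slot costs a small number of parameters relative to the frozen encoder.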

Modeling

Base Model: Qwen-VL (ViT-bigG encoder + 7.7B LLM)

Training Method: Instruction Tuning with LoRA

Objective Functions:

Purpose: Minimize the difference between generated text and ground truth.

Formally: Standard language modeling loss (cross-entropy on next token prediction).
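Written out, the next-token objective takes the usual autoregressive form. The symbols are introduced here for illustration and are not the paper's notation: y_1, ..., y_T are the target response tokens, V is the set of visual tokens produced by the resampler, q is the text instruction, and θ denotes the trainable parameters.

```latex
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_{\theta}\left(y_t \mid y_{<t},\, V,\, q\right)
```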

Adaptation: LoRA (rank=16 for attention, 32 for MLP in encoder)

Trainable Parameters: Visual Resampler (90M), LoRA parameters (117M). Encoder and LLM backbone are largely frozen/adapted.

Training Data:

1.44M total examples
Image Captioning: COCO, TextCaps, detailed captions generated by Monkey method
General VQA: VQAV2, OKVQA, GQA, ScienceQA, VizWiz
Text-centric VQA: TextVQA, OCRVQA, AI2D
Document VQA: DocVQA, ChartQA, InfoVQA, DeepForm, etc.
CC3M (427k subset) with regenerated multi-level descriptions

Key Hyperparameters:

learning_rate: 1e-5
batch_size: 1024
warmup_steps: 100
weight_decay: 0.1
optimizer: AdamW
beta1: 0.9
beta2: 0.95
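For readers who want these values in one place, the sketch below collects them in a small config object and shows one possible warmup rule. The summary gives only the number of warmup steps, so the linear ramp and the constant rate afterwards are assumptions, as are the names `TrainConfig` and `lr_at`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    learning_rate: float = 1e-5
    batch_size: int = 1024
    warmup_steps: int = 100
    weight_decay: float = 0.1
    optimizer: str = "AdamW"
    beta1: float = 0.9
    beta2: float = 0.95


def lr_at(step: int, cfg: TrainConfig) -> float:
    """Linear warmup to the peak rate, then constant (the schedule shape is an assumption)."""
    if step < cfg.warmup_steps:
        return cfg.learning_rate * (step + 1) / cfg.warmup_steps
    return cfg.learning_rate


cfg = TrainConfig()
for step in (0, 49, 99, 500):
    print(step, lr_at(step, cfg))
```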

Compute: 40 A800 days for one epoch

Comparison to Prior Work

vs. Qwen-VL: Monkey supports 896x896 input (up to 1344x896) via patching without retraining the encoder, whereas Qwen-VL is trained natively at 448x448.
vs. LLaVA-1.5: Monkey uses a resampler and sliding-window patching to reach higher resolution, whereas LLaVA-1.5 resizes the image directly and uses an MLP projector.
vs. PaLI-X: Monkey avoids expensive curriculum pre-training for resolution scaling.

Limitations

Maximum of six patches due to LLM context length limits.
Multi-level description generation is bound by the world knowledge of the teacher models (BLIP2, CC3M annotations) and cannot identify specific geolocation if not already known.
Performance on some datasets (e.g., TextVQA) may saturate or slightly drop at extremely high resolutions when the original images are themselves small.
Generating the training data requires running multiple large models (BLIP2, SAM, GRIT, PPOCR, ChatGPT).

Reproducibility

Code: https://github.com/Yuliang-Liu/Monkey

Model weights and the generated datasets are not explicitly linked in the text, but are implied to be released with the repository.

📊 Experiments & Results

Evaluation Setup

Evaluated on 18 diverse datasets covering captioning, general VQA, text-centric VQA, and document VQA.

Benchmarks:

MME (Perception and Cognition Benchmark)
TextVQA (Scene Text VQA)
DocVQA (Document VQA)
VQAv2 (General VQA)

Metrics:

Accuracy
CIDEr (for captioning)
Score (MME)
Statistical methodology: Not explicitly reported in the paper

Key Results

Monkey shows significant improvements in document-oriented VQA tasks due to its high-resolution processing capabilities.

Benchmark	Metric	Baseline	This Paper	Δ
MME	Perception Score	1487.5	1505.3	+17.8
DocVQA	Accuracy	62.6	66.5	+3.9
ChartQA	Accuracy	66.3	67.6	+1.3
DeepForm	Accuracy	59.0	68.3	+9.3
TextVQA	Accuracy	61.5	77.5	+16.0
VizWiz	Accuracy	47.7	53.6	+5.9
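The Δ column is the absolute difference between the two score columns, so it can be re-derived directly from the table. The short check below recomputes it and also prints a relative change, which is an extra quantity added here for context and is not reported in the summary.

```python
# (benchmark, baseline, this paper) copied from the table above
rows = [
    ("MME", 1487.5, 1505.3),
    ("DocVQA", 62.6, 66.5),
    ("ChartQA", 66.3, 67.6),
    ("DeepForm", 59.0, 68.3),
    ("TextVQA", 61.5, 77.5),
    ("VizWiz", 47.7, 53.6),
]

deltas = [round(ours - base, 1) for _, base, ours in rows]

for (name, base, ours), delta in zip(rows, deltas):
    relative = 100 * (ours - base) / base   # relative change in percent
    print(f"{name:9s} {delta:+.1f}  ({relative:+.1f}% relative)")
```

Note that the MME row is a score on a different scale from the accuracy rows, so its absolute gain is not comparable with the others.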

Main Takeaways

High resolution is critical for text-centric and document tasks (DocVQA, TextVQA), yielding large gains over lower-resolution baselines.
The 'Monkey' method of patching + LoRA enables high-resolution processing without the massive cost of retraining the vision encoder from scratch.
Multi-level description generation improves model performance by providing richer training signals than standard short captions, especially when combined with high-resolution inputs.
Ablation studies show that using LoRA with the patching strategy is more effective than simple interpolation, and multiple LoRAs can help spatial understanding.

📚 Prerequisite Knowledge

Prerequisites

Transformer architecture (Vision Transformer and LLMs)
Low-Rank Adaptation (LoRA) for efficient fine-tuning
Visual Question Answering (VQA) tasks

Key Terms

LoRA: Low-Rank Adaptation—a technique to fine-tune large models by injecting trainable low-rank matrices while freezing the main weights

LMM: Large Multimodal Model—a model capable of processing and generating content across multiple modalities, typically image and text

ViT: Vision Transformer—a model architecture that processes images as sequences of patches using self-attention mechanisms

sliding window: A method of processing a large image by moving a fixed-size window over it to extract smaller crops/patches

resampler: A module (often using cross-attention) that compresses a variable number of visual features into a fixed number of tokens for the LLM

grounding: Linking textual concepts (like object names) to specific visual regions (like bounding boxes) in an image

hallucination: When a model generates plausible but incorrect information not present in the source input

zero-shot: The ability of a model to perform a task without having explicitly seen examples of that specific task during training