Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding

📝 Paper Summary

Vision-Language Models (VLMs), Video Understanding

Molmo2 is a fully open-source family of vision-language models trained on 9 novel datasets to achieve state-of-the-art video understanding, pointing, and tracking without distilling proprietary models.

Core Problem

Strong video-language models remain proprietary, and open alternatives often rely on distilled data from closed models or lack crucial grounding capabilities like pointing and tracking in time.

Why it matters:

The open-source community lacks the foundational data and recipes needed to improve state-of-the-art video models without relying on proprietary distillation
Downstream applications (robotics, sports analytics) need precise grounding—pointing to moments or tracking objects—which most current VLMs cannot do
Existing open datasets are often short, lack density, or do not support complex temporal reasoning

Concrete Example: A user asks 'How many times does the robot grasp the red block?'. A standard VLM answers '3 times' (text only). Molmo2 answers '3 times' and also outputs a timestamp and spatial coordinates for each grasp event, so the user can jump to those exact moments.

Key Novelty

Fully Open Video Grounding & Understanding Pipeline (Molmo2)

Introduces 9 new human-annotated and synthetic datasets built without proprietary model distillation, covering dense video captions, tracking, and pointing
Extends 2D image pointing to the temporal domain, enabling models to output points for events in space and time (video pointing) and continuous object tracks
Utilizes a training recipe with efficient sequence packing, message-tree encoding, and token-weighting to handle diverse inputs (images, multi-images, videos)

Architecture

The Molmo2 architecture showing the connection between the Vision Encoder and LLM, specifically how video frames and point coordinates are handled.

Evaluation Highlights

Outperforms Qwen3-VL on video counting (35.5 vs 29.6 accuracy) and the proprietary Gemini 3 Pro on video pointing (38.4 vs 20.0 F1)
Achieves best-in-class performance among open-weight models on short video benchmarks and captioning, with 86.2 average on short-video QA tasks
Surpasses Gemini 2.5 Pro on tracking benchmarks (ReasonVOS: 78.8 vs 52.6 J&F) and sets new state-of-the-art for open models

Breakthrough Assessment

9/10

Significant contribution by releasing high-quality, non-distilled video grounding data and models that rival proprietary systems. The addition of temporal pointing/tracking to a generalist VLM is a major capability jump.

⚙️ Technical Details

Problem Definition

Setting: Unified Vision-Language Modeling for Single Image, Multi-Image, and Video inputs

Inputs: Text query q combined with visual input V (image, set of images, or video frames)

Outputs: Text response A, potentially containing special coordinates for pointing (x, y, t) or tracking objects

Pipeline Flow

Visual Encoder (SigLIP/CLIP style)
Connector (Pooling & Projection)
LLM Backbone (Qwen/OLMo)
Output Generation (Text + Coordinates)

System Modules

Vision Encoder (Input Processing)

Encode images or video frames into patch features

Model or implementation: ViT (Vision Transformer), utilizing crops (up to 24 for inference)

Connector (Input Processing)

Pool and project visual features into LLM token space

Model or implementation: MLP with attention pooling
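
To make the effect of the pooling window on visual token count concrete, here is a minimal sketch. It is not the Molmo2 connector: plain average pooling stands in for the attention pooling named above, and the function name, the 24x24 patch grid, and the feature width are illustrative assumptions. The window sizes 2 and 3 are the ones this summary mentions; which input type uses which window is not stated here.

```python
import numpy as np

def pool_patch_features(features: np.ndarray, k: int) -> np.ndarray:
    """Average-pool an (H, W, D) grid of patch features with a k x k window.

    Returns an array of shape (ceil(H/k) * ceil(W/k), D), i.e. one pooled
    token per window. Average pooling is a stand-in for attention pooling.
    """
    h, w, d = features.shape
    pad_h, pad_w = (-h) % k, (-w) % k
    # Pad with NaN so that partially filled edge windows average only real patches.
    padded = np.pad(
        features.astype(float),
        ((0, pad_h), (0, pad_w), (0, 0)),
        constant_values=np.nan,
    )
    rows, cols = padded.shape[0] // k, padded.shape[1] // k
    windows = padded.reshape(rows, k, cols, k, d)
    pooled = np.nanmean(windows, axis=(1, 3))
    return pooled.reshape(rows * cols, d)

# Hypothetical 24x24 patch grid with 8-dimensional features.
grid = np.ones((24, 24, 8))
print(pool_patch_features(grid, 2).shape)  # (144, 8)
print(pool_patch_features(grid, 3).shape)  # (64, 8)
```

On this example grid the 3x3 window yields 64 tokens against 144 for the 2x2 window, which shows how a larger window reduces the tokens spent on each frame or crop.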

LLM Backbone

Autoregressive generation of text and points

Model or implementation: Qwen2.5-7B (Molmo2-8B), Qwen2.5-3B (Molmo2-4B), or OLMo-7B (Molmo2-O-7B)

Novel Architectural Elements

Unified handling of video frames and multi-crop images via distinct pooling strategies (2x2 vs 3x3) in a shared connector
Integration of explicit point coordinates (x, y, t, ID) into the LLM vocabulary for temporal grounding and tracking
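
As an illustration of how (x, y, t, ID) points can double as object tracks, the sketch below groups decoded points by object ID and orders them in time. The `VideoPoint` class, its field names, and the coordinate units are hypothetical; the summary does not specify the model's actual output format.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class VideoPoint:
    x: float        # horizontal position (units are an assumption)
    y: float        # vertical position (units are an assumption)
    t: float        # timestamp in seconds (assumption)
    object_id: int  # identity shared by all points of one tracked object

def group_into_tracks(points):
    """Group points by object_id and sort each group by timestamp."""
    tracks = defaultdict(list)
    for point in points:
        tracks[point.object_id].append(point)
    return {oid: sorted(pts, key=lambda p: p.t) for oid, pts in tracks.items()}

points = [
    VideoPoint(0.40, 0.55, 2.0, object_id=1),
    VideoPoint(0.10, 0.20, 0.5, object_id=2),
    VideoPoint(0.35, 0.50, 1.0, object_id=1),
]
tracks = group_into_tracks(points)
print([p.t for p in tracks[1]])  # [1.0, 2.0]
```

Under this representation, a single pointing answer is a track of length one, and counting reduces to counting distinct IDs.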

Modeling

Base Model: Qwen2.5-7B (Molmo2-8B), Qwen2.5-3B (Molmo2-4B), OLMo-7B (Molmo2-O-7B)

Training Method: Three-stage training: Pre-training → Joint SFT → Long-context SFT

Objective Functions:

Purpose: Minimize prediction error for next token.

Formally: Standard cross-entropy loss on text and point tokens

Adaptation: Full fine-tuning of ViT, Connector, and LLM

Trainable Parameters: All parameters (ViT + Connector + LLM)

Training Data:

Stage 1 (Pre-training): PixMo-Cap, Tulu (NLP), PixMo-Points/Count
Stage 2 (SFT): Mixture of 9 new Molmo2 datasets (Video Cap, QA, Point, Track) + academic datasets
Stage 3 (Long-context SFT): Long-context training on the same mixture with a sequence length of 36,864 tokens (36k)

Key Hyperparameters:

batch_size: 128
max_sequence_length_sft: 16,384
max_sequence_length_long: 36,864
sft_steps: 30,000
long_context_steps: 2,000
token_weight_video_cap: 0.1
token_weight_pointing: 0.2
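
The token weights above can be read as per-token multipliers on the next-token cross-entropy loss. The following is a minimal NumPy sketch of that idea, not the released training code: the function name is hypothetical, and normalizing by the sum of weights is an assumption about how the weighted loss is averaged.

```python
import numpy as np

def weighted_next_token_loss(logits, targets, weights):
    """Weighted cross-entropy over a sequence.

    logits:  (T, V) unnormalized scores for each position
    targets: (T,)   index of the correct next token at each position
    weights: (T,)   per-token loss weight (e.g. 0.1 for video-caption tokens)
    """
    logits = np.asarray(logits, dtype=float)
    targets = np.asarray(targets)
    weights = np.asarray(weights, dtype=float)
    # Numerically stable log-softmax.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    nll = -log_probs[np.arange(len(targets)), targets]
    # Assumption: normalize by total weight so the loss scale stays comparable.
    return float((weights * nll).sum() / weights.sum())

# Two caption tokens (weight 0.1) and one pointing token (weight 0.2).
logits = np.zeros((3, 4))  # uniform predictions over a 4-token vocabulary
loss = weighted_next_token_loss(logits, [0, 1, 2], [0.1, 0.1, 0.2])
```

With uniform logits every token has the same negative log-likelihood, so the weights only matter once predictions differ across tasks. Down-weighting long caption targets then keeps them from dominating the gradient.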

Compute: Not reported in the paper

Comparison to Prior Work

vs. Qwen2-VL: Molmo2 explicitly models temporal pointing and object tracking via point tokens, significantly outperforming on grounding tasks
vs. LLaVA-Video: Molmo2 uses human-narrated dense captions rather than GPT-generated ones, avoiding distillation bias
vs. Gemini 3 Pro: Molmo2 is fully open weights/data and outperforms Gemini on fine-grained tracking metrics (J&F)

Limitations

Lags behind best open-weight models on some OCR-heavy benchmarks (e.g., DocVQA)
Performance on very long videos (10+ mins) limited by lack of open-source long training data
Slightly behind large proprietary models on math/reasoning benchmarks (MathVista, MMMU)
Video captioning performance drops if pool size is increased, indicating sensitivity to visual token density

Reproducibility

Code: https://github.com/allenai/molmo2

Models (4B, 8B, O-7B), datasets (Molmo2-Cap, VideoPoint, VideoTrack, etc.), and training code are publicly released. Training relies on the efficient packing and message-tree implementations provided in the repo. No proprietary model distillation was used for data creation.

📊 Experiments & Results

Evaluation Setup

Comprehensive evaluation across video understanding, captioning, counting, pointing, tracking, and image benchmarks

Benchmarks:

MVBench (Video Understanding)
Molmo2-CapTest (Dense Video Captioning) [New]
Molmo2-VideoPointVal (Spatio-temporal Pointing) [New]
BURST-VideoCount (Video Counting)
ReasonVOS (Video Object Tracking)

Metrics:

Accuracy
F1 Score
J&F (Jaccard and F-measure)
Elo Score (Human Preference)
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Molmo2 achieves state-of-the-art results among open-weight models on general video understanding benchmarks.
MVBench	Accuracy	68.7	75.9	+7.2
Video-MME	Accuracy	71.4	69.9	-1.5
In grounding tasks (pointing and tracking), Molmo2 shows massive improvements over existing models, including proprietary ones.
Molmo2-VP (Video Pointing)	F1	20.0	38.4	+18.4
ReasonVOS	J&F	62.7	81.3	+18.6
BURST-VC (Video Counting)	Accuracy	29.6	35.5	+5.9

Experiment Figures

Overview of the Molmo2 capabilities, data pipeline, and key tasks (Spatio-Temporal Localization, Fine-grained Understanding, Object Tracking).

Main Takeaways

Molmo2 sets a new standard for open-weight video grounding, excelling at pointing and tracking, tasks where previous models either performed poorly or lacked the capability entirely.
Human-annotated dense video captions (Molmo2-Cap) are critical; distilling proprietary models is not necessary for SOTA performance.
Token-weighting and bi-directional attention are essential training strategies for balancing diverse tasks (captioning vs. QA) in a single model.
While competitive on short videos, open models (including Molmo2) still lag on very long video understanding due to data scarcity.

📚 Prerequisite Knowledge

Prerequisites

Transformer architecture (ViT and LLM)
Vision-Language Connectors (projecting visual features to LLM space)
Instruction Tuning / Supervised Fine-Tuning (SFT)

Key Terms

VLM: Vision-Language Model—an AI model that processes both images/video and text to generate text outputs

Grounding: The ability of a model to link textual concepts to specific pixels or timeframes in the visual input (e.g., bounding boxes or points)

SFT: Supervised Fine-Tuning—training a model on labeled input-output pairs to teach it to follow instructions

ViT: Vision Transformer—a neural network architecture that processes images by splitting them into patches

Message-tree: A data structure used during training where a single visual input is the root, and multiple distinct QA pairs are branches, packed into one sequence with masking to prevent cross-contamination
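
One way to picture the masking is as an attention mask built from segment labels. In this sketch, segment 0 marks the shared root (the visual input) and each positive integer marks one QA branch. The function name and the labeling scheme are hypothetical, and attention is kept strictly causal for simplicity, which may differ from the attention pattern actually used over visual tokens.

```python
import numpy as np

def message_tree_mask(segment_ids):
    """Boolean (n, n) mask; entry [i, j] is True if token i may attend to token j.

    Segment 0 is the shared root; segments 1, 2, ... are separate branches.
    A token sees earlier tokens in the root or in its own branch only.
    """
    seg = np.asarray(segment_ids)
    n = len(seg)
    causal = np.tril(np.ones((n, n), dtype=bool))
    same_branch = seg[:, None] == seg[None, :]
    key_is_root = seg[None, :] == 0
    return causal & (same_branch | key_is_root)

# Two root tokens followed by two QA branches of two tokens each.
mask = message_tree_mask([0, 0, 1, 1, 2, 2])
print(mask[4, 0])  # True: branch 2 attends to the root
print(mask[4, 2])  # False: branch 2 cannot see branch 1
```

Because every branch shares the root tokens, the visual input is encoded once per packed sequence rather than once per QA pair.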

Packing: Combining multiple short training examples into a single long sequence to maximize GPU efficiency
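
A simple way to pack examples is first-fit decreasing bin packing over token counts. The sketch below is a generic illustration under that assumption, not the packing algorithm from the Molmo2 repo. The example lengths are made up, and 16,384 is the SFT sequence length listed under Key Hyperparameters.

```python
def pack_sequences(lengths, max_len):
    """Assign examples to packed sequences using first-fit decreasing.

    lengths: token count of each example
    max_len: capacity of one packed sequence
    Returns a list of packs, each a list of example indices.
    """
    packs = []  # each entry: [remaining_capacity, [example indices]]
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    for i in order:
        n = lengths[i]
        if n > max_len:
            raise ValueError(f"example {i} has {n} tokens, above max_len={max_len}")
        for pack in packs:
            if pack[0] >= n:  # first pack with enough room
                pack[0] -= n
                pack[1].append(i)
                break
        else:
            packs.append([max_len - n, [i]])
    return [indices for _, indices in packs]

print(pack_sequences([9000, 7000, 4000, 12000, 300], max_len=16384))
# [[3, 2, 4], [0, 1]]
```

Here five examples fit into two packed sequences instead of five padded ones. A real implementation would also need an attention mask so that packed examples do not attend to each other.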

Token-weighting: Assigning different loss weights to tokens from different tasks (e.g., lower weights for long captions) to balance learning

J&F: Jaccard and F-measure—a standard metric for evaluating video object segmentation accuracy
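
For intuition, J is the intersection-over-union of the predicted and ground-truth masks, F is the harmonic mean of boundary precision and recall, and J&F is commonly reported as their average. The sketch below computes J from binary masks and F from precision and recall values supplied by the caller. Extracting and matching mask boundaries is omitted, and the function names are illustrative.

```python
import numpy as np

def jaccard(pred_mask, gt_mask):
    """Region similarity J: intersection over union of two binary masks."""
    pred = np.asarray(pred_mask, dtype=bool)
    gt = np.asarray(gt_mask, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty: treated as a perfect match (a convention)
    return float(np.logical_and(pred, gt).sum() / union)

def f_measure(precision, recall):
    """Harmonic mean of boundary precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def j_and_f(j, f):
    """Average of region similarity and boundary accuracy."""
    return (j + f) / 2

pred = [[1, 1], [0, 0]]
gt = [[1, 0], [1, 0]]
j = jaccard(pred, gt)  # 1 overlapping pixel out of 3 in the union
```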

Elo Score: A comparative ranking system used here to measure human preference between model outputs
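
The standard Elo update after a single pairwise comparison can be sketched as follows. This is the textbook formula. The K-factor of 32, the starting rating of 1000, and the function name are illustrative choices, and the paper's exact rating procedure is not described in this summary.

```python
def elo_update(rating_a, rating_b, score_a, k=32.0):
    """Return updated ratings after one comparison.

    score_a: 1.0 if output A was preferred, 0.0 if B was, 0.5 for a tie.
    """
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta

# Two models start level; a human prefers model A's output.
print(elo_update(1000.0, 1000.0, score_a=1.0))  # (1016.0, 984.0)
```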

Distillation: Training a smaller student model using outputs from a larger, often proprietary, teacher model