
On VLMs for Diverse Tasks in Multimodal Meme Classification

Deepesh Gavit, Debajyoti Mazumder, Samiran Das, Jasabanta Patro
Alibaba Group
arXiv (2025)
MM Pretraining Benchmark Reasoning

📝 Paper Summary

Large Vision-Language Models (LVLMs) Video Understanding
Qwen2-VL integrates a dynamic resolution mechanism and multimodal rotary embeddings to process images and videos of any aspect ratio or length at native resolution without padding.
Core Problem
Traditional VLMs resize inputs to a fixed resolution (e.g., 336x336), which destroys detail in high-resolution images and distorts aspect ratios. Processing images and videos through separate pipelines also prevents unified multimodal understanding.
Why it matters:
  • Fixed-resolution resizing makes text in documents or details in vertical/horizontal images unreadable
  • Padding images to squares wastes significant computational resources
  • Lack of unified 3D positional understanding limits performance on video tasks where temporal dynamics matter
Concrete Example: When processing a long vertical receipt, standard VLMs squash it into a square, making the text blurry and unreadable. Qwen2-VL processes it as a vertical strip of tokens at native resolution, preserving clarity.
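To make the receipt example concrete, here is a rough sketch of how many visual tokens survive under each approach. The 14-pixel patch size, the 2x2 token merge, and the receipt dimensions are assumptions chosen for illustration; none of them are stated in this summary.

```python
import math

def token_count(height, width, patch=14, merge=2):
    """Visual tokens for an image kept at the given resolution.

    patch and merge are assumed values, not taken from the summary:
    the image is cut into patch x patch tiles, then merge x merge
    neighbouring tiles are fused into one token.
    """
    rows = math.ceil(height / patch)
    cols = math.ceil(width / patch)
    return math.ceil(rows / merge) * math.ceil(cols / merge)

# Hypothetical tall receipt, 1960 px high and 420 px wide.
receipt_h, receipt_w = 1960, 420

native_tokens = token_count(receipt_h, receipt_w)  # vertical strip of tokens
fixed_tokens = token_count(336, 336)               # squashed into a square

# How much each axis shrinks when the receipt is forced into 336x336.
vertical_shrink = receipt_h / 336
horizontal_shrink = receipt_w / 336

print(native_tokens, fixed_tokens)  # 1050 144
print(round(vertical_shrink, 2), round(horizontal_shrink, 2))  # 5.83 1.25
```

Under these assumptions the fixed-size path compresses the vertical axis almost six times while barely touching the horizontal one, which is the distortion that blurs the text.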
Key Novelty
Naive Dynamic Resolution with Multimodal Rotary Embeddings (M-RoPE)
  • Treats images as variable-length sequences of patches based on their native resolution rather than resizing to a fixed grid, eliminating padding
  • Decomposes rotary positional embeddings into three components (time, height, width), creating a unified 3D coordinate system for both static images (time=1) and videos
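The unified coordinate system described above can be sketched as a function that assigns a (time, height, width) triple to every visual token. This is a minimal illustration of the idea, not the model's actual implementation; the function name `mrope_position_ids` and the zero-based indexing are assumptions.

```python
def mrope_position_ids(frames, rows, cols):
    """Return one (t, h, w) position triple per visual token.

    Tokens are listed frame by frame, then row by row. A static image
    is treated as a single frame, so its time index never changes.
    """
    return [
        (t, h, w)
        for t in range(frames)
        for h in range(rows)
        for w in range(cols)
    ]

# A static image with a 2x2 token grid: the time component is constant.
image_ids = mrope_position_ids(frames=1, rows=2, cols=2)
print(image_ids)  # [(0, 0, 0), (0, 0, 1), (0, 1, 0), (0, 1, 1)]

# A two-frame video with the same grid: only the time component differs
# between corresponding tokens of the two frames.
video_ids = mrope_position_ids(frames=2, rows=2, cols=2)
print(video_ids[4:])  # [(1, 0, 0), (1, 0, 1), (1, 1, 0), (1, 1, 1)]
```

Because images and videos share the same triple format, a single positional scheme covers both, and the grid dimensions can vary freely with the input's native resolution.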
Evaluation Highlights
  • +6.7% accuracy improvement on MathVista (Mini) for Qwen2-VL-72B compared to GPT-4o
  • Achieves 93.8% on DocVQA (test), outperforming GPT-4o and setting a new state-of-the-art for document understanding
  • SOTA performance on video understanding benchmarks like MVBench, surpassing GPT-4o by significant margins
Breakthrough Assessment
9/10
Introduces a foundational architectural shift (M-RoPE + Dynamic Resolution) that solves the long-standing resolution/aspect-ratio bottleneck in VLMs, delivering SOTA results across document, math, and video tasks.