
The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)

Zhengyuan Yang, Linjie Li, Kevin Lin, Jianfeng Wang, Chung-Ching Lin, Zicheng Liu, Lijuan Wang
Microsoft Corporation
arXiv
MM Reasoning Benchmark Agent

📝 Paper Summary

Large Multimodal Models (LMMs), Prompt Engineering, Visual Reasoning
This report provides a comprehensive qualitative exploration of GPT-4V(ision)'s capabilities, input modes, and prompting techniques to demonstrate its potential as a powerful multimodal generalist system.
Core Problem
The capabilities, working modes, and effective prompting strategies for state-of-the-art Large Multimodal Models (LMMs) like GPT-4V are largely unexplored and undocumented.
Why it matters:
  • Existing research relies on models of limited size or data scale, which restricts the emergence of the advanced abilities seen in large-scale systems
  • Understanding these capabilities is crucial for developing next-generation multimodal tasks and leveraging LMMs for real-world problem solving
Concrete Example: A user wants to count the apples in an image, but a simple prompt gives the wrong count. The paper shows that techniques such as 'condition on good performance' (e.g., 'Let's count row-by-row to be sure') let the model succeed where the plain prompt fails.
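The prompt change in this example can be sketched as a small helper. This is only an illustration: the function name `build_counting_prompt` and the exact wording of both prompts are assumptions based on the example above, and no model API is called.

```python
def build_counting_prompt(obj: str, condition_on_good_performance: bool = False) -> str:
    """Build a text prompt asking a multimodal model to count `obj` in an image.

    Hypothetical helper: the wording paraphrases the summary's example and is
    not quoted from the paper.
    """
    prompt = f"Count the number of {obj} in the image."
    if condition_on_good_performance:
        # Append an instruction that steers the model toward a careful,
        # structured counting procedure.
        prompt += " Let's count row-by-row to be sure."
    return prompt


plain = build_counting_prompt("apples")
strengthened = build_counting_prompt("apples", condition_on_good_performance=True)
```

The two strings differ only in the appended instruction, which makes it easy to compare model answers for the same image under both prompts.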
Key Novelty
Comprehensive Qualitative Exploration of GPT-4V
  • Systematically categorizes supported input modes, including unique capabilities like processing interleaved image-text and visual pointers drawn on images
  • Identifies and evaluates effective prompting techniques specific to LMMs, such as 'visual referring prompting' where users edit pixels to instruct the model
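As a rough illustration of what "editing pixels to instruct the model" can mean, the sketch below draws a box outline directly onto an image and pairs it with a text prompt that refers to the box. The image is a plain nested list of RGB tuples, and the names `draw_box`, `RED`, and the prompt wording are assumptions made for this example, not taken from the paper.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]  # rows of RGB pixels

RED: Pixel = (255, 0, 0)
WHITE: Pixel = (255, 255, 255)


def draw_box(image: Image, top: int, left: int, bottom: int, right: int,
             color: Pixel = RED) -> Image:
    """Return a copy of `image` with a 1-pixel rectangle outline drawn on it."""
    out = [row[:] for row in image]  # copy each row so the input is untouched
    for x in range(left, right + 1):
        out[top][x] = color      # top edge
        out[bottom][x] = color   # bottom edge
    for y in range(top, bottom + 1):
        out[y][left] = color     # left edge
        out[y][right] = color    # right edge
    return out


# A blank 5x6 image stands in for a real photo.
blank: Image = [[WHITE] * 6 for _ in range(5)]
marked = draw_box(blank, top=1, left=1, bottom=3, right=4)

# The text prompt then refers to the drawn marker instead of coordinates.
prompt = "Describe the object inside the red box."
```

The point of the design is that the region of interest is carried by the image itself, so the text prompt does not need to contain coordinates.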
Evaluation Highlights
  • Demonstrates, through qualitative examples, human-level capability across diverse domains, including celebrity recognition, medical imaging, and abstract visual reasoning
  • Showcases 'visual referring prompting' (drawing on images) as a viable new interaction method for precise instruction
  • Validates the model's ability to handle arbitrarily interleaved image-text inputs for complex reasoning tasks
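One way to picture an interleaved image-text input is as an ordered sequence of typed segments. The sketch below is a minimal data structure under that assumption; the class names `Text` and `ImageRef`, the helper `describe`, and the file names are hypothetical and do not correspond to any particular model API.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Text:
    content: str


@dataclass
class ImageRef:
    path: str  # hypothetical file path standing in for image data


Segment = Union[Text, ImageRef]


def describe(segments: List[Segment]) -> str:
    """Render the sequence as a readable outline, preserving segment order."""
    lines = []
    for seg in segments:
        if isinstance(seg, Text):
            lines.append(f"[text] {seg.content}")
        else:
            lines.append(f"[image] {seg.path}")
    return "\n".join(lines)


# Text and images alternate freely; order is part of the meaning.
interleaved: List[Segment] = [
    Text("Compare the first image"),
    ImageRef("image_1.png"),
    Text("with the second image"),
    ImageRef("image_2.png"),
    Text("and list the differences."),
]
outline = describe(interleaved)
```

Keeping the segments in a single ordered list, rather than separate text and image fields, preserves where each image sits relative to the surrounding text.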
Breakthrough Assessment
9/10
This is a foundational report establishing the baseline capabilities and prompting paradigms for modern LMMs. It defines the 'visual referring prompting' technique and comprehensively maps the landscape of GPT-4V's abilities.