
Gemini: A Family of Highly Capable Multimodal Models

Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M. Dai, Anja Hauth, Katie Millican, David Silver, Melvin Johnson, Ioannis Antonoglou, Julian Schrittwieser, Amelia Glaese, Jilin Chen, Emily Pitler, Timothy Lillicrap, Angeliki Lazaridou, Orhan Firat, James Molloy, Michael Isard, Paul R. Barham, Tom Hennigan, Benjamin Lee, Fabio Viola, Malcolm Reynolds, Yuanzhong Xu, Ryan Doherty, Eli Collins, et al.
Google
arXiv
MM Pretraining Benchmark Reasoning Speech Agent

📝 Paper Summary

Multimodal Foundation Models Large Language Models
Gemini is a family of multimodal models trained jointly on image, audio, video, and text data. It achieves state-of-the-art results on 30 of 32 benchmarks and is the first to reach human-expert performance on MMLU.
Core Problem
Previous multimodal models were often built from separately trained components that were stitched together, which limited their ability to reason deeply across modalities or handle complex interleaved inputs.
Why it matters:
  • Models tailored to single domains lack the generalist capabilities needed for complex real-world tasks involving mixed media (e.g., video, audio, and text simultaneously)
  • Stitched-together multimodal approaches often struggle with fine-grained understanding and reasoning that requires native integration of modalities from the start
  • Existing models failed to reach human-expert performance on broad knowledge benchmarks like MMLU
Concrete Example: A teacher draws a physics problem of a skier on a slope, and a student answers it in messy handwriting. A text-only model cannot see the page; a standard image captioner misses the mathematical nuance. Gemini natively reads the handwriting, understands the physics problem, identifies the student's specific reasoning error, and outputs the correct solution in LaTeX.
Key Novelty
Natively Multimodal Joint Training
  • Models are trained from the start on a dataset containing interleaved text, images, audio, and video, rather than training a text model and grafting on vision encoders later
  • Outputs can be natively interleaved text and images (using discrete image tokens), allowing for diverse generative tasks beyond just text responses
  • Audio signals at 16 kHz are ingested directly from Universal Speech Model (USM) features, preserving nuances that are lost when audio is first converted to text
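As a toy illustration of what "interleaved" input means, the sketch below represents a mixed prompt as one ordered list of typed segments and computes how many raw samples a clip contains at 16 kHz. The `Segment` class, the helper names, and the placeholder payloads are all hypothetical; this is not the model's tokenizer or input format, only a way to picture a single sequence that mixes modalities.

```python
from dataclasses import dataclass
from typing import List

AUDIO_SAMPLE_RATE_HZ = 16_000  # sampling rate stated in the summary


@dataclass
class Segment:
    """One piece of an interleaved prompt (hypothetical structure)."""
    modality: str  # "text", "image", "audio", or "video"
    content: str   # placeholder payload, not real model input


def modality_order(segments: List[Segment]) -> List[str]:
    """Return the modalities in the order they appear in the sequence."""
    return [s.modality for s in segments]


def audio_samples(duration_s: float) -> int:
    """Number of raw samples in a clip of the given length at 16 kHz."""
    return int(duration_s * AUDIO_SAMPLE_RATE_HZ)


# A mixed prompt: the segments stay in one sequence, in their original order.
prompt = [
    Segment("text", "What mistake did the student make?"),
    Segment("image", "<handwritten solution>"),
    Segment("audio", "<spoken hint>"),
]
order = modality_order(prompt)
samples = audio_samples(2.5)
```

The point of the sketch is only that order is preserved across modalities within one sequence, in contrast to pipelines that process each modality separately and merge the results afterwards.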
Evaluation Highlights
  • Gemini Ultra achieves 90.04% on MMLU (Massive Multitask Language Understanding), becoming the first model to exceed the human-expert score of 89.8%
  • Achieves state-of-the-art on 30 of 32 benchmarks evaluated: 10/12 text and reasoning, 9/9 image understanding, 6/6 video understanding, and 5/5 speech recognition and speech translation benchmarks
  • On the MMMU multimodal reasoning benchmark, Gemini Ultra scores 62.4%, outperforming the previous state-of-the-art (GPT-4V) by over 5 percentage points
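The headline figures can be checked with a quick tally. The sketch below sums the per-category counts behind the 30-of-32 claim and computes the MMLU margin over the human-expert score. The speech row (5 of 5) comes from the paper's abstract; the dictionary keys and variable names are illustrative.

```python
# Benchmark counts per category: (state-of-the-art results, benchmarks evaluated).
sota_by_category = {
    "text_and_reasoning": (10, 12),
    "image_understanding": (9, 9),
    "video_understanding": (6, 6),
    "speech": (5, 5),  # speech recognition and translation, per the paper's abstract
}

wins = sum(w for w, _ in sota_by_category.values())
total = sum(t for _, t in sota_by_category.values())

# Margin of Gemini Ultra's MMLU score over the reported human-expert score.
mmlu_margin = round(90.04 - 89.8, 2)

print(f"{wins} of {total} benchmarks")
print(f"MMLU margin over human-expert score: {mmlu_margin} points")
```

The MMLU margin is small (about a quarter of a percentage point), while the MMMU margin reported above is more than 5 points.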
Breakthrough Assessment
10/10
Sets a new state of the art on 30 of 32 evaluated benchmarks spanning text, code, and multimodal tasks. It is the first model to exceed human-expert performance on MMLU. Its natively multimodal architecture represents a significant shift from modular approaches.