Gemini: A Family of Highly Capable Multimodal Models

📝 Paper Summary

Multimodal Foundation Models, Large Language Models

Gemini is a family of multimodal models trained jointly on image, audio, video, and text that achieves state-of-the-art performance across 30 of 32 benchmarks, including human-expert level on MMLU.

Core Problem

Previous multimodal models were often trained on separate components and stitched together, limiting their ability to perform deep cross-modal reasoning or handle complex interleaved inputs effectively.

Why it matters:

Models tailored to single domains lack the generalist capabilities needed for complex real-world tasks involving mixed media (e.g., video, audio, and text simultaneously)
Stitched-together multimodal approaches often struggle with fine-grained understanding and reasoning that requires native integration of modalities from the start
Existing models had not reached human-expert performance on broad knowledge benchmarks such as MMLU

Concrete Example: A teacher draws a physics problem of a skier on a slope with messy handwriting. A standard text-only model cannot see it; a standard image-captioner misses the mathematical nuance. Gemini natively reads the handwriting, understands the physics problem, identifies a student's specific reasoning error, and outputs the correct solution in LaTeX.

Key Novelty

Natively Multimodal Joint Training

Models are trained from the start on a dataset containing interleaved text, images, audio, and video, rather than training a text model and grafting on vision encoders later
Outputs can be natively interleaved text and images (using discrete image tokens), allowing for diverse generative tasks beyond just text responses
Audio signals at 16kHz are ingested directly from Universal Speech Model (USM) features, preserving nuances that are lost when audio is first converted to text

Evaluation Highlights

Gemini Ultra achieves 90.04% on MMLU (Massive Multitask Language Understanding), becoming the first model to exceed the human-expert score of 89.8%
Achieves state-of-the-art on 30 of 32 benchmarks evaluated: 10/12 text and reasoning, 9/9 image understanding, 6/6 video understanding, and 5/5 speech recognition and speech translation benchmarks
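On the MMMU multimodal reasoning benchmark, Gemini Ultra scores 62.4% with majority voting over 32 samples (Maj1@32) and 59.4% pass@1, against 56.8% for the previous state-of-the-art (GPT-4V); the Maj1@32 score is the one that leads by over 5 percentage points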

Breakthrough Assessment

10/10

Sets new SOTA on nearly every major benchmark (text, code, multimodal). First to crack human-expert MMLU performance. Natively multimodal architecture represents a significant shift from modular approaches.

⚙️ Technical Details

Problem Definition

Setting: General-purpose multimodal modeling handling interleaved sequences of text, image, audio, and video inputs

Inputs: Interleaved sequences of text, natural images, charts, screenshots, PDFs, videos, and audio (16kHz)

Outputs: Interleaved text and images (via discrete image tokens)

Pipeline Flow

Multimodal Input Processing (Text, Image, Audio, Video)
Transformer Decoder (Joint Processing)
Interleaved Output Generation (Text + Image)
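
The hand-off from the first stage to the second can be sketched as follows. This is a minimal illustration, not the paper's implementation: `Segment`, `encode_segment`, `build_sequence`, and the per-item token counts in `STUB_TOKENS_PER_ITEM` are all hypothetical stand-ins. The only point it makes is that every modality ends up in one flat, order-preserving token sequence that must fit in the 32,768-token context.

```python
from dataclasses import dataclass
from typing import List, Tuple

CONTEXT_LENGTH = 32_768  # context length listed under Key Hyperparameters

# Hypothetical placeholder token counts per non-text item (NOT from the paper).
STUB_TOKENS_PER_ITEM = {"image": 4, "audio": 2, "video": 8}

Token = Tuple[str, object]  # (modality tag, token payload)


@dataclass(frozen=True)
class Segment:
    modality: str   # "text", "image", "audio", or "video"
    payload: object


def encode_segment(segment: Segment) -> List[Token]:
    """Stand-in for stage 1: turn one input segment into modality-tagged tokens."""
    if segment.modality == "text":
        # Whitespace split is a toy tokenizer used only for illustration.
        return [("text", word) for word in str(segment.payload).split()]
    if segment.modality in STUB_TOKENS_PER_ITEM:
        count = STUB_TOKENS_PER_ITEM[segment.modality]
        return [(segment.modality, index) for index in range(count)]
    raise ValueError(f"unsupported modality: {segment.modality!r}")


def build_sequence(segments: List[Segment],
                   context_length: int = CONTEXT_LENGTH) -> List[Token]:
    """Concatenate encoded segments into one sequence, preserving input order."""
    sequence: List[Token] = []
    for segment in segments:
        sequence.extend(encode_segment(segment))
    if len(sequence) > context_length:
        raise ValueError(f"{len(sequence)} tokens exceed context of {context_length}")
    return sequence


prompt = [
    Segment("text", "What is shown in"),
    Segment("image", b"<raw image bytes>"),
    Segment("text", "?"),
]
modalities = [tag for tag, _ in build_sequence(prompt)]
```

Here `modalities` interleaves text and image tags in the order the user supplied them, which is the property the decoder relies on when it processes the sequence jointly.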

System Modules

Visual Encoder (Input Processing)

Processes visual inputs (images, video frames); the visual encoding is inspired by Flamingo, CoCa, and PaLI

Model or implementation: Not explicitly specified (the paper cites Flamingo, CoCa, and PaLI as inspiration)

Audio Encoder (Input Processing)

Ingest audio signals directly to capture nuances lost in speech-to-text

Model or implementation: Universal Speech Model (USM)

Transformer Decoder

Process interleaved multimodal tokens in a single sequence

Model or implementation: Enhanced Transformer decoder with efficient attention (e.g., multi-query attention)
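
Multi-query attention can be illustrated with a small pure-Python sketch. It shows the generic technique only (several query heads sharing a single key/value head, with a causal mask); the head counts, dimensions, and function names are assumptions, and nothing here reflects the actual Gemini implementation.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def multi_query_attention(queries: List[List[Vector]],
                          keys: List[Vector],
                          values: List[Vector]) -> List[List[Vector]]:
    """Causal multi-query attention.

    queries: [num_heads][seq_len][head_dim]
    keys, values: [seq_len][head_dim], one K/V head shared by every query head.
    Returns [num_heads][seq_len][head_dim].
    """
    head_dim = len(keys[0])
    scale = 1.0 / math.sqrt(head_dim)
    outputs = []
    for head_queries in queries:
        head_output = []
        for t, query in enumerate(head_queries):
            # Causal mask: position t attends only to positions 0..t.
            scores = [dot(query, keys[s]) * scale for s in range(t + 1)]
            weights = softmax(scores)
            mixed = [sum(w * values[s][d] for s, w in enumerate(weights))
                     for d in range(head_dim)]
            head_output.append(mixed)
        outputs.append(head_output)
    return outputs


def kv_cache_entries(seq_len: int, head_dim: int, num_kv_heads: int) -> int:
    """Number of cached key and value scalars per layer for one sequence."""
    return 2 * seq_len * head_dim * num_kv_heads
```

The efficiency argument is visible in `kv_cache_entries`: with an assumed 16 query heads, sharing one K/V head stores 16 times fewer cache entries than standard multi-head attention at the same sequence length, which matters at a 32k context.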

Novel Architectural Elements

Native multimodality from the beginning: trained jointly across all modalities rather than grafting encoders onto a text LLM
Native image output generation using discrete image tokens
Direct ingestion of 16kHz audio features from USM rather than text transcription
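
To make the idea of discrete image tokens concrete, here is a generic vector-quantization sketch: each image patch embedding is mapped to the index of its nearest codebook vector, and those indices can then be emitted like words from a vocabulary. The summary does not say how Gemini's image tokenizer works, so the codebook, the distance measure, and the function names below are illustrative assumptions.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(patches: Sequence[Vector], codebook: Sequence[Vector]) -> List[int]:
    """Map each patch embedding to the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)),
            key=lambda index: squared_distance(patch, codebook[index]))
        for patch in patches
    ]


def dequantize(tokens: Sequence[int], codebook: Sequence[Vector]) -> List[List[float]]:
    """Look up the codebook vector for each discrete token (lossy reconstruction)."""
    return [list(codebook[token]) for token in tokens]


# Toy 2-D codebook with two entries; real codebooks are far larger (assumption).
toy_codebook = [[0.0, 0.0], [1.0, 1.0]]
toy_tokens = quantize([[0.1, 0.0], [0.9, 1.1]], toy_codebook)
```

In this toy case the first patch is closest to entry 0 and the second to entry 1, so the image is represented by the token sequence `[0, 1]`.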

Modeling

Base Model: Transformer decoders (Gemini Ultra, Pro, Nano-1, Nano-2)

Training Method: Joint multimodal pre-training, followed by Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF)

Training Data:

Web documents, books, code
Image, audio, and video data
Filtered for quality and safety; evaluation data removed (decontamination)

Key Hyperparameters:

context_length: 32,768 (32k)
model_sizes: Nano-1 (1.8B), Nano-2 (3.25B), Pro (unspecified), Ultra (unspecified)
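
As a rough sense of scale for the two disclosed model sizes, the weight storage can be estimated from the parameter count. This is back-of-the-envelope arithmetic only: the bit widths passed in are illustrative assumptions, and the estimate ignores activations, KV cache, and runtime overhead.

```python
# Parameter counts from the model_sizes entry above.
MODEL_PARAMS = {"Nano-1": 1_800_000_000, "Nano-2": 3_250_000_000}


def weight_gigabytes(num_params: int, bits_per_weight: int) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


for name, params in MODEL_PARAMS.items():
    for bits in (16, 4):  # assumed precisions, chosen only for illustration
        print(f"{name} at {bits}-bit: {weight_gigabytes(params, bits):.3f} GB")
```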

Compute: Trained on TPUv5e and TPUv4 accelerators (Ultra used a fleet of TPUv4 SuperPods)

📊 Experiments & Results

Evaluation Setup

Comprehensive suite of internal and external benchmarks covering text, code, image, audio, and video. Comparison against GPT-4, PaLM 2, and other SOTA models.

Benchmarks:

MMLU (General knowledge & reasoning, 57 subjects)
MMMU (College-level multimodal reasoning)
GSM8K (Grade-school math)
HumanEval (Python code generation)
MathVista (Mathematical reasoning on images)

Metrics:

Accuracy
Pass@1
BLEURT (for translation)
WER (Word Error Rate; not stated in this summary's source, but the standard metric for speech recognition)
Statistical methodology: Not explicitly reported in the paper
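
Two of the metrics above have compact standard definitions, sketched here for reference. These are the textbook formulations (word-level edit distance for WER, and the commonly used unbiased pass@k estimator, which reduces to the fraction of correct samples when k = 1); the paper's exact evaluation harness is not described in this summary, and the function names are my own.

```python
import math


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def pass_at_k(num_samples: int, num_correct: int, k: int) -> float:
    """Unbiased estimate of the chance that at least one of k samples passes."""
    if not 0 <= num_correct <= num_samples or not 1 <= k <= num_samples:
        raise ValueError("require 0 <= num_correct <= num_samples and 1 <= k <= num_samples")
    # math.comb returns 0 when k > num_samples - num_correct, giving 1.0.
    return 1.0 - math.comb(num_samples - num_correct, k) / math.comb(num_samples, k)
```

For example, a hypothesis with one substituted and one dropped word against a five-word reference has a WER of 2/5, and 3 passing samples out of 10 give a pass@1 of about 0.3.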

Key Results

Text and reasoning benchmarks show Gemini Ultra surpassing GPT-4 and reaching human-expert levels on MMLU.

Benchmark	Metric	Baseline (%)	This Paper (%)	Δ (pp)
MMLU	Accuracy	86.4	90.04	+3.64
GSM8K	Accuracy	92.0	94.4	+2.4
MATH	Accuracy	52.9	53.2	+0.3
HumanEval	Pass@1	67.0	74.4	+7.4

Multimodal benchmarks demonstrate significant leads in image and video understanding.

Benchmark	Metric	Baseline (%)	This Paper (%)	Δ (pp)
MMMU (val)	Pass@1	56.8	59.4	+2.6
TextVQA (val)	Accuracy	78.0	82.3	+4.3
MathVista (testmini)	Accuracy	49.9	53.0	+3.1
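
The Δ column is the absolute difference between the two score columns, in percentage points. A short check, with the rows copied from the table above, recomputes it; the variable and function names are arbitrary.

```python
# (benchmark, baseline score, this-paper score, reported delta), all in percent.
RESULTS = [
    ("MMLU", 86.4, 90.04, 3.64),
    ("GSM8K", 92.0, 94.4, 2.4),
    ("MATH", 52.9, 53.2, 0.3),
    ("HumanEval", 67.0, 74.4, 7.4),
    ("MMMU (val)", 56.8, 59.4, 2.6),
    ("TextVQA (val)", 78.0, 82.3, 4.3),
    ("MathVista (testmini)", 49.9, 53.0, 3.1),
]


def recomputed_delta(baseline: float, score: float) -> float:
    """Absolute gain in percentage points, rounded to the table's precision."""
    return round(score - baseline, 2)


mismatches = [
    name for name, baseline, score, reported in RESULTS
    if recomputed_delta(baseline, score) != reported
]
```

An empty `mismatches` list means every reported Δ agrees with the two score columns.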

Main Takeaways

Gemini Ultra is the first model to surpass human-expert performance (89.8%) on the MMLU benchmark with 90.04%.
The model demonstrates strong native multimodal capabilities, outperforming GPT-4V on all measured multimodal benchmarks (MMMU, TextVQA, DocVQA, etc.) without needing external OCR tools.
Nano models (1.8B and 3.25B) show strong performance for their size on certain reasoning and factuality tasks.
Multilingual performance is robust, with Gemini Ultra achieving best-in-class translation quality (avg BLEURT 74.8) on out-of-English WMT 23 tasks.

📚 Prerequisite Knowledge

Prerequisites

Transformer architecture (Decoder-only)
Multimodal learning (joint embedding spaces)
Reinforcement Learning from Human Feedback (RLHF)

Key Terms

MMLU: Massive Multitask Language Understanding—a benchmark covering 57 subjects like math, history, and law to test general knowledge and problem solving

MMMU: Massive Multi-discipline Multimodal Understanding—a benchmark requiring college-level subject knowledge to answer questions about images

Chain-of-Thought (CoT): A prompting technique where the model generates intermediate reasoning steps before the final answer

RLHF: Reinforcement Learning from Human Feedback—a method to align model behavior with human preferences using reward models

TPU: Tensor Processing Unit—Google's custom application-specific integrated circuit (ASIC) for machine learning

GSM8K: Grade School Math 8K—a dataset of high-quality linguistically diverse grade school math word problems

BLEURT: A learned evaluation metric for natural language generation (like translation) that correlates with human judgment

USM: Universal Speech Model—a family of speech models used here to encode audio features for Gemini

visual encoding: Converting visual data (images/video) into vector representations the model can process

discrete image tokens: Representing image parts as discrete codes from a vocabulary, allowing the model to generate images like it generates text words

context window: The amount of text/data a model can consider at one time (here, 32k tokens)