Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning

📝 Paper Summary

Text-to-Image Generation, Multi-modal Language Modeling, Instruction Tuning

CM3Leon applies text-only language modeling recipes—retrieval-augmented pretraining and instruction tuning—to a decoder-only multi-modal architecture, achieving state-of-the-art image generation efficiency and performance.

Core Problem

Diffusion models dominate image generation but lack text-reasoning capabilities, while traditional autoregressive image models are computationally expensive to train and hard to control via complex instructions.

Why it matters:

Autoregressive models offer better global image coherence and, unlike diffusion models, can handle both text and image generation tasks within a single model
Current token-based models require massive compute (e.g., PARTI-20B) to match diffusion quality
Ethical concerns about image data sourcing are prevalent; this work shows that SOTA results are possible using only licensed (Shutterstock) data

Concrete Example: A user supplies a photo of a woman with the task prompt 'Edit the image following the text instruction' and the instruction 'Make her an alien'. Standard autoregressive models struggle to preserve the original structure while following the edit. CM3Leon, via instruction tuning, generates the edit while maintaining the original pose.

Key Novelty

CM3Leon (Chameleon)

First multi-modal model to successfully adapt the 'text-only' recipe of large-scale retrieval-augmented pretraining followed by multi-task supervised fine-tuning (SFT)
Introduces a self-contained Contrastive Decoding (CD-K) method for image generation that subtracts unconditional logits from conditional ones to improve quality without external classifiers
Shows that retrieval augmentation makes autoregressive models far more training-efficient (5x less training compute than comparable models)

Architecture

Conceptual flow of CM3Leon's capabilities including text-to-image and structure-guided editing.

Evaluation Highlights

Achieves zero-shot MS-COCO FID of 4.88 (CM3Leon-7B), setting a new state-of-the-art among autoregressive models
Outperforms Google's PARTI-20B (FID 7.23) while using 5x less training compute
Instruction-tuned model (SFT-CM3Leon-7B) achieves 61.6 CIDEr on MS-COCO Captioning (zero-shot), comparable to larger vision-language models

Breakthrough Assessment

9/10

Significantly shifts the paradigm by showing autoregressive models can beat diffusion in efficiency if trained like LLMs (retrieval + SFT), achieving SOTA with far less compute.

⚙️ Technical Details

Problem Definition

Setting: Multi-modal sequence modeling (text and image tokens) using a decoder-only architecture

Inputs: Interleaved sequences of text tokens and discrete image tokens (flattened 256x256 images)

Outputs: Next-token prediction (continuation of text or image tokens)

Pipeline Flow

Input Processing: Tokenize text and image (Gafni et al. tokenizer)
Retrieval: Use CLIP bi-encoder to find 2 relevant multi-modal documents from memory bank
Prepend: [Retrieved Doc 1] [Retrieved Doc 2] [Current Context]
Generation: CM3 Transformer predicts next tokens autoregressively
Decoding: Use Contrastive Decoding (CD-K) or Classifier-Free Guidance
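
The "Prepend" step above can be sketched as follows. This is a minimal illustration, not the paper's code: `build_context`, the `<doc_sep>` separator token, and the toy token lists are hypothetical names introduced here, and the 4096-token limit is taken from the sequence length listed under Key Hyperparameters.

```python
from typing import List

SEP = "<doc_sep>"  # hypothetical document separator, not taken from the paper


def build_context(retrieved: List[List[str]], current: List[str],
                  max_len: int = 4096) -> List[str]:
    """Lay out [Retrieved Doc 1] [Retrieved Doc 2] ... [Current Context]."""
    tokens: List[str] = []
    for doc in retrieved:
        tokens.extend(doc)
        tokens.append(SEP)
    tokens.extend(current)
    if len(tokens) > max_len:
        raise ValueError(f"context has {len(tokens)} tokens, limit is {max_len}")
    return tokens


# Toy documents: text tokens followed by made-up discrete image tokens.
doc1 = ["a", "red", "fox", "<img_17>", "<img_902>"]
doc2 = ["fox", "in", "snow", "<img_5>"]
prompt = ["a", "fox", "jumping"]
context = build_context([doc1, doc2], prompt)
```

The model then continues `context` token by token, so the retrieved documents act purely as extra conditioning in front of the prompt.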

System Modules

Image Tokenizer

Encodes 256x256 images into 1024 discrete tokens

Model or implementation: Gafni et al. (2022) tokenizer

Dense Retriever

Retrieve relevant image-text pairs from memory bank to augment context

Model or implementation: CLIP ViT-B-32 bi-encoder
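
A dense retriever of this kind ranks memory-bank entries by embedding similarity. The sketch below assumes cosine similarity over precomputed embeddings and uses toy 3-dimensional vectors in place of real CLIP outputs; `cosine`, `retrieve`, and the document ids are hypothetical, and the sketch covers relevance ranking only.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query: Sequence[float],
             bank: List[Tuple[str, Sequence[float]]],
             k: int = 2) -> List[str]:
    """Return the ids of the k bank entries most similar to the query."""
    ranked = sorted(bank, key=lambda item: cosine(query, item[1]), reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]


# Toy memory bank: (document id, embedding). Real embeddings would come from CLIP.
memory_bank = [
    ("doc_fox", [0.9, 0.1, 0.0]),
    ("doc_car", [0.0, 1.0, 0.0]),
    ("doc_wolf", [0.6, 0.3, 0.1]),
]
top2 = retrieve([1.0, 0.0, 0.1], memory_bank, k=2)
```

At realistic memory-bank sizes the exhaustive sort would normally be replaced by an approximate nearest-neighbour index; the ranking logic stays the same.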

CM3Leon Transformer

Generates/Infills text and image tokens

Model or implementation: Decoder-only Transformer (350M, 760M, or 7B parameters)

Novel Architectural Elements

Contrastive Decoding TopK (CD-K): A decoding-time modification that subtracts unconditional logits from conditional logits (similar to CFG but formulated as contrastive decoding)
Retrieval-Augmented CM3 training recipe scaled to 7B parameters with exclusively licensed data
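
To make the two decoding rules concrete, here is a small sketch over a toy 4-token vocabulary. `cfg_logits` is the usual classifier-free guidance interpolation. `cdk_scores` is an assumed form of CD-K: the summary only states that unconditional logits are subtracted from conditional ones, so the plausibility cutoff used here (keep tokens whose conditional probability is at least `alpha` times the k-th largest) is an assumption, and all function names are hypothetical.

```python
import math
from typing import List


def log_softmax(logits: List[float]) -> List[float]:
    """Numerically stable log-softmax."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def cfg_logits(cond: List[float], uncond: List[float], scale: float) -> List[float]:
    """Classifier-free guidance: start at uncond, move `scale` steps toward cond."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


def cdk_scores(cond: List[float], uncond: List[float],
               k: int, alpha: float) -> List[float]:
    """Contrastive score log p_cond - log p_uncond with a top-k cutoff (assumed form)."""
    lp_c = log_softmax(cond)
    lp_u = log_softmax(uncond)
    kth = sorted(lp_c, reverse=True)[k - 1]   # k-th largest conditional log-prob
    cutoff = math.log(alpha) + kth            # log(alpha * p_k)
    return [c - u if c >= cutoff else -math.inf for c, u in zip(lp_c, lp_u)]


cond = [2.0, 1.0, 0.0, -3.0]      # logits given the text prompt
uncond = [1.0, 1.5, 0.0, -3.0]    # logits with the prompt masked out
guided = cfg_logits(cond, uncond, scale=3.0)
scores = cdk_scores(cond, uncond, k=2, alpha=0.5)
best = max(range(len(scores)), key=scores.__getitem__)
```

Both rules reward tokens that the prompt makes more likely; the cutoff in `cdk_scores` keeps implausible tokens from winning just because the unconditional model rates them even lower.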

Modeling

Base Model: CM3Leon-7B (Decoder-only Transformer)

Training Method: Supervised Fine-Tuning (Instruction Tuning)

Objective Functions:

Purpose: Predict the next token in the sequence (standard language modeling).

Formally: -log p(x_input)

Purpose: Mask specific spans and move them to the end for infilling (CM3 objective).

Formally: Transforms the input x into a masked version and predicts the moved spans autoregressively.
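
The span-moving transformation can be illustrated with a short sketch. It assumes non-overlapping spans given as `[start, end)` index pairs; the `<mask:i>` sentinel format and the helper name `cm3_mask` are illustrative choices, not details confirmed by this summary.

```python
from typing import List, Tuple


def cm3_mask(tokens: List[str], spans: List[Tuple[int, int]]) -> List[str]:
    """Replace each [start, end) span with a sentinel and append the span at the end."""
    body: List[str] = []
    tail: List[str] = []
    cursor = 0
    for i, (start, end) in enumerate(sorted(spans)):
        sentinel = f"<mask:{i}>"
        body.extend(tokens[cursor:start])   # text before the span stays in place
        body.append(sentinel)               # the span is replaced by its sentinel
        tail.append(sentinel)               # the sentinel reappears at the end...
        tail.extend(tokens[start:end])      # ...followed by the removed tokens
        cursor = end
    body.extend(tokens[cursor:])
    return body + tail


doc = ["a", "photo", "of", "a", "red", "fox", "in", "snow"]
masked = cm3_mask(doc, [(4, 6)])
# masked == ["a", "photo", "of", "a", "<mask:0>", "in", "snow",
#            "<mask:0>", "red", "fox"]
```

Training with ordinary next-token prediction on the transformed sequence means the moved span is predicted with both its left and right context already visible, which is what enables infilling.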

Adaptation: Full fine-tuning on mixed tasks

Trainable Parameters: 7 Billion

Training Data:

Pretraining: Shutterstock (licensed image-text pairs). 3 billion text tokens.
SFT: Mixed dataset of 30B tokens including InstructPix2Pix, OCR, Object Detection, VQA, Captioning.

Key Hyperparameters:

sequence_length: 4096
batch_size: 8M tokens
learning_rate: 1.2e-4 (peak for 7B)
warmup_steps: 1500
weight_decay: Not explicitly reported in the paper
clip_grad_norm: Not explicitly reported in the paper

Compute: Pretraining: 2.4T tokens on 512 GPUs (A100s implied by Fig 2). SFT: 30B tokens on 128 A100s (80GB). Inference latency: 11.8s for 256x256 image (BF16) on 7B model.
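
A quick back-of-the-envelope check ties these figures together. The numbers come from this summary; reading "8M" as 8 x 10^6 tokens, assuming a square token grid, and assuming one image per document are simplifications made here.

```python
import math

IMAGE_SIDE = 256                     # pixels per image side
TOKENS_PER_IMAGE = 1024              # discrete tokens per image
SEQ_LEN = 4096                       # model context length
BATCH_TOKENS = 8_000_000             # "8M tokens", read as 8 * 10**6 (assumption)
PRETRAIN_TOKENS = 2_400_000_000_000  # 2.4T tokens

# Assuming a square grid, 1024 tokens form a 32 x 32 layout,
# so each token covers an 8 x 8 pixel patch.
grid_side = math.isqrt(TOKENS_PER_IMAGE)
pixels_per_token_side = IMAGE_SIDE // grid_side

# Two retrieved documents plus the current one, one image each (assumption),
# leave this many positions for all text tokens combined.
num_docs = 3
text_budget = SEQ_LEN - num_docs * TOKENS_PER_IMAGE

sequences_per_batch = BATCH_TOKENS // SEQ_LEN
optimizer_steps = PRETRAIN_TOKENS // BATCH_TOKENS

print(grid_side, pixels_per_token_side, text_budget,
      sequences_per_batch, optimizer_steps)
```

Under these assumptions the context holds three full images with about a thousand positions left for text, and pretraining corresponds to roughly 300,000 optimizer steps.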

Comparison to Prior Work

vs. PARTI: CM3Leon uses decoder-only (vs encoder-decoder) and retrieval augmentation, achieving better FID with 5x less compute
vs. Stable Diffusion: CM3Leon is autoregressive (token-based), allowing simpler text generation and infilling, vs diffusion's noise reversal
vs. RA-CM3: CM3Leon scales to 7B, uses only licensed data, and adds an Instruction Tuning (SFT) stage (the paper presents these as an evolution of RA-CM3 rather than listing them as explicit differences)

Limitations

Inference speed is slower than non-autoregressive methods (e.g., MUSE) due to sequential token generation (11.8s vs 0.5s)
Resolution limited to 256x256 in primary experiments (though tokenization allows scaling)
Relies on external retrieval index which adds complexity to the inference pipeline

Reproducibility

Code: https://github.com/facebookresearch/metaseq

Code repository linked (metaseq), but specific model weights and pre-processed Shutterstock data are not explicitly claimed to be released (likely due to licensing). SFT task templates are provided in Appendix.

📊 Experiments & Results

Evaluation Setup

Zero-shot Text-to-Image generation and Vision-Language tasks

Benchmarks:

MS-COCO (30K) (Zero-shot Text-to-Image Generation)
VizWiz (Visual Question Answering)
OKVQA (Visual Question Answering (Knowledge-based))

Metrics:

FID (Fréchet Inception Distance)
CIDEr (Image Captioning)
Accuracy (VQA)
Statistical methodology: Not explicitly reported in the paper
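
For reference, FID is conventionally computed as the Fréchet distance between two Gaussians fitted to Inception-network features of real and generated images. The symbols below are standard notation rather than anything defined in this summary: \mu_r, \Sigma_r are the mean and covariance of real-image features and \mu_g, \Sigma_g those of generated-image features.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\left( \Sigma_r \Sigma_g \right)^{1/2} \right)
```

The score is zero when the two feature distributions have identical means and covariances, which is why lower values in the results table indicate better generations.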

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Text-to-Image Generation Performance: CM3Leon sets new SOTA for autoregressive models on MS-COCO.
MS-COCO (30K)	FID	7.23	4.88	-2.35
MS-COCO (30K)	FID	5.25	4.88	-0.37
MS-COCO (30K)	FID	12.60	4.88	-7.72
Vision-Language Task Performance: SFT allows the model to perform captioning and VQA competently.
VizWiz	Accuracy (test-dev)	28.8	37.6	+8.8
VQA2	Accuracy (test-dev)	51.8	47.6	-4.2

Experiment Figures

Plot of FID score (y-axis, log scale) vs. Equivalent A100 GPU Hours (x-axis) for various models.

Ablation study of decoding strategies: Left shows CFG weight impact; Right shows Sample Count vs FID for TopP and CD-K.

Main Takeaways

Retrieval augmentation significantly improves data efficiency: CM3Leon matches or beats models trained on datasets orders of magnitude larger.
Instruction tuning (SFT) is highly effective for multi-modal models, enabling precise control (e.g., 'Edit the image...') that plain pre-trained models lack.
The proposed Contrastive Decoding (CD-K) strategy consistently improves generation quality compared to standard Classifier-Free Guidance alone (Figure 4).
Autoregressive models scale favorably: Figure 2 shows CM3Leon's FID improves steadily with compute, outperforming Diffusion scaling curves.

📚 Prerequisite Knowledge

Prerequisites

Transformer architecture (Decoder-only)
Vector Quantization (Image Tokenization)
Retrieval-Augmented Generation (RAG)
Contrastive Decoding

Key Terms

FID: Fréchet Inception Distance—a metric for image generation quality where lower scores indicate images are closer to real data distributions

CM3: Causal Masked Multi-Modal model—an architecture capable of causal masking (standard autoregressive) and masked infilling

SFT: Supervised Fine-Tuning—training a pre-trained model on specific task instructions (e.g., 'Edit this image') to improve controllability

Autoregressive: A generation method where the model predicts the next token based on all previous tokens in the sequence

CLIP: Contrastive Language-Image Pre-training—a model used here to encode query images/text for retrieving relevant documents

Contrastive Decoding: A decoding strategy that maximizes the difference between a 'strong' model (conditioned on text) and a 'weak' model (unconditional) to improve quality

CFG: Classifier-Free Guidance—a sampling technique that pushes generation towards a conditional prompt and away from a generic/unconditional prompt

Zero-shot: Evaluating a model on a benchmark or task it was not explicitly trained on (e.g., generating images for MS-COCO captions without training on MS-COCO data)

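CIDEr: Consensus-based Image Description Evaluation, a captioning metric that scores a generated caption by its TF-IDF-weighted n-gram similarity to a set of human reference captions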

Dense Retriever: A retrieval system that finds relevant documents using vector embeddings rather than keyword matching