A Comprehensive Survey of AI-Generated Content (AIGC): A History of Generative AI from GAN to ChatGPT

📝 Paper Summary

Generative AI · AI-Generated Content (AIGC) · Multimodal Generation

This survey provides a comprehensive roadmap of AI-Generated Content, tracing the evolution from early GANs to modern Large Language Models and multimodal systems, classifying key techniques and applications.

Core Problem

The rapid emergence of diverse generative models (ChatGPT, DALL-E 2) has created a fragmented landscape, making it difficult to understand the underlying connections, historical evolution, and common foundations of AIGC.

Why it matters:

AIGC is reshaping industries like art, advertising, and education by automating high-quality content creation.
Understanding the shift from unimodal to multimodal generation is critical for future research directions.
Identifying open problems (like safety and reasoning) is necessary to guide the next phase of generative AI development.

Concrete Example: Without a unifying survey, the connection between seemingly unrelated lines of work, such as GANs in computer vision and Transformers in NLP, remains unclear, obscuring how they converged into modern multimodal models like CLIP or DALL-E 2.

Key Novelty

Unified AIGC Taxonomy

Classifies generative models into unimodal (text-to-text, image-to-image) and multimodal (cross-modal generation) categories.
Identifies the Transformer architecture as the point where the previously distinct paths of computer vision and natural language processing converged.
Highlights the role of Reinforcement Learning from Human Feedback (RLHF) in aligning generative outputs with human intent.

Architecture

Overview of AIGC workflow distinguishing Unimodal vs Multimodal models

Evaluation Highlights

Reviews the transition from small-scale models (GPT-2, 1.5B parameters) to large foundation models (GPT-3, 175B parameters), enabling better generalization.
Contrasts hardware generations, noting that NVIDIA A100 GPUs achieve 7x faster BERT-large inference than V100s.
Summarizes the shift in computer vision from GAN dominance to Diffusion models (e.g., DALL-E 2) for higher stability and resolution.

Breakthrough Assessment

8/10

A timely and extensive literature review that organizes the chaotic explosion of generative AI into a structured history and taxonomy, though it is a survey rather than a new method.

⚙️ Technical Details

Problem Definition

Setting: Generative tasks where a model learns a data distribution p(x) or conditional distribution p(y|x) to generate new content.

Inputs: Human instructions (prompts) which can be unimodal (text-only) or multimodal (text + image).

Outputs: Digital content satisfying the instruction, such as text, images, code, or audio.

Pipeline Flow

Input Instruction (Prompt)
Intent Extraction (Encoder)
Generation (Decoder)
Output Content

System Modules

Foundation Model

Extract intent and generate content based on vast pre-training

Model or implementation: Transformer-based (e.g., GPT-3, ViT, CLIP)

Reward Model

Evaluate generated content against human preference

Model or implementation: Learned Reward Function
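
The summary does not say how the reward function is learned. One common choice in published RLHF pipelines is a pairwise preference loss, where the reward model is trained to score the human-preferred response above the rejected one. The sketch below assumes that formulation; the function name and the example scores are hypothetical and are not taken from the survey.

```python
import math

def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry style loss: -log(sigmoid(reward_chosen - reward_rejected))."""
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch on the sign to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Hypothetical scores a reward model might assign to two candidate responses.
agrees_with_human = pairwise_preference_loss(reward_chosen=2.0, reward_rejected=0.0)
disagrees_with_human = pairwise_preference_loss(reward_chosen=0.0, reward_rejected=2.0)
print(agrees_with_human < disagrees_with_human)
```

The loss shrinks as the margin between the preferred and rejected scores grows, so minimizing it pushes the reward model toward the human ranking.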

Novel Architectural Elements

Convergence of CV and NLP architectures onto the Transformer backbone, enabling unified multimodal modeling.

Modeling

Base Model: Various (Survey covers GPT-3, DALL-E 2, Stable Diffusion, etc.)

Training Method: Reinforcement Learning from Human Feedback (RLHF) for dialogue models

Objective Functions:

Purpose: Predict the next token in a sequence.

Formally: Autoregressive Language Modeling (maximize P(w_t | w_{1:t-1})).

Purpose: Train a generator against a discriminator that tries to distinguish real images from fake ones.

Formally: GAN Minimax Loss (min_G max_D V(D, G)).

Purpose: Match image and text representations.

Formally: Contrastive Loss (e.g., CLIP's objective).
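
The three objectives above can be written out on toy numbers. This is a minimal sketch using their standard textbook forms, not code from the survey. All function names are illustrative, the inputs are assumed to be probabilities or similarity scores already produced by some model, and the contrastive loss assumes any temperature scaling has already been applied to the similarity matrix.

```python
import math

def autoregressive_nll(token_probs):
    """-sum_t log P(w_t | w_{1:t-1}); minimizing this maximizes the likelihood."""
    return -sum(math.log(p) for p in token_probs)

def gan_value(d_real, d_fake):
    """V(D, G) = mean(log D(x)) + mean(log(1 - D(G(z)))).
    d_real / d_fake are discriminator outputs in (0, 1)."""
    real_term = sum(math.log(d) for d in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - d) for d in d_fake) / len(d_fake)
    return real_term + fake_term

def log_softmax(row):
    m = max(row)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - log_sum_exp for v in row]

def contrastive_loss(sim):
    """Symmetric cross-entropy over an N x N image-text similarity matrix.
    Row i is image i, column j is text j; matched pairs lie on the diagonal."""
    n = len(sim)
    image_to_text = -sum(log_softmax(sim[i])[i] for i in range(n)) / n
    columns = [[sim[i][j] for i in range(n)] for j in range(n)]
    text_to_image = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (image_to_text + text_to_image)

print(autoregressive_nll([0.5, 0.5]))        # two tokens, each predicted with p = 0.5
print(gan_value([0.5, 0.5], [0.5, 0.5]))     # discriminator cannot tell real from fake
print(contrastive_loss([[10.0, 0.0], [0.0, 10.0]]))  # matched pairs score highest
```

When the discriminator outputs 0.5 everywhere, `gan_value` equals -log 4, the value at the theoretical GAN equilibrium.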

Adaptation: Fine-tuning, Prompting, or RLHF depending on the specific model variant.

Training Data:

WebText (38GB) for GPT-2
CommonCrawl (570GB filtered) for GPT-3

Compute: Not reported in the paper

Comparison to Prior Work

vs. RNNs: Transformers enable massive parallelization and long-context modeling.
vs. GANs: Diffusion models offer more stable training and better mode coverage (implied context; not explicitly cited in the paper).
vs. Unimodal models: Multimodal models (CLIP, DALL-E) leverage semantic knowledge from text to guide image generation.

Limitations

High computational cost for training foundation models (e.g., GPT-3).
Risk of generating biased, untruthful, or harmful content.
Lack of interpretability in large black-box models.
RLHF relies on expensive and potentially subjective human labeling.

Reproducibility

No replication artifacts mentioned in the paper (Survey paper).

📊 Experiments & Results

Evaluation Setup

Qualitative review and taxonomy construction; no single experimental benchmark.

Metrics: None reported (qualitative survey)

Statistical methodology: Not explicitly reported in the paper

Experiment Figures

A timeline history of Generative AI from 2014 to 2023 across NLP, CV, and Vision-Language (VL) domains.

Main Takeaways

The Transformer architecture has unified Generative AI, becoming the standard for both text (GPT) and vision (ViT).
Scale is a primary driver of performance: scaling from 1.5B (GPT-2) to 175B (GPT-3) parameters significantly improved generalization.
The field has shifted from unimodal generation (text-only or image-only) to complex multimodal interactions (text-to-image, vision-language).
RLHF is a critical component for aligning raw generative capabilities with helpfulness and safety in conversational agents.
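
To put the scale takeaway in concrete terms, the arithmetic below uses the two parameter counts from the summary. The 2 bytes per parameter (16-bit weights) is an assumption made for illustration; the survey does not report storage formats, and the helper name is hypothetical.

```python
GPT2_PARAMS = 1.5e9   # from the summary
GPT3_PARAMS = 175e9   # from the summary

def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Memory needed for the weights alone, in GB (10^9 bytes).
    bytes_per_param=2 assumes 16-bit weights (an illustrative assumption)."""
    return num_params * bytes_per_param / 1e9

scale_factor = GPT3_PARAMS / GPT2_PARAMS
print(f"GPT-3 has about {scale_factor:.0f}x the parameters of GPT-2")
print(f"GPT-2 weights at 16-bit: {weight_memory_gb(GPT2_PARAMS):.0f} GB")
print(f"GPT-3 weights at 16-bit: {weight_memory_gb(GPT3_PARAMS):.0f} GB")
```

Under that assumption the weights alone grow from roughly 3 GB to roughly 350 GB, before counting activations or optimizer state.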

📚 Prerequisite Knowledge

Prerequisites

Basic understanding of Deep Learning architectures (CNNs, RNNs)
Familiarity with Probability Distributions (for GANs/VAEs)
Concepts of Self-Attention and Transformers
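
For readers new to the last item, here is a minimal pure-Python sketch of scaled dot-product attention, softmax(QK^T / sqrt(d_k))V, on lists of row vectors. It is illustrative only: real Transformers add learned projections, multiple heads, and masking, and the function names here are not from the survey.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Each query attends over all keys and returns a weighted mix of the values."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors, one output dimension at a time.
        mixed = [sum(w * v[j] for w, v in zip(weights, values))
                 for j in range(len(values[0]))]
        outputs.append(mixed)
    return outputs

# With identical keys the weights are uniform, so the output is the mean of the values.
print(attention([[1.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]], [[1.0, 2.0], [3.0, 4.0]]))
```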

Key Terms

AIGC: AI-Generated Content—digital content (images, text, audio) created by AI models rather than human authors.

GAN: Generative Adversarial Network—a framework where a generator creates fake data and a discriminator tries to distinguish it from real data.

VAE: Variational Autoencoder—a generative model that learns a probabilistic latent space to reconstruct inputs.

RLHF: Reinforcement Learning from Human Feedback—fine-tuning method where a model optimizes a reward function derived from human preferences.

Transformer: A deep learning architecture based on self-attention mechanisms, serving as the backbone for modern LLMs and vision models.

Diffusion Model: A generative model that creates data by learning to reverse a gradual noise-addition process.

Zero-shot learning: The ability of a model to perform a task it wasn't explicitly trained on, often via prompting.

Multimodal: Involving multiple types of data (modalities) such as text and images simultaneously.

Autoregressive: A property of models that generate sequences one token at a time, based on previously generated tokens.
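
A toy sketch of autoregressive generation follows. The `BIGRAM` table is a hypothetical stand-in for a language model: it conditions only on the previous token, whereas models like GPT-3 condition on the whole prefix. The vocabulary and probabilities are invented for illustration.

```python
import random

# Hypothetical next-token distributions, keyed by the previous token.
BIGRAM = {
    "<s>": {"the": 1.0},
    "the": {"cat": 0.5, "dog": 0.5},
    "cat": {"sat": 1.0},
    "dog": {"sat": 1.0},
    "sat": {"</s>": 1.0},
}

def generate(max_tokens: int = 10, seed: int = 0) -> list:
    """Sample one token at a time, feeding each choice back in as context."""
    rng = random.Random(seed)
    tokens = ["<s>"]
    for _ in range(max_tokens):
        dist = BIGRAM[tokens[-1]]
        next_token = rng.choices(list(dist), weights=list(dist.values()))[0]
        if next_token == "</s>":  # end-of-sequence marker stops generation
            break
        tokens.append(next_token)
    return tokens[1:]  # drop the start marker

print(" ".join(generate()))
```

Each sampled token becomes part of the context for the next step, which is the defining property of the term above.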