DiT: Diffusion Transformer—a type of diffusion model that uses Transformer architecture instead of the traditional U-Net for denoising
VAE: Variational Autoencoder—a neural network that compresses images into a smaller latent space (tokens) for efficient processing
VLM: Vision Language Model—a model that can understand and generate content based on both visual and textual inputs
RLHF: Reinforcement Learning from Human Feedback—a training method that fine-tunes models based on human preferences to align outputs with user intent
SFT: Supervised Fine-Tuning—training the model on high-quality labeled datasets to improve specific capabilities like artistic style or instruction following
ADP: Adversarial Distillation Post-training—a method to initialize the model for fast sampling by using a hybrid discriminator
ADM: Adversarial Distribution Matching—a fine-tuning step using a learnable diffusion-based discriminator to match complex data distributions for high-quality few-step generation
NFE: Number of Function Evaluations—the number of times the model must run its neural network to generate a single image; lower is faster
Quantization: Reducing the numerical precision of a model's weights and activations (e.g., from 16-bit to 4-bit) to speed up computation and reduce memory usage
Speculative Decoding: An acceleration technique where a smaller 'draft' model proposes several tokens ahead and the larger model verifies them, accepting the ones it agrees with, which speeds up generation
CT: Continuing Training—an intermediate training stage to broaden foundational knowledge before fine-tuning
GSB: A comparative quality metric used with MagicBench; the acronym is not expanded in the text, but in related evaluation work it usually stands for Good-Same-Bad, a pairwise human judgment of whether one output is better than, equal to, or worse than another
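
To make the Quantization entry above concrete, here is a minimal sketch of symmetric per-tensor quantization in plain Python. It is an illustration only, not the scheme used by the model described in this glossary; the function names `quantize_symmetric` and `dequantize` and the sample weights are hypothetical.

```python
def quantize_symmetric(values, bits=4):
    """Map floats to signed integer codes using one shared scale.

    Assumption: symmetric, per-tensor quantization, which is only one of
    several common schemes.
    """
    qmax = 2 ** (bits - 1) - 1               # e.g. 7 for 4-bit signed codes
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest code and clamp into [-qmax, qmax].
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from integer codes."""
    return [c * scale for c in codes]


# Hypothetical weights, for illustration only.
weights = [1.0, -0.5, 0.25, 0.0]
codes, scale = quantize_symmetric(weights, bits=4)
restored = dequantize(codes, scale)
```

Each restored value differs from the original by at most half of `scale`, which is the precision given up in exchange for storing small integers instead of 16-bit floats.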