Rectified Flows: A generative model class that learns straight paths between noise and data distributions, often allowing for fewer sampling steps than standard diffusion
VAE: Variational Autoencoder, a neural network whose encoder compresses data into a lower-dimensional latent space and whose decoder reconstructs data from it
FAD: Fréchet Audio Distance—a metric for evaluating the quality of generated audio by comparing its statistics to real audio
HiFiGen: A specific VAE and vocoder architecture used for high-fidelity audio generation
QFormer: Querying Transformer—a module that converts variable-length embeddings (like text from T5) into fixed-length latent representations
SFT: Supervised Fine-Tuning, further training of a pretrained model on labeled data
SD3: Stable Diffusion 3, a text-to-image generation model from Stability AI trained with a rectified flow objective
ImageReward: A reward model trained on human preference data and used as a metric for text-to-image generation quality, reported to align more closely with human judgment than FID (Fréchet Inception Distance)
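
Example (illustrative, not taken from the source): the "straight paths" in the Rectified Flows entry can be sketched in a few lines. This is a minimal sketch under stated assumptions: time runs from t=0 (noise) to t=1 (data), the function names are hypothetical, and a known constant velocity stands in for the neural network that would normally be trained to predict it.

```python
# Illustrative sketch only. A real rectified flow trains a network to
# predict the velocity; here the exact velocity of one noise/data pair
# is used so the geometry of the straight path is easy to see.

def interpolate(x0, x1, t):
    """Point at time t on the straight line from noise x0 (t=0) to data x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """Time derivative of the interpolation; the regression target in training."""
    return [b - a for a, b in zip(x0, x1)]

def euler_sample(x0, velocity_fn, num_steps):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

noise = [0.5, -1.0]   # hypothetical noise sample
data = [2.0, 3.0]     # hypothetical data sample

v = target_velocity(noise, data)          # constant along a straight path
midpoint = interpolate(noise, data, 0.5)  # halfway between noise and data

# With a constant velocity, one Euler step and ten Euler steps end at the
# same point, which is why straighter paths tolerate fewer sampling steps.
one_step = euler_sample(noise, lambda x, t: v, num_steps=1)
ten_steps = euler_sample(noise, lambda x, t: v, num_steps=10)
```

In practice the learned velocity field is only approximately constant along each path, so a few steps are still typically used rather than exactly one.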