ControlNet: A neural network structure that adds extra trainable layers to a pre-trained diffusion model to enable conditional control (e.g., via edge maps) without retraining the backbone
ST-ReFL: Spatio-Temporal Reward Feedback Learning—an algorithm proposed in this paper that optimizes the diffusion model using gradients from reward models scoring video quality and motion consistency
Optical flow: The pattern of apparent motion of objects, surfaces, and edges in a visual scene caused by the relative motion between an observer and a scene
Pixel residual: The difference in pixel values between consecutive video frames, used to identify static vs. moving regions
T2I-I2V: Text-to-Image followed by Image-to-Video, a two-stage inference pipeline in which an initial image is first generated from the text prompt and then used as the condition for generating the subsequent video frames
Motion prior: Information derived from a source video (like flow or residuals) used to initialize the noise latents, ensuring they follow a realistic motion trajectory
MUSIQ: Multi-scale Image Quality Transformer—a metric/model used to evaluate the technical quality of images (sharpness, exposure, etc.)
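
To make the "Pixel residual" entry concrete, here is a minimal sketch in plain Python. It assumes grayscale frames stored as nested lists of 0-255 intensities, an absolute (unsigned) difference, and a hand-picked threshold for separating static from moving pixels. The function names `pixel_residual` and `moving_mask` and the threshold value are illustrative assumptions, not taken from the paper, which may define or use the residual differently.

```python
from typing import List

# Assumption: a grayscale frame is a list of rows of 0-255 intensities.
Frame = List[List[int]]


def pixel_residual(prev: Frame, curr: Frame) -> Frame:
    """Absolute per-pixel difference between two consecutive frames."""
    return [
        [abs(c - p) for p, c in zip(prev_row, curr_row)]
        for prev_row, curr_row in zip(prev, curr)
    ]


def moving_mask(residual: Frame, threshold: int = 10) -> List[List[bool]]:
    """Mark pixels whose residual exceeds a (hypothetical) threshold as moving."""
    return [[value > threshold for value in row] for row in residual]


# Toy example: a single bright pixel shifts one column to the right.
frame_a = [[0, 0, 0], [0, 200, 0], [0, 0, 0]]
frame_b = [[0, 0, 0], [0, 0, 200], [0, 0, 0]]

residual = pixel_residual(frame_a, frame_b)
mask = moving_mask(residual)
```

In this toy case the residual is nonzero only at the pixel the object left and the pixel it arrived at, so the mask flags those two positions as moving and everything else as static.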