DiT: Diffusion Transformer—a diffusion model architecture based on Transformers rather than the traditional U-Net
MSRVTT-Personalization: A new benchmark proposed in this paper for evaluating multi-subject video personalization, derived from the MSR-VTT dataset
copy-and-paste effect: A failure mode where the model simply replicates the reference image pixels in the output video instead of generating new poses or lighting
Rectified Flow: A generative modeling framework used here for training the denoising network, connecting noise and data distributions with straight paths
RoPE: Rotary Positional Embeddings—a method for encoding position information in Transformers that generalizes well to different sequence lengths
SAM: Segment Anything Model, a promptable segmentation model used in the data pipeline to produce the segmentation masks for masking out subjects and backgrounds
GroundingDINO: An open-set object detector used to locate subjects in training videos based on text descriptions
binding: The mechanism of explicitly associating the visual features of a reference image with the specific text token that names the subject shown in that image
open-set personalization: The ability to personalize concepts (objects, people) that were not seen during training, without requiring fine-tuning
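
The "straight paths" in the Rectified Flow entry can be made concrete with a small sketch. This is a generic illustration of the rectified flow objective, not the paper's training code: the helper names (`interpolate`, `target_velocity`, `flow_matching_loss`), the toy vectors, and the convention that t=0 is noise and t=1 is data are all assumptions made for illustration (some formulations reverse the time direction).

```python
import random

def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """Velocity along the straight path; constant in t: d x_t / dt = x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]

def flow_matching_loss(predict_velocity, x0, x1, t):
    """Mean squared error between predicted and target velocity at x_t."""
    x_t = interpolate(x0, x1, t)
    v_pred = predict_velocity(x_t, t)
    v_true = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_true)) / len(v_true)

rng = random.Random(0)
x1 = [0.5, -1.0, 2.0]                    # toy "data" sample (hypothetical values)
x0 = [rng.gauss(0.0, 1.0) for _ in x1]   # Gaussian noise sample
t = rng.random()                         # random time in [0, 1)

# A stand-in for the denoising network: an oracle that already knows the
# true velocity, so its loss is exactly zero.
def oracle(x_t, t):
    return target_velocity(x0, x1)

loss = flow_matching_loss(oracle, x0, x1, t)
```

In actual training the oracle would be replaced by the denoising network (here, the DiT), and the loss would be averaged over many noise samples, data samples, and time values.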