D2VLM: The proposed framework, Decoupled Learning for Temporally Grounded Video-Language Models.
FPO: Factorized Preference Optimization, an algorithm that extends DPO to explicitly optimize both the textual response and the probabilistic temporal grounding.
evidence token (<evi>): A special token that not only marks a temporal event but also explicitly aggregates visual features from salient video frames to serve as context.
interleaved text-evidence generation: Generating the final answer by mixing text tokens with evidence tokens that reference previously grounded events.
pure grounding: A preliminary generation stage in which the model outputs only temporal evidence tokens before generating the full textual answer.
sub-video event: A semantically meaningful segment of a video (e.g., a specific action instance) used as the unit for perturbation in data synthesis.
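DPO: Direct Preference Optimization, a stable method for aligning language models with preference data without training a separate reward model.
LoRA: Low-Rank Adaptation, a parameter-efficient fine-tuning technique that trains small low-rank update matrices while the pretrained weights stay frozen.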
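
To make the DPO entry concrete, the following is a minimal sketch of the standard DPO loss for a single preference pair. It illustrates plain DPO only, not FPO. The function name `dpo_loss`, its argument names, and the default `beta=0.1` are illustrative assumptions, and the inputs are assumed to be summed sequence log-probabilities under the trained policy and a frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair (illustrative sketch)."""
    # Log-ratio of policy to reference for each response.
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    # Scaled margin: positive when the policy favors the chosen response
    # more strongly than the reference model does.
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(margin)) written as softplus(-margin) for numerical stability.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

When the policy matches the reference model the margin is zero and the loss equals log 2; the loss falls as the policy shifts probability toward the chosen response relative to the reference.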