OSCBench: Benchmarking Object State Change in Text-to-Video Generation

📝 Paper Summary

Text-to-Video Generation Video Understanding Benchmarks

OSCBench is a benchmark designed to evaluate whether text-to-video models correctly render object state changes (e.g., slicing a lemon) across regular, novel, and compositional scenarios.

Core Problem

Current text-to-video (T2V) models generate high-quality visuals but often fail to faithfully realize the consequences of actions, specifically the transformation of an object from an initial to a target state.

Why it matters:

Existing benchmarks focus on physical plausibility or general alignment but overlook explicit Object State Change (OSC), which is critical for instructional video generation and embodied AI
Models may produce realistic motion patterns while failing to render the actual state transition (e.g., an object appearing intact after a 'chopping' action)
Correctly modeling OSC requires deep language-grounded reasoning to infer intended transformations, which current models struggle to generalize

Concrete Example: When prompted with 'slicing a lemon', a model might generate a video of a knife moving near a lemon, but the lemon remains whole (incorrect state change) or transforms implausibly.

Key Novelty

OSCBench (Object State Change Benchmark)

Constructs a dataset of instructional cooking scenarios abstracting actions and objects into categories to ensure diversity and avoid long-tail bias
Organizes evaluation into three difficulty regimes: Regular (common pairs), Novel (uncommon but feasible pairs), and Compositional (sequences of actions) to test generalization
Employs a Chain-of-Thought (CoT) evaluation strategy using Multimodal Large Language Models (MLLMs) to mimic fine-grained human reasoning about state evolution
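
The three regimes above can be pictured as a small data structure. The following is only an illustrative sketch: the `Regime` and `Scenario` names, the `apple` example, and the rule that only compositional scenarios chain several actions are assumptions made here, not details taken from the paper.

```python
from dataclasses import dataclass
from enum import Enum


class Regime(Enum):
    REGULAR = "regular"              # common action-object pairs
    NOVEL = "novel"                  # uncommon but feasible pairs
    COMPOSITIONAL = "compositional"  # sequences of actions


@dataclass(frozen=True)
class Scenario:
    actions: tuple[str, ...]  # one action, or several for compositional cases
    obj: str
    regime: Regime

    def __post_init__(self) -> None:
        if not self.actions:
            raise ValueError("a scenario needs at least one action")
        # Assumption: only compositional scenarios chain more than one action.
        if (len(self.actions) > 1) != (self.regime is Regime.COMPOSITIONAL):
            raise ValueError("action count does not match regime")


regular = Scenario(("slicing",), "lemon", Regime.REGULAR)
novel = Scenario(("peeling",), "berries", Regime.NOVEL)
# Hypothetical compositional example: peel first, then slice.
compositional = Scenario(("peeling", "slicing"), "apple", Regime.COMPOSITIONAL)
```

Keeping the regime as an explicit field makes it easy to report scores per regime later, which is where the generalization gap would show up.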

Architecture

The construction pipeline of OSCBench, illustrating the abstraction of raw data into categories and the generation of diverse scenarios.

Evaluation Highlights

Benchmark comprises 1,120 prompts across 140 distinct object-state scenarios
Includes 20 'Novel' scenarios specifically designed to test generalization to uncommon action-object pairs (e.g., peeling berries)
Evaluation covers six state-of-the-art models, including proprietary systems such as Kling-2.5-Turbo and Veo-3.1-Fast
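
As a quick sanity check on the figures above, the reported totals imply an average of eight prompts per scenario. The snippet below only does that arithmetic; whether every scenario has exactly eight prompts is not stated here, so the result should be read as an average.

```python
total_prompts = 1120     # reported benchmark size
total_scenarios = 140    # reported number of object-state scenarios
novel_scenarios = 20     # reported number of 'Novel' scenarios

# Average number of prompts per scenario (not necessarily uniform).
prompts_per_scenario = total_prompts / total_scenarios
# Fraction of scenarios that fall in the 'Novel' regime.
novel_share = novel_scenarios / total_scenarios

print(prompts_per_scenario)   # 8.0
print(round(novel_share, 3))  # 0.143
```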

Breakthrough Assessment

7/10

Addresses a specific, high-value failure mode in T2V generation (state changes). The categorization into regular/novel/compositional is methodologically sound, though the paper is primarily a benchmark contribution rather than a modeling breakthrough.

⚙️ Technical Details

Problem Definition

Setting: Evaluation of Text-to-Video generation models on specific action-consequence alignment

Inputs: Text prompt describing a subject, action, object, and scene (e.g., '<subject><action><object><scene>')

Outputs: Generated video depicting the object state change implied by the action
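
To make the input format concrete, here is a minimal sketch of filling the four prompt slots. The `build_prompt` helper and the example slot values are hypothetical; the benchmark's prompts are written by a language model rather than by plain concatenation, so this only illustrates the slot structure.

```python
def build_prompt(subject: str, action: str, obj: str, scene: str) -> str:
    """Join the four slots into one sentence-like prompt, skipping empty slots."""
    parts = [subject, action, obj, scene]
    return " ".join(p.strip() for p in parts if p.strip())


# Hypothetical slot values for illustration.
prompt = build_prompt("A person", "slicing", "a lemon", "in a kitchen")
print(prompt)  # A person slicing a lemon in a kitchen
```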

Pipeline Flow

Data Abstraction (Action/Object categorization via GPT-5.2/Gemini-3)
Scenario Construction (Regular, Novel, Compositional generation)
Video Generation (Target T2V Models)
Evaluation (Human + MLLM Chain-of-Thought)

System Modules

Data Abstraction (Benchmark Construction)

Abstract 20 action and 134 object elements from HowToChange into high-level categories to mitigate long-tail bias

Model or implementation: GPT-5.2 and Gemini-3 (assisted by human verification)
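
One way to see why category abstraction helps with long-tail bias is to sample the same number of items from each category instead of sampling raw labels by frequency. The sketch below assumes a hand-written mapping; the category names, labels, and helper functions are all hypothetical and do not reproduce the paper's actual taxonomy.

```python
import random
from collections import defaultdict

# Hypothetical mapping from raw object labels to high-level categories.
OBJECT_TO_CATEGORY = {
    "lemon": "fruit", "apple": "fruit", "berries": "fruit",
    "carrot": "vegetable", "onion": "vegetable",
    "butter": "dairy",
}


def group_by_category(mapping):
    """Invert a label -> category mapping into category -> list of labels."""
    groups = defaultdict(list)
    for label, category in mapping.items():
        groups[category].append(label)
    return dict(groups)


def sample_balanced(groups, per_category, seed=0):
    """Pick up to `per_category` labels from every category, reproducibly."""
    rng = random.Random(seed)
    picked = {}
    for category, labels in sorted(groups.items()):
        k = min(per_category, len(labels))
        picked[category] = rng.sample(labels, k)
    return picked


groups = group_by_category(OBJECT_TO_CATEGORY)
picked = sample_balanced(groups, per_category=1)
```

With per-category sampling, a rare category such as the single-item `dairy` group above is represented as often as a crowded one.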

Scenario Generator (Benchmark Construction)

Generate specific evaluation scenarios (Regular, Novel, Compositional)

Model or implementation: ChatGPT (filtering) + Human Review

Prompt Generator (Benchmark Construction)

Create natural language prompts for each action-object combination

Model or implementation: GPT-5.2

MLLM Evaluator

Automatically assess generated videos using Chain-of-Thought reasoning

Model or implementation: Unspecified MLLMs (the paper refers only to 'latest MLLMs'; GPT-4o or a similar model is a guess, not a confirmed detail)

Novel Architectural Elements

Three-tiered evaluation regime (Regular, Novel, Compositional) specifically for object state changes
Chain-of-Thought evaluation protocol for MLLMs that enforces 'Criteria grounding' before scoring to improve reliability
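
The idea of grounding the criteria before scoring can be sketched as a prompt template plus a score parser. Everything below is an assumption made for illustration: the criterion wording, the step structure, the 1 to 5 scale, and the `Final score:` output format are hypothetical, and the paper's actual evaluation prompts are the ones in its appendix.

```python
import re
from typing import Optional

# Hypothetical criterion descriptions for the two OSC sub-dimensions.
CRITERIA = {
    "accuracy": "Does the object reach the target state implied by the action?",
    "consistency": "Does the state evolve coherently, without reverting or jumping?",
}


def build_cot_prompt(text_prompt: str, criterion: str) -> str:
    """Ask the evaluator to restate the criterion before judging the video."""
    return (
        f"The video was generated from the prompt: '{text_prompt}'.\n"
        f"Criterion: {CRITERIA[criterion]}\n"
        "Step 1: Restate the criterion and what would count as success or failure.\n"
        "Step 2: Describe the object's state at the start, middle, and end of the video.\n"
        "Step 3: Compare your observations against the criterion.\n"
        "Finish with a line of the form 'Final score: N', where N is 1 to 5."
    )


def parse_score(response: str) -> Optional[int]:
    """Extract the integer score, or None if the response has no valid score line."""
    match = re.search(r"Final score:\s*([1-5])\b", response)
    return int(match.group(1)) if match else None


cot_prompt = build_cot_prompt("slicing a lemon", "accuracy")
```

Returning `None` for malformed responses, rather than a default score, keeps unparseable evaluator outputs from silently skewing averages.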

Comparison to Prior Work

vs. VBench: OSCBench focuses strictly on action-induced object state changes rather than general visual quality
vs. PhyWorldBench: OSCBench evaluates specific object transformations (e.g., sliced) rather than general physics laws (e.g., gravity)
vs. HowToChange: OSCBench abstracts the raw data into balanced categories and introduces 'Novel' and 'Compositional' settings for stress-testing generalization

Limitations

Evaluation relies on MLLMs which may have their own visual understanding biases
Focuses primarily on cooking scenarios, potentially limiting applicability to other domains
Human evaluation is costly and done on a sampled subset (140 videos per model) rather than the full set

Reproducibility

Code: https://hanxjing.github.io/OSCBench

Benchmark data and code are available at the link above. The paper constructs the benchmark using GPT-5.2 and Gemini-3. The specific prompts for the MLLM evaluation are said to be provided in Appendix C (referenced in the main text).

📊 Experiments & Results

Evaluation Setup

Six T2V models are benchmarked on 1,120 prompts derived from cooking tasks, and the generated videos are evaluated by both humans and MLLMs.

Benchmarks:

OSCBench (Text-to-Video Generation) [New]

Metrics:

Semantic Adherence (Subject, Object, Action alignment)
Object State Change (Accuracy, Consistency)
Scene Alignment
Perceptual Quality (Realism, Aesthetics)
Statistical methodology: Not explicitly reported in the paper
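
Since the aggregation procedure is not reported, the following is only one plausible way to summarize per-video ratings by regime and metric dimension. The ratings are made-up placeholder values used to exercise the code, not results from the paper, and the 1 to 5 scale and dimension names are assumptions.

```python
from collections import defaultdict
from statistics import mean

# Placeholder ratings (NOT paper results): (regime, dimension, score on an assumed 1-5 scale).
ratings = [
    ("regular", "osc_accuracy", 4),
    ("regular", "osc_accuracy", 3),
    ("novel", "osc_accuracy", 2),
    ("novel", "osc_consistency", 3),
    ("compositional", "osc_accuracy", 1),
]


def aggregate(rows):
    """Average scores for every (regime, dimension) pair."""
    buckets = defaultdict(list)
    for regime, dimension, score in rows:
        buckets[(regime, dimension)].append(score)
    return {key: mean(scores) for key, scores in buckets.items()}


summary = aggregate(ratings)
for (regime, dimension), value in sorted(summary.items()):
    print(f"{regime:14s} {dimension:16s} {value:.2f}")
```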

Key Results

The provided paper text is truncated before the quantitative results table, so the entries below are dataset statistics reported in the methodology sections, not model scores. Baseline and Δ do not apply to them.

Benchmark	Metric	Baseline	This Paper	Δ
OSCBench	Total Prompts	n/a	1,120	n/a
OSCBench	Total Scenarios	n/a	140	n/a
OSCBench	Average Prompt Length (words)	n/a	9.2	n/a

Experiment Figures

Comparison of success and failure cases in T2V generation regarding Object State Change.

Main Takeaways

Current T2V models (including Sora, Hunyuan, Kling) struggle significantly with Object State Change (OSC) despite high performance on static semantic alignment.
Models often fail to maintain temporal consistency of the state change, with objects sometimes reverting state or changing abruptly without cause.
Novel and Compositional scenarios present a much higher challenge than Regular scenarios, indicating poor generalization of action-consequence reasoning.
The Chain-of-Thought (CoT) strategy for MLLM evaluation is proposed to align better with fine-grained human judgment on state changes.
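
Agreement between an automatic evaluator and human raters is commonly quantified with a rank correlation. The sketch below computes Spearman's correlation in plain Python; the choice of Spearman is an assumption (the summary does not say which agreement measure the paper uses), and the score lists are toy values, not data from the paper.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (raises) if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy scores for five videos (illustrative only).
human_scores = [5, 3, 4, 1, 2]
mllm_scores = [4, 3, 5, 1, 2]
rho = spearman(human_scores, mllm_scores)
```

A higher `rho` for CoT-prompted scores than for direct scores would be one way to support the claim of better alignment with human judgment.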

📚 Prerequisite Knowledge

Prerequisites

Text-to-Video (T2V) generation basics
Multimodal Large Language Models (MLLMs) for evaluation
Chain-of-Thought (CoT) prompting

Key Terms

OSC: Object State Change—the transformation of an object's physical state induced by an action (e.g., whole -> sliced)

T2V: Text-to-Video—generative models that create video content from textual descriptions

MLLM: Multimodal Large Language Model—AI models capable of processing and reasoning over both text and image/video inputs

CoT: Chain-of-Thought—a prompting strategy that encourages models to articulate intermediate reasoning steps before giving a final answer

HowToChange: A dataset derived from instructional cooking videos in HowTo100M, used as the foundation for OSCBench

Compositional Scenario: Evaluation settings involving multiple sequential actions (e.g., peeling then slicing) to test temporal consistency