
Automated Movie Generation via Multi-Agent CoT Planning

Weijia Wu, Zeyu Zhu, Mike Zheng Shou
National University of Singapore
arXiv.org (2025)
Agent MM Reasoning

📝 Paper Summary

Long-form video generation, Multi-agent systems, Automated filmmaking
MovieAgent is a multi-agent framework that automates long-form movie generation by simulating a human film crew (director, screenwriter, etc.) to hierarchically plan scripts, scenes, and shots with consistent characters and audio.
Core Problem
Existing video generation models focus on short clips and lack high-level planning, resulting in long-form videos with incoherent narratives, inconsistent characters, and no logical scene structure.
Why it matters:
  • Manual movie production costs millions of dollars and can take years, whereas AI automation promises to cut the cost to near zero.
  • Current state-of-the-art models like Sora generate high-quality short clips but fail to maintain narrative coherence or character consistency over longer durations.
  • Previous long-video attempts lack the hierarchical reasoning of real filmmaking, failing to handle complex multi-scene structures and synchronized audio.
Concrete Example: Current models might generate a 5-second clip of a person walking, but if asked to generate a 5-minute story about that person, the character's face would change between shots, the audio would desynchronize, and the plot would wander illogically.
Key Novelty
Hierarchical Multi-Agent CoT Planning for Filmmaking
  • Simulates a professional film crew by assigning specific roles (Director, Scene Planner, Shot Planner) to different AI agents that work collaboratively.
  • Uses Chain-of-Thought (CoT) reasoning to break down abstract scripts into concrete sub-scripts, scene descriptions, and precise shot parameters (camera angle, lighting).
  • Decouples the generation process into planning (script/scene/shot) and execution (video/audio synthesis), ensuring logical flow before pixel generation.
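The plan-then-execute split described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `Shot`, `Scene`, `MoviePlan`, `plan_movie`, and `execute` are hypothetical, and the "agents" are plain callables standing in for what would presumably be LLM calls with chain-of-thought prompts. The toy agents at the bottom exist only to make the sketch runnable.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Shot:
    description: str
    camera_angle: str  # e.g. "wide", "close-up"
    lighting: str      # e.g. "soft daylight"


@dataclass
class Scene:
    description: str
    shots: List[Shot] = field(default_factory=list)


@dataclass
class MoviePlan:
    synopsis: str
    scenes: List[Scene] = field(default_factory=list)


# Hypothetical agent interfaces: each role is modelled as a callable.
Director = Callable[[str], List[str]]      # script -> sub-scripts
ScenePlanner = Callable[[str], List[str]]  # sub-script -> scene descriptions
ShotPlanner = Callable[[str], List[Shot]]  # scene description -> shots


def plan_movie(script: str, director: Director,
               scene_planner: ScenePlanner,
               shot_planner: ShotPlanner) -> MoviePlan:
    """Planning stage: refine script -> sub-scripts -> scenes -> shots."""
    plan = MoviePlan(synopsis=script)
    for sub_script in director(script):
        for scene_text in scene_planner(sub_script):
            plan.scenes.append(Scene(scene_text, shot_planner(scene_text)))
    return plan


def execute(plan: MoviePlan, render_shot: Callable[[Shot], str]) -> List[str]:
    """Execution stage: runs only after the full plan exists."""
    return [render_shot(shot) for scene in plan.scenes for shot in scene.shots]


# Toy stand-ins for the agents and the video generator (assumptions).
def toy_director(script: str) -> List[str]:
    return [part.strip() for part in script.split(".") if part.strip()]


def toy_scene_planner(sub_script: str) -> List[str]:
    return [f"Scene: {sub_script}"]


def toy_shot_planner(scene_text: str) -> List[Shot]:
    return [
        Shot(f"{scene_text} (establishing)", "wide", "soft daylight"),
        Shot(f"{scene_text} (detail)", "close-up", "soft daylight"),
    ]


def toy_renderer(shot: Shot) -> str:
    return f"clip[{shot.camera_angle}] {shot.description}"


if __name__ == "__main__":
    plan = plan_movie("A hero leaves home. She finds a map.",
                      toy_director, toy_scene_planner, toy_shot_planner)
    for clip in execute(plan, toy_renderer):
        print(clip)
```

Keeping `execute` separate from `plan_movie` mirrors the decoupling in the last bullet: the whole shot list can be inspected or revised before any rendering call is made.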
Architecture
Figure 2: The overall framework of MovieAgent, illustrating the hierarchical flow from script to video.
Evaluation Highlights
  • Achieves state-of-the-art results in script faithfulness, character consistency, and narrative coherence compared to existing frameworks like StoryAgent and DreamFactory.
  • Reduces production cost to near zero, compared with the millions of dollars that traditional filmmaking requires.
  • Demonstrates capability to generate multi-scene, multi-shot videos with synchronized subtitles and stable audio, addressing a major gap in current video generation.
Breakthrough Assessment
8/10
While dependent on underlying video generation models, the hierarchical multi-agent framework significantly advances long-form coherence and structure, moving beyond simple clip concatenation toward actual storytelling.