Lu Wang, Zhuoran Jin, Yupu Hao, Yubo Chen, Kang Liu, Yulong Ao, Jun Zhao
Institute of Automation, Chinese Academy of Sciences,
Beijing Academy of Artificial Intelligence (BAAI)
arXiv
(2026)
MM, Memory, Reasoning, Benchmark
📝 Paper Summary
Streaming Video Understanding, Multimodal Large Language Models (MLLMs)
Think While Watching enables continuous online video reasoning by maintaining persistent segment-level memory notes and decoupling visual perception from text generation to eliminate serialization bottlenecks.
Core Problem
Existing streaming MLLMs interleave perception and generation, causing 'memory erosion' where early context is lost during long interactions, and a 'serialization bottleneck' where text generation blocks video ingestion.
Why it matters:
Real-world assistants (live broadcasting, robotics) must answer questions instantly without pausing the video stream or forgetting early visual evidence.
Interleaved processing accumulates latency over time because the model stops 'watching' to 'think', making it unscalable for long-duration streams.
Naive streaming approaches suffer from catastrophic forgetting in multi-turn dialogues, failing to link current queries to much earlier events.
Concrete Example: In a magic show video, if a user asks about the first trick after 10 minutes of streaming, an interleaved model typically forgets the initial visual details or gets confused about who 'the first person' refers to because it only optimizes for immediate context.
Key Novelty
Think While Watching (TWW) Framework
Treats video as a sequence of segments, writing a concise textual 'memory note' for each segment that persists in a memory bank throughout the stream.
Decouples the visual input stream from the text output stream using independent positional encodings, allowing the model to ingest new frames while simultaneously generating answers.
Uses a specialized three-stage training strategy (single-round, multi-round, long-range) to teach the model to write informative notes and retrieve them later.
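The memory-note idea above can be pictured as an append-only store with one entry per segment. The sketch below is illustrative only: the class names, the `end_time` field, and the `[SEG i]` rendering are assumptions, not the paper's data format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MemoryNote:
    segment_index: int  # position of the segment in the stream
    end_time: float     # hypothetical field: time (seconds) at which the segment ended
    text: str           # concise textual summary of the segment


@dataclass
class MemoryBank:
    notes: List[MemoryNote] = field(default_factory=list)

    def write(self, end_time: float, text: str) -> MemoryNote:
        """Append a note for the next segment; notes are never evicted."""
        note = MemoryNote(len(self.notes), end_time, text)
        self.notes.append(note)
        return note

    def context(self) -> str:
        """Render all notes, oldest first, as text a model could condition on."""
        return "\n".join(f"[SEG {n.segment_index}] {n.text}" for n in self.notes)


bank = MemoryBank()
bank.write(10.0, "A magician shows an empty hat.")
bank.write(20.0, "A rabbit is pulled from the hat.")
print(bank.context())
```

Because nothing is evicted, a question about the first trick can still be answered from the note for segment 0, however long the stream runs.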
Architecture
Comparison between Interleaved processing and Think While Watching (TWW). Shows how TWW processes segments (SEG) to create Memory notes and handles Questions (Q) in parallel.
Evaluation Highlights
Maintains accuracy while reducing output tokens by 56% in multi-round settings compared to the Qwen3-VL-Thinking baseline on StreamingBench.
Improves single-round accuracy by 3.79 percentage points on OVO-Bench (55.02% vs 51.23%) using Qwen3-VL-4B.
Reduces Time-to-First-Token (TTFT) by 92.6% compared to offline batch processing (2304.28 vs 31203.69, with TTFT measured in tokens) while matching accuracy.
Breakthrough Assessment
8/10
Significantly addresses the latency and memory decay issues in streaming video LLMs. The decoupling of perception/generation and the explicit memory note mechanism provide a practical, efficient solution for long-form video understanding.
⚙️ Technical Details
Problem Definition
Setting: Online multi-turn video question answering where video segments arrive sequentially as S_1, ..., S_T
Inputs: Stream of video segments S_t and questions q_r arriving at arbitrary timestamps
Outputs: Answer a_r generated using only observed history (strict causality), plus continuous memory notes m_t for every segment
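The strict-causality constraint can be made concrete with a small helper that selects which segments an answer may use. This is a minimal sketch under one assumption: a segment counts as observed only once it has fully ended by the question's timestamp. The function name and the use of end times are illustrative, not taken from the paper.

```python
from bisect import bisect_right
from typing import List


def visible_segments(segment_end_times: List[float], question_time: float) -> List[int]:
    """Return indices of segments fully observed by `question_time`.

    `segment_end_times` must be sorted ascending; segment i is visible
    only if it ended at or before the moment the question arrives.
    """
    count = bisect_right(segment_end_times, question_time)
    return list(range(count))


ends = [10.0, 20.0, 30.0, 40.0]
print(visible_segments(ends, 25.0))  # segments 0 and 1 have finished by t = 25
```

A question arriving before the first segment ends sees no segments at all, which is the behavior strict causality requires.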
Pipeline Flow
Perception group: Video Segment Input -> Memory Note Generation
Processes continuous video segments and generates a concise textual summary (memory note) for each
Model or implementation: Qwen3-VL (shared backbone)
Streaming Mask (Interaction)
Enforces strict causality, ensuring answers only attend to past segments and generated notes
Model or implementation: Attention Mask Matrix
Decoupled Decoder (Interaction)
Generates answers using decoupled positional encodings to allow parallel watching and thinking
Model or implementation: Qwen3-VL (shared backbone)
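One way to picture decoupled positional encodings is as two independent position counters, so that emitting answer tokens never shifts the positions assigned to later video tokens. The sketch is a 1-D simplification (MRoPE itself uses temporal, height, and width components), and the class and its starting offsets are assumptions; the summary does not specify how TWW offsets the output stream.

```python
from typing import List


class DecoupledPositions:
    """Two independent position counters, one per stream (illustrative only)."""

    def __init__(self) -> None:
        self.next_input = 0   # next free position in the input (video/question) stream
        self.next_output = 0  # next free position in the output (answer) stream

    def assign_input(self, num_tokens: int) -> List[int]:
        start = self.next_input
        self.next_input += num_tokens
        return list(range(start, start + num_tokens))

    def assign_output(self, num_tokens: int) -> List[int]:
        start = self.next_output
        self.next_output += num_tokens
        return list(range(start, start + num_tokens))


pos = DecoupledPositions()
seg1 = pos.assign_input(4)    # first video segment
answer = pos.assign_output(3) # answer tokens generated in the meantime
seg2 = pos.assign_input(4)    # next segment continues where seg1 stopped
print(seg1, answer, seg2)
```

With a single shared counter, `seg2` would start at position 7 and its encoding would depend on how long the answer was; here it starts at 4 regardless.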
Novel Architectural Elements
Decoupled Streaming MRoPE: Independent positional encoding streams for input (video/questions) and output (answers) to enable parallel processing
Segment-Level Streaming Causal Mask: A specialized attention mask that allows access to all prior segments/notes but strictly blocks future information
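A segment-level causal mask can be sketched by tagging every token with the index of the latest segment it is allowed to see. The layout below is an assumption made for illustration: input tokens are packed first and answer tokens last, so a plain causal mask would leak later segments into the answer. The function name and tagging scheme are hypothetical, not the paper's implementation.

```python
from typing import List


def streaming_causal_mask(segment_ids: List[int]) -> List[List[bool]]:
    """mask[i][j] is True when token i may attend to token j.

    Token i sees token j only if j is not later in the sequence (j <= i)
    and j does not belong to a segment newer than i's segment tag.
    """
    n = len(segment_ids)
    return [
        [j <= i and segment_ids[j] <= segment_ids[i] for j in range(n)]
        for i in range(n)
    ]


# Six video tokens from segments 0, 1, 2, then two answer tokens for a
# question asked right after segment 0 (so they carry segment tag 0).
ids = [0, 0, 1, 1, 2, 2, 0, 0]
mask = streaming_causal_mask(ids)
for row in mask:
    print("".join("1" if allowed else "0" for allowed in row))
```

In this example the answer tokens attend to the two segment-0 tokens and to earlier answer tokens, but not to segments 1 and 2, which arrived after the question.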
Modeling
Base Model: Qwen3-VL-Instruct (2B, 4B, and 8B variants)
Training Method: Supervised Fine-Tuning (SFT) in three stages
Training Data:
Stage 1: 5,160 single-round instances from VideoChatOnline-IT (learns to write memory notes)
Stage 2: 2,752 multi-round dialogues from VideoChatOnline-IT (learns multi-turn consistency)
Stage 3: 1,500 long videos from YouTube (learns long-range memory and distractor robustness)
Compute: Not reported in the paper
Comparison to Prior Work
vs. VideoLLM-online: TWW decouples perception/generation via parallel position encoding, avoiding the serialization bottleneck.
vs. StreamChat: TWW uses explicit segment-level memory notes rather than relying solely on hidden states or raw token history.
vs. LiveCC [not cited in paper]: Similar focus on streaming, but TWW emphasizes the 'memory note' abstraction specifically for multi-turn reasoning.
Limitations
Performance degrades if the segment duration is too short (context fragmentation) or too long (latency increases).
Requires ground-truth timestamps for questions during evaluation.
Memory notes are textual approximations; extremely subtle visual details might be lost during the summarization process.
Code is publicly available on GitHub. The paper details the three-stage data construction process using GPT-4o but does not explicitly link to the generated dataset files.
📊 Experiments & Results
Evaluation Setup
Streaming video question answering under single-round and multi-round protocols.
Benchmarks:
StreamingBench: streaming video understanding (spatial, temporal, reasoning)
OVO-Bench: real-world online video understanding
Metrics:
Accuracy (%)
Avg Tokens (Output length)
Time To First Token (TTFT)
Statistical methodology: Not explicitly reported in the paper
Key Results
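| Benchmark      | Metric               | Baseline | This Paper | Δ         |
|----------------|----------------------|----------|------------|-----------|
| StreamingBench | Overall Accuracy (%) | 58.52    | 57.40      | -1.12     |
| StreamingBench | Avg Tokens           | 689.22   | 302.56     | -386.66   |
| OVO-Bench      | Overall Accuracy (%) | 50.70    | 51.80      | +1.10     |
| StreamingBench | TTFT (Tokens)        | 31203.69 | 2304.28    | -28899.41 |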
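The Key Results figures can be re-derived with a few lines of arithmetic, which also shows how the 56% and 92.6% reductions in Evaluation Highlights follow from the raw values. Accuracy gaps are differences in percentage points, while the token and TTFT reductions are relative changes; the helper name is illustrative.

```python
def relative_change(baseline: float, ours: float) -> float:
    """Signed relative change of `ours` with respect to `baseline`, in percent."""
    return (ours - baseline) / baseline * 100.0


# Relative reductions (percent of the baseline value).
token_change = relative_change(689.22, 302.56)
ttft_change = relative_change(31203.69, 2304.28)
print(f"{token_change:.1f} {ttft_change:.1f}")  # -56.1 -92.6

# Absolute accuracy gaps (percentage points).
streamingbench_gap = round(57.40 - 58.52, 2)  # -1.12
ovobench_gap = round(51.80 - 50.70, 2)        # 1.1
single_round_gap = round(55.02 - 51.23, 2)    # 3.79, the OVO-Bench single-round highlight
print(streamingbench_gap, ovobench_gap, single_round_gap)
```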
Main Takeaways
Naive streaming (interleaved) collapses in multi-turn settings (e.g., ~18% accuracy vs 57% for TWW), indicating that specialized memory mechanisms are necessary.
The 'Thinking' capability of Qwen3-VL can be effectively adapted to streaming via memory notes, maintaining high reasoning performance with fewer tokens.
Long-video training (Stage 3) is crucial for shifting attention from recent frames to long-term history, enabling the model to recall early events.
Decoupled inference architecture allows the model to process input and output in parallel, significantly reducing real-world latency compared to serial interleaved approaches.
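The latency argument can be illustrated with a toy queueing model. Everything here is hypothetical: segments arrive at a fixed interval, ingestion and generation have fixed costs in abstract time units, and the decoupled path is assumed to run generation without ever blocking ingestion. Real systems share compute between the two, so this shows only the qualitative effect, not the paper's measurements.

```python
from typing import List, Set


def ingestion_lag(
    num_segments: int,
    interval: float,
    ingest_cost: float,
    gen_cost: float,
    question_at: Set[int],
    decoupled: bool,
) -> List[float]:
    """Per-segment delay between a segment's arrival and the end of its ingestion."""
    lags = []
    free_at = 0.0  # time at which the perception path is next available
    for k in range(num_segments):
        arrival = k * interval
        start = max(arrival, free_at)
        done = start + ingest_cost
        lags.append(done - arrival)
        free_at = done
        if k in question_at and not decoupled:
            # Interleaved: answering blocks the perception path.
            free_at += gen_cost
    return lags


interleaved = ingestion_lag(6, 1.0, 0.5, 2.0, {1, 3}, decoupled=False)
parallel = ingestion_lag(6, 1.0, 0.5, 2.0, {1, 3}, decoupled=True)
print(interleaved)  # [0.5, 0.5, 2.0, 1.5, 3.0, 2.5]
print(parallel)     # [0.5, 0.5, 0.5, 0.5, 0.5, 0.5]
```

In the interleaved run each answered question pushes the backlog higher than the stream can recover before the next question, while the decoupled run keeps lag constant at the ingestion cost.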
📚 Prerequisite Knowledge
Prerequisites
Multimodal Large Language Models (MLLMs)
Transformer attention mechanisms (causal masking)
KV Caching for efficient inference
Key Terms
Memory Erosion: The tendency of models to forget early visual cues or context as the video stream and dialogue history grow longer
Serialization Bottleneck: Performance degradation caused when a model must pause video ingestion to generate text, leading to accumulating lag
MRoPE: Multimodal Rotary Positional Embeddings—a technique to encode position information for both text and image tokens
Chain-of-Thought (CoT): A prompting technique where the model generates intermediate reasoning steps before the final answer
KV Cache: Key-Value Cache—storing computed attention keys and values to speed up autoregressive generation
TTFT: Time To First Token—latency metric measuring the time from input receipt to the start of the response generation
Dual KV Cache: An engineering pattern maintaining separate caches for encoding (perception) and decoding (generation) to allow parallel execution