Cocoa: Co-Planning and Co-Execution with AI Agents

📝 Paper Summary

Multi-turn with user interactions
Multi-task planning

Cocoa is a system that enables users to flexibly delegate tasks and interleave planning and execution with AI agents within a document editor, improving steerability in complex research workflows.

Core Problem

Current AI agent systems either rigidly separate planning from execution or force users into reactive error-correction roles, lacking flexibility for users to proactively guide the agent or adapt plans based on intermediate results.

Why it matters:

Rigid separation of planning and execution leads to wasted effort if initial plans are flawed
Reactive correction (fixing agent errors after they happen) is cognitively demanding and inefficient
Scientific research requires tacit knowledge and iterative refinement that fully autonomous agents currently lack

Concrete Example: A researcher planning a literature review might want to see initial search results for a specific query before deciding whether to broaden the search or dive deeper into a specific sub-topic. In current systems, they must wait for the full execution or manually interrupt the agent, whereas Cocoa allows them to execute the first step, inspect the output, and then modify the subsequent plan steps immediately.

Key Novelty

Interleaved Co-Planning and Co-Execution

Introduces a computational notebook-like interface within a text document where agent plans are represented as interactive cells
Allows explicit delegation of agency: users can assign specific steps to themselves or the agent
Enables fluid transition between planning and execution: users can execute a step, pause to edit future steps based on the output, and resume
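
The three ideas above (plan steps as cells, per-step delegation, and editing future steps after seeing an output) can be pictured with a small data model. This is a minimal sketch, not Cocoa's implementation: the names `PlanStep`, `Plan`, `Assignee`, `reassign`, and `replace_remaining` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Assignee(Enum):
    USER = "user"
    AGENT = "agent"


class Status(Enum):
    PENDING = "pending"
    DONE = "done"


@dataclass
class PlanStep:
    """One cell of the plan (hypothetical structure, not from the paper)."""
    description: str
    assignee: Assignee = Assignee.AGENT
    status: Status = Status.PENDING
    output: Optional[str] = None


@dataclass
class Plan:
    steps: List[PlanStep] = field(default_factory=list)

    def reassign(self, index: int, assignee: Assignee) -> None:
        """Explicit delegation: hand a single step to the user or the agent."""
        self.steps[index].assignee = assignee

    def next_pending(self) -> Optional[int]:
        """Index of the first step not yet executed, or None if all are done."""
        for i, step in enumerate(self.steps):
            if step.status is Status.PENDING:
                return i
        return None

    def replace_remaining(self, new_steps: List[PlanStep]) -> None:
        """Keep completed steps and swap out everything not yet executed."""
        done = [s for s in self.steps if s.status is Status.DONE]
        self.steps = done + new_steps
```

Under these assumptions, "execute a step, pause to edit future steps, and resume" amounts to marking one step `DONE`, calling `replace_remaining` with the revised steps, and continuing from `next_pending()`.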

Evaluation Highlights

In a lab study (n=16), participants successfully used Cocoa to steer agents in research tasks, and it balanced control with ease of use when compared against chat baselines
Field deployment (n=7, 1 week) showed researchers valued explicit delegation, using self-assigned steps to inject expert knowledge into the workflow
Qualitative feedback indicated that interleaved planning/execution allowed users to catch agent errors early and refine directions without restarting

Breakthrough Assessment

7/10

Significant contribution to human-agent interaction design by successfully adapting the notebook paradigm to general agentic workflows. While not an algorithmic breakthrough, it offers a strong, validated interaction model for steerability.

⚙️ Technical Details

Problem Definition

Setting: Human-AI collaborative workflow for complex, open-ended tasks (specifically scientific research)

Inputs: User's high-level research intent and ongoing document context

Outputs: Executed research actions (e.g., literature search summaries) and updated plans

Pipeline Flow

User invokes agent in document
Agent proposes interactive plan (list of steps)
User reviews/edits plan and assigns steps (Self vs. Agent)
Execution loop (Agent executes its steps, pauses for User steps)
User refines outputs or modifies remaining plan based on results
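
The flow above can be sketched as a loop that runs agent-assigned steps and stops as soon as it reaches a step assigned to the user. This is an illustrative sketch only: the paper summary does not report how the execution engine is implemented, and `run_until_pause`, the dictionary keys, and the stub `agent` callable are all assumptions.

```python
from typing import Callable, Dict, List, Optional

# A step is a plain dict with keys "text", "assignee", and "output"
# (hypothetical layout, used only for this sketch).
Step = Dict[str, Optional[str]]


def run_until_pause(
    plan: List[Step], agent: Callable[[str], str], start: int = 0
) -> int:
    """Run agent steps from `start`; stop at the first unfinished user step.

    Returns the index of the step waiting on the user, or len(plan)
    when every step has an output.
    """
    i = start
    while i < len(plan):
        step = plan[i]
        if step["output"] is not None:
            i += 1  # already executed, skip
            continue
        if step["assignee"] == "user":
            return i  # pause: control goes back to the user
        step["output"] = agent(step["text"])
        i += 1
    return i


def stub_agent(text: str) -> str:
    """Stand-in for a real agent call; returns a placeholder string."""
    return "agent output for: " + text


plan = [
    {"text": "Search for recent papers", "assignee": "agent", "output": None},
    {"text": "Pick the most relevant paper", "assignee": "user", "output": None},
    {"text": "Summarize the chosen paper", "assignee": "agent", "output": None},
]

paused_at = run_until_pause(plan, stub_agent)  # stops at the user step
plan[paused_at]["output"] = "Paper A"  # user completes their own step
plan[2]["text"] = "Summarize Paper A"  # user edits a future step
finished_at = run_until_pause(plan, stub_agent, start=paused_at)
```

Because the loop returns control at every user step, the user can inspect earlier outputs and rewrite later steps before resuming, which is the interleaving the pipeline describes.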

System Modules

Plan Generator

Generates initial plan steps based on user intent and document context

Model or implementation: Not explicitly reported in the paper

Execution Engine

Executes agent-assigned steps using available tools

Model or implementation: Not explicitly reported in the paper

Novel Architectural Elements

Dynamic Plan-as-Notebook UI: Agent plans are rendered as editable, executable cells within a standard text document
Explicit Agency Delegation: Each step has a toggle/assignment mechanism for 'User' vs 'Agent' execution
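
One way to picture plan cells living inside an ordinary text document is a line format that round-trips between text and structured steps, so that editing the text edits the plan. The sketch below is an assumption-laden illustration: the `[x] @agent: ...` line syntax, `Cell`, `render`, and `parse` are invented here and are not Cocoa's actual representation.

```python
import re
from typing import List, NamedTuple


class Cell(NamedTuple):
    done: bool
    assignee: str  # "user" or "agent"
    text: str


# Hypothetical line syntax: "[ ] @agent: step text" or "[x] @user: step text"
_LINE = re.compile(r"^\[( |x)\] @(user|agent): (.+)$")


def render(cells: List[Cell]) -> str:
    """Serialize plan cells as plain text lines inside a document."""
    return "\n".join(
        f"[{'x' if c.done else ' '}] @{c.assignee}: {c.text}" for c in cells
    )


def parse(document: str) -> List[Cell]:
    """Recover plan cells from a document, ignoring ordinary prose lines."""
    cells = []
    for line in document.splitlines():
        match = _LINE.match(line.strip())
        if match:
            cells.append(
                Cell(match.group(1) == "x", match.group(2), match.group(3))
            )
    return cells
```

With a format like this, toggling a step between 'User' and 'Agent' is a one-token text edit (`@user` to `@agent`), and surrounding notes in the document are left untouched by `parse`.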

Modeling

Base Model: Not explicitly reported in the paper

Comparison to Prior Work

vs. Agent-guided: Cocoa allows users to modify the plan structure itself, not just answer agent questions
vs. User-guided: Cocoa's agent proactively proposes full plans that the user can then refine, reducing initial planning effort
vs. Chat-based Assistants: Cocoa integrates planning and execution in a shared artifact (document) rather than a transient chat stream, enabling persistent context and granular control
vs. Jupyter Notebooks [not cited in paper]: Cocoa applies the notebook cell paradigm to natural language tasks and agent actions rather than just code execution

Limitations

Evaluation focused on scientific research; generalizability to other domains (e.g., coding, creative writing) is untested
Requires users to have some mental model of breaking tasks into steps (computational thinking)
Deployment scale was small (n=7) and short-term (1 week)

📊 Experiments & Results

Evaluation Setup

Qualitative and quantitative evaluation of human-agent collaboration in research tasks

Benchmarks:

Lab Study (Controlled research tasks) [New]
Field Deployment (Real-world daily research work) [New]

Metrics:

User qualitative feedback (steerability, utility)
Usage patterns (frequency of plan editing, self-assignment vs. agent-assignment)
Statistical methodology: Not explicitly reported in the paper
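
As a rough illustration of how the usage-pattern metrics could be computed, the sketch below counts plan edits per session and the share of steps users assigned to themselves. The event log format, the event names (`plan_edit`, `assign_user`, `assign_agent`), and `usage_summary` are hypothetical; the summary does not describe the paper's logging or analysis procedure.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

# (session_id, event_type); both fields are hypothetical for this sketch.
Event = Tuple[str, str]


def usage_summary(events: Iterable[Event]) -> Dict[str, float]:
    """Summarize plan-editing frequency and self-assignment share."""
    events = list(events)  # allow generators; we iterate twice
    counts = Counter(kind for _, kind in events)
    sessions = {session_id for session_id, _ in events}
    assigned = counts["assign_user"] + counts["assign_agent"]
    return {
        "plan_edits_per_session": (
            counts["plan_edit"] / len(sessions) if sessions else 0.0
        ),
        "self_assigned_share": (
            counts["assign_user"] / assigned if assigned else 0.0
        ),
    }
```

The guards against empty input avoid division by zero when a log contains no sessions or no assignment events.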

Main Takeaways

Interleaving planning and execution is highly valued: Users frequently adjusted plans after seeing partial results, preventing error propagation.
Explicit delegation reduces friction: Users assigned difficult or ambiguous steps to themselves (e.g., reading a specific complex paper) while offloading rote tasks (e.g., summarization) to the agent.
The shared document representation provided effective grounding for collaboration and proved better suited than ephemeral chat interfaces to complex, multi-step work.
Users used the system to 'think with' the agent, treating plan generation as a way to structure their own thinking even when they executed the steps themselves.

📚 Prerequisite Knowledge

Prerequisites

Familiarity with computational notebooks (cells, execution flow)
Understanding of mixed-initiative interaction
Basic knowledge of LLM agent capabilities (planning, tool use)

Key Terms

mixed-initiative interaction: A collaboration style where both the human and the computer can proactively contribute to a task, negotiating control and goals

co-planning: A collaborative process where the user and agent jointly create and modify a plan of action

co-execution: A collaborative process where the user and agent jointly carry out the steps of a plan, potentially with different steps assigned to different parties

computational notebook: A programming environment (like Jupyter) that combines code, execution results, and narrative text in a linear sequence of cells

ReAct: Reasoning + Acting, a paradigm in which LLMs generate reasoning traces and task-specific actions in an interleaved manner

chain-of-thought (CoT): A prompting technique where the model generates intermediate reasoning steps before producing a final answer

Wizard-of-Oz (WoZ): A research method where a human simulates the behavior of a system (usually an AI) to test user interactions before the system is fully built