Collaborating with AI Agents: Field Experiments on Teamwork, Productivity, and Performance

📝 Paper Summary

Human-AI Collaboration, Agentic Workflows

A large-scale randomized experiment reveals that human-AI teams produce more ads with higher text quality but lower image quality than human-human teams, driven by shifts toward task-oriented communication and increased delegation.

Core Problem

Prior research on AI productivity typically studies chatbots as passive tools rather than active collaborators, lacking insight into how multimodal, autonomous agents reshape teamwork processes.

Why it matters:

Most existing studies use limited chatbots (not multimodal/agentic) or focus only on individual productivity, missing team-level dynamics.
Rigorous randomized controlled trials (RCTs) that measure how active AI agents change 'in vivo' work processes such as communication and delegation are scarce.
Understanding these dynamics is critical as AI moves from a tool to a teammate in professional workflows.

Concrete Example: In ad creation, a human team might spend time building rapport ('How are you?') and debating edits. An AI-augmented team might skip pleasantries, with the human delegating drafting to the AI and editing less, potentially speeding up text production but failing to catch visual nuances the AI misses.

Key Novelty

Pairit Platform Field Experiment

Develops 'Pairit', a collaborative workspace where AI agents can take the same actions as humans (edit text, generate images, chat), enabling direct comparison of human-human vs. human-AI teams.
Conducts a large-scale RCT (2,234 participants) combining lab-based ad creation with a real-world field test (5M impressions on X) to measure actual market performance.
Identifies specific teamwork mechanisms—task-oriented communication and delegation—that mediate productivity gains and quality shifts.

Evaluation Highlights

Human-AI teams produced 50% more ads per worker compared to human-human teams.
Human-AI teams delegated 17% more work to their partners and performed 62% fewer direct text edits.
Field experiment on X showed human-AI ads had higher click-through rates (driven by better text) while human-human ads had better cost-per-click (driven by better images).

Breakthrough Assessment

8/10

While not a new model architecture, this is a significant empirical breakthrough. It provides rare, large-scale experimental evidence on *how* agentic AI alters work processes, moving beyond simple 'productivity boost' claims to explain the mechanisms of delegation and communication.

⚙️ Technical Details

Problem Definition

Setting: Collaborative creation of marketing campaigns (ads) for a think tank's annual report.

Inputs: Task instructions, annual report content, collaborative chat interface.

Outputs: Completed ad units containing ad copy, call-to-action (CTA), and an image.

Pipeline Flow

Participant Assignment (Randomized Human-Human or Human-AI)
Collaborative Workspace (Pairit Platform)
Ad Generation (Text & Image)
Field Evaluation (Ads run on X)
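
The first pipeline step, randomized assignment, can be sketched as follows. This is a minimal illustration, not the study's procedure: the function name `assign_conditions`, the even split between arms, and the fixed seed are all assumptions made for the example.

```python
import random

def assign_conditions(participant_ids, seed=0):
    """Randomly assign each participant to one of the two study arms.

    Assumption: an even split between arms. The summary does not state
    the actual allocation ratio used in the experiment.
    """
    rng = random.Random(seed)          # seeded so the assignment is reproducible
    shuffled = list(participant_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {
        pid: ("human-ai" if i < half else "human-human")
        for i, pid in enumerate(shuffled)
    }

# Hypothetical participant ids 0..9
assignment = assign_conditions(range(10), seed=42)
```

Because assignment depends only on the shuffle, any difference in outcomes between the two arms can be attributed to the partner type rather than to participant characteristics.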

System Modules

Pairit Platform

Host the collaborative session, manage chat, and log all interactions.

Model or implementation: Custom Web Application

AI Agent

Act as a teammate: send messages, write copy, generate images via API.

Model or implementation: Not named explicitly; likely GPT-4 (inferred from context; the paper mentions an 'LLM-based' agent and 'Dall-E 3')

Image Generator

Create visual assets for ads.

Model or implementation: Dall-E 3

Novel Architectural Elements

Full parity of action space: The AI agent is not just a chatbot but a user in the system that can manipulate the same UI elements (text fields, image selection) as the human.
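
One way to picture action-space parity is a workspace that accepts the same action types from any actor and never branches on whether the actor is human or AI. The sketch below is hypothetical: the `Action` and `Workspace` classes, their fields, and the action names are illustrative assumptions, not Pairit's actual design.

```python
from dataclasses import dataclass, field

@dataclass
class Action:
    actor_id: str
    actor_kind: str   # "human" or "ai"; recorded for logging only
    kind: str         # "send_message", "edit_text", or "select_image"
    payload: str

@dataclass
class Workspace:
    ad_copy: str = ""
    image: str = ""
    chat: list = field(default_factory=list)
    log: list = field(default_factory=list)

    def apply(self, action: Action) -> None:
        # Same code path for every actor: nothing here inspects actor_kind.
        if action.kind == "send_message":
            self.chat.append((action.actor_id, action.payload))
        elif action.kind == "edit_text":
            self.ad_copy = action.payload
        elif action.kind == "select_image":
            self.image = action.payload
        else:
            raise ValueError(f"unknown action kind: {action.kind}")
        self.log.append(action)   # every interaction is logged

ws = Workspace()
ws.apply(Action("p1", "human", "send_message", "Can you draft the copy?"))
ws.apply(Action("agent", "ai", "edit_text", "Read our annual report today."))
ws.apply(Action("agent", "ai", "select_image", "image_2.png"))
```

Logging the actor kind alongside each action is what would make measures such as direct text edits per participant comparable across human-human and human-AI teams.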

Modeling

Base Model: Not explicitly named in the main text (likely GPT-4 for text and reasoning and Dall-E 3 for images, based on context)

Training Method: Prompt Engineering / System Prompting

Compute: Not reported in the paper

Comparison to Prior Work

vs. Noy and Zhang: Studies teams rather than individuals; uses active agents rather than passive chatbots.
vs. Brynjolfsson et al.: Focuses on creative/generative tasks rather than support scripts; measures agentic collaboration.
vs. Dell’Acqua et al.: Explores teamwork mechanisms (delegation, communication) explicitly rather than just output quality.
vs. AutoGen [not cited in paper]: AutoGen focuses on multi-agent software engineering; Pairit focuses on human-agent mixed teams in creative marketing tasks.

Limitations

Study context is limited to advertising creation; results might not transfer to other domains like coding or scientific discovery.
The specific AI model version (e.g., GPT-4 vs GPT-4o) is not explicitly detailed, which could influence the 'jagged frontier' boundary.
Diversity collapse finding suggests potential long-term stagnation in creativity if not managed.
Reliance on a specific platform (Pairit) means UI/UX choices could confound some collaboration behaviors.

Reproducibility

The Pairit platform code is not provided. The specific prompts used for the AI agent are not explicitly detailed in the main text. The dataset size is large (11,024 ads), but availability is not specified.

📊 Experiments & Results

Evaluation Setup

Randomized Controlled Trial (RCT) with 2,234 participants creating ads, followed by a field experiment on X (Twitter).

Benchmarks:

Ad Creation Task (Creative Content Generation) [New]

Metrics:

Ads produced per worker (Productivity)
Text Quality (Human & AI ratings)
Image Quality (Human & AI ratings)
Output Diversity (Self-similarity)
Click-Through Rate (CTR)
Cost-Per-Click (CPC)
View-Through Rate (VTR)

Statistical methodology: Randomized assignment; regression analysis to determine effects of treatment (AI vs Human partner) on outcomes.
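
With a single binary treatment indicator, regressing an outcome on treatment reduces to a difference in group means. The sketch below shows that simplest case in plain Python. The function name and the toy numbers are made up for illustration; the paper's actual specification (covariates, standard errors, clustering) is not described in this summary and is not reproduced here.

```python
def ols_treatment_effect(outcomes, treated):
    """Fit outcome = intercept + effect * treated by ordinary least squares.

    `treated` is 1 for a human-AI team and 0 for a human-human team.
    Returns (intercept, effect); with a binary regressor the intercept is
    the control-group mean and the effect is the difference in means.
    """
    n = len(outcomes)
    mean_y = sum(outcomes) / n
    mean_d = sum(treated) / n
    cov = sum((d - mean_d) * (y - mean_y) for d, y in zip(treated, outcomes))
    var = sum((d - mean_d) ** 2 for d in treated)
    effect = cov / var
    intercept = mean_y - effect * mean_d
    return intercept, effect

# Toy data (not from the paper): ads produced per worker
treated = [0, 0, 0, 0, 1, 1, 1, 1]
outcomes = [2, 4, 3, 3, 5, 4, 6, 5]
intercept, effect = ols_treatment_effect(outcomes, treated)
```

On this toy data the control mean is 3 and the treated mean is 5, so the estimated effect is their difference.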

Key Results

Productivity and output characteristics show large volume gains but mixed quality shifts for Human-AI teams.

Benchmark	Metric	Baseline (human-human)	This Paper (human-AI)	Δ
Ad Creation Task	Ads per worker	Not explicitly reported in the paper	Not explicitly reported in the paper	+50%
Ad Creation Task	Direct Text Edits	Not explicitly reported in the paper	Not explicitly reported in the paper	-62%
Ad Creation Task	Delegation Amount	Not explicitly reported in the paper	Not explicitly reported in the paper	+17%
Ad Creation Task	Task-Oriented Messages	Not explicitly reported in the paper	Not explicitly reported in the paper	+25%
Ad Creation Task	Interpersonal Messages	Not explicitly reported in the paper	Not explicitly reported in the paper	-18%

Main Takeaways

Evidence of a 'jagged frontier': Human-AI teams excelled at text quality (improving CTR/VTR) but lagged in image quality (hurting CPC), while human-human teams showed the reverse.
Diversity Collapse: Human-AI teams produced more homogeneous (self-similar) ads, indicating a reduction in creative variance.
Mechanism of Action: The productivity gains are driven by a shift from 'doing' to 'delegating' and a reduction in social maintenance communication.
Recognition Effect: Participants who correctly identified their partner as AI were more task-oriented and delegated more, suggesting that accurate mental models of the partner are crucial for effective human-AI collaboration.

📚 Prerequisite Knowledge

Prerequisites

Basic understanding of randomized controlled trials (RCTs)
Familiarity with digital advertising metrics (CTR, CPC)
Concept of 'jagged frontier' in AI capabilities

Key Terms

jagged frontier: The uneven landscape of AI capabilities where AI excels at some tasks (e.g., text generation) but fails at others (e.g., complex visual nuance) relative to humans.

diversity collapse: The tendency for AI-generated outputs to be more homogeneous (self-similar) than human-generated outputs.
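
Self-similarity can be made concrete as the average pairwise similarity within a set of ads. The sketch below uses bag-of-words cosine similarity as a stand-in; this choice, the function names, and the sample texts are assumptions for illustration, since the summary does not say which similarity measure or embedding the paper used.

```python
import math
from collections import Counter
from itertools import combinations

def cosine(a, b):
    """Cosine similarity between two texts using word-count vectors."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = (math.sqrt(sum(v * v for v in ca.values()))
            * math.sqrt(sum(v * v for v in cb.values())))
    return dot / norm if norm else 0.0

def self_similarity(texts):
    """Mean pairwise cosine similarity; higher means more homogeneous output."""
    pairs = list(combinations(texts, 2))
    if not pairs:
        return 0.0
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)

# Made-up ad copy, for illustration only
homogeneous = ["read the annual report", "read the new annual report"]
varied = ["read the annual report", "ideas that shape policy"]
```

Under this measure, `self_similarity(homogeneous)` is higher than `self_similarity(varied)`, which is the direction of the diversity-collapse finding for human-AI teams.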

delegation: Assigning task execution to a partner; in this paper, measured by the volume of work requests sent to the AI or human partner.

task-oriented communication: Messages focused on goals, strategy, and execution (e.g., instructions, suggestions) rather than social maintenance.

interpersonal communication: Messages focused on relationship building, rapport, and emotion (e.g., self-assessment, concern).

CTR: Click-Through Rate—the percentage of people who click on an ad after seeing it.

CPC: Cost-Per-Click—the actual price paid for each click on an ad.

VTR: View-Through Rate—the percentage of users who view the content (annual report) after clicking the ad.
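
The three advertising metrics follow directly from the definitions above. The sketch below is a minimal illustration; the function name `ad_metrics` and the campaign numbers are hypothetical and do not come from the paper.

```python
def ad_metrics(impressions, clicks, spend, views):
    """Compute CTR, CPC, and VTR from raw campaign counts.

    CTR = clicks / impressions
    CPC = spend / clicks
    VTR = views / clicks (views of the linked content after a click,
          following the definition given in this summary)
    """
    ctr = clicks / impressions if impressions else 0.0
    cpc = spend / clicks if clicks else None   # undefined when there are no clicks
    vtr = views / clicks if clicks else 0.0
    return {"ctr": ctr, "cpc": cpc, "vtr": vtr}

# Hypothetical campaign: 10,000 impressions, 200 clicks, 50.0 spent, 80 views
m = ad_metrics(impressions=10_000, clicks=200, spend=50.0, views=80)
```

A higher CTR and VTR are better, while a lower CPC is better, so an ad set can lead on one metric and trail on another, as the field results describe.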