Future of Work with AI Agents: Auditing Automation and Augmentation Potential across the U.S. Workforce

📝 Paper Summary

Human-AI Collaboration · AI Auditing · Future of Work

This paper audits the U.S. workforce by surveying 1,500 workers and 52 AI experts to map occupational tasks against worker desires for automation versus technological feasibility, revealing critical misalignments.

Core Problem

We lack a systematic understanding of which specific occupational tasks workers actually *want* AI agents to automate or augment, and how those desires align with current technical capabilities.

Why it matters:

Current narratives rely on binary 'automate-or-not' views or usage logs (e.g., chatbot history) that reflect early adopters rather than broader workforce needs
Investments are skewed: 41% of Y Combinator company-task mappings target tasks where workers have low desire for automation, including tasks where the technology is already capable but unwanted
Ignoring worker agency risks job displacement anxiety and friction in adoption, as workers prefer augmentation (partnership) over full automation for many tasks

Concrete Example: In the 'Arts, Design, and Media' sector, only 17.1% of tasks receive positive automation desire ratings. Although AI agents are technically capable of generating content, they face resistance here because workers value creative control and prefer AI for project management rather than as a replacement for the core artistic process.

Key Novelty

WORKBank Database & Human Agency Scale (HAS)

Introduces the Human Agency Scale (H1-H5) to quantify the necessary level of human involvement, moving beyond a binary automation view to a spectrum including augmentation and partnership
Constructs 'WORKBank' by cross-referencing O*NET occupational tasks with survey data from 1,500 domain workers (desire/agency) and 52 AI experts (technical feasibility)
Uses an audio-enhanced survey format to allow workers to verbally reflect on their daily tasks, yielding more grounded and nuanced preference data than standard text surveys

Architecture

The Auditing Framework workflow

Evaluation Highlights

Workers express positive desire for automation in 46.1% of tasks, primarily to free up time for high-value work, though this varies significantly by sector
41.0% of Y Combinator company-task mappings fall into the 'Low Priority' or 'Red Light' zones (low worker desire), indicating a mismatch between capital investment and worker needs
For 45.2% of occupations, the dominant preferred interaction model is H3 (Equal Partnership), signaling a strong demand for collaborative augmentation rather than full automation
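
To make the "dominant preferred interaction model" statistic concrete, the sketch below picks the most frequent worker-desired HAS level per occupation and then computes the share of occupations where that level is H3. The occupations, ratings, and tie-breaking rule are hypothetical, and the summary does not state exactly how the paper defines "dominant", so treat this as one plausible reading rather than the authors' procedure.

```python
from collections import Counter

# Hypothetical toy data: occupation -> worker-desired HAS level (1-5), one per task.
desired_has = {
    "Technical Writer": [3, 3, 2, 4],
    "Tax Preparer": [2, 2, 3],
    "Editor": [4, 3, 3, 5, 3],
}

def dominant_level(levels):
    """Return the most frequent HAS level; ties go to the lower level (an assumption)."""
    counts = Counter(levels)
    best = max(counts.values())
    return min(level for level, n in counts.items() if n == best)

def share_with_dominant(data, target):
    """Fraction of occupations whose dominant HAS level equals `target`."""
    hits = sum(1 for levels in data.values() if dominant_level(levels) == target)
    return hits / len(data)

dominant = {occ: dominant_level(levels) for occ, levels in desired_has.items()}
h3_share = share_with_dominant(desired_has, 3)
print(dominant, round(h3_share, 3))
```

With real data, the same two functions would be applied to the per-task Hw values grouped by occupation.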

Breakthrough Assessment

8/10

Provides the first large-scale, grounded dataset (WORKBank) linking technical AI agent capabilities with actual worker preferences at the task level. Effectively challenges the 'automation-first' narrative.

⚙️ Technical Details

Problem Definition

Setting: Survey-based auditing of occupational tasks for AI suitability

Inputs: Occupational tasks from O*NET database (filtered for computer compatibility)

Outputs: Worker ratings for Automation Desire (Aw) and Desired Human Agency (Hw); Expert ratings for Automation Capability (Ae) and Feasible Human Agency (He)
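
The four output ratings can be pictured as one record per task. The sketch below is a hypothetical data structure, not the WORKBank schema: the field names, the 1-5 scales, and the use of a simple mean for aggregation are all assumptions made for illustration.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List

@dataclass
class TaskAudit:
    """Raw ratings for one O*NET task (all scales assumed to run 1-5)."""
    occupation: str
    task: str
    worker_desire: List[int]      # Aw: automation desire, rated by workers
    worker_agency: List[int]      # Hw: desired HAS level, rated by workers
    expert_capability: List[int]  # Ae: automation capability, rated by experts
    expert_agency: List[int]      # He: feasible HAS level, rated by experts

    def summary(self):
        """Aggregate each rating list with a plain mean (an assumption)."""
        return {
            "Aw": mean(self.worker_desire),
            "Hw": mean(self.worker_agency),
            "Ae": mean(self.expert_capability),
            "He": mean(self.expert_agency),
        }

    def agency_gap(self):
        """Positive when workers want more human agency than experts deem necessary."""
        s = self.summary()
        return s["Hw"] - s["He"]

# Hypothetical example record.
record = TaskAudit(
    occupation="Tax Preparer",
    task="Schedule appointments with clients",
    worker_desire=[4, 5, 3],
    worker_agency=[3, 3, 4],
    expert_capability=[4, 4],
    expert_agency=[2, 3],
)
print(record.summary(), round(record.agency_gap(), 3))
```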

Pipeline Flow

Task Sourcing (O*NET)
Worker Survey (Audio-Enhanced)
Expert Annotation
Data Aggregation & Analysis

System Modules

Task Sourcing (Data Collection)

Select complex, multi-step tasks from O*NET

Model or implementation: Filtering criteria

Worker Survey Interface (Data Collection)

Collect worker preferences (Aw, Hw) with audio reflection

Model or implementation: Custom Web Interface

Expert Annotation (Data Collection)

Assess technical feasibility (Ae, He)

Model or implementation: Human Experts (52 researchers/practitioners)

Novel Architectural Elements

Audio-enhanced mini-interview module in survey design to capture nuanced worker reasoning before quantitative rating
Dual-audit topology: Simultaneously mapping tasks against 'Worker Desire' and 'Technical Capability' to create a four-zone landscape
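
The four-zone landscape amounts to splitting the desire-capability plane into quadrants. In the sketch below, 'Red Light' and 'Low Priority' are the zone names used in this summary; the labels 'Green Light' and 'R&D Opportunity' for the two high-desire quadrants, the 3.0 midpoint threshold, and the toy scores are assumptions that should be checked against the paper.

```python
def classify_zone(desire, capability, threshold=3.0):
    """Map mean worker desire (Aw) and mean expert capability (Ae) to a zone.

    The threshold of 3.0 (midpoint of a 1-5 scale) is an assumption.
    """
    high_desire = desire >= threshold
    high_capability = capability >= threshold
    if high_desire and high_capability:
        return "Green Light"       # wanted and feasible (assumed label)
    if high_capability:
        return "Red Light"         # feasible but unwanted
    if high_desire:
        return "R&D Opportunity"   # wanted but not yet feasible (assumed label)
    return "Low Priority"          # neither wanted nor feasible

# Hypothetical (Aw, Ae) pairs for four tasks.
scores = [(4.2, 4.0), (2.1, 4.5), (4.0, 2.0), (1.5, 1.8)]
zones = [classify_zone(aw, ae) for aw, ae in scores]

# Share of tasks in the two low-desire zones, the quantity behind the YC mismatch figure.
low_desire_share = sum(z in ("Red Light", "Low Priority") for z in zones) / len(zones)
print(zones, low_desire_share)
```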

Comparison to Prior Work

vs. Eloundou et al. (2023): WORKBank includes direct *worker preference* data rather than just theoretical exposure
vs. Handa et al. (2025): Forward-looking audit of *potential* and *desire* rather than retrospective analysis of current chatbot usage logs
vs. Hoffmann et al. (2024): Covers 104 diverse occupations rather than a single domain like software engineering

Limitations

Expert agreement on capability ratings is moderate (Krippendorff’s alpha ~0.5), reflecting the difficulty in assessing rapidly evolving AI limits
Survey sample is limited to 1,500 workers across 104 occupations, which may not fully represent the entire U.S. labor market
Relies on self-reported desires, which may change as workers become more familiar with actual AI agent capabilities

Reproducibility

The paper describes the methodology for creating WORKBank and the definitions for the HAS scale. The dataset itself (WORKBank) is described as a contribution, implying availability, though a specific URL is not provided in the text. The survey interface and expert recruitment details are in the Appendix.

📊 Experiments & Results

Evaluation Setup

Statistical analysis of survey data (workers) and annotation data (experts) across occupational tasks

Benchmarks:

WORKBank (Occupational Task Audit) [New]

Metrics:

Automation Desire Score (Aw)
Automation Capability Score (Ae)
Human Agency Scale (HAS) Level (H1-H5)
Statistical methodology: Spearman correlation coefficients reported for relationships between desire/capability and factors like enjoyment/job loss concern
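
For readers unfamiliar with Spearman correlation, the sketch below computes it from scratch: convert each variable to ranks (averaging ranks for ties) and take the Pearson correlation of the ranks. The input values are toy numbers, not WORKBank data, and the function names are illustrative.

```python
from math import sqrt

def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole group of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)

def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Toy example: two orderings that mostly agree.
rho = spearman([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])
print(round(rho, 3))
```

In practice a library routine such as `scipy.stats.spearmanr` would normally be used; the hand-rolled version is shown only to expose the mechanics.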

Key Results

Analysis of worker automation desire reveals specific patterns based on task nature and sector.

Benchmark	Metric	Baseline	This Paper	Δ
WORKBank	Percentage of tasks with positive automation desire	53.9	46.1	-7.8
WORKBank	Positive Automation Desire Rate	17.1	Not reported in the paper	Not reported in the paper

Correlation analysis shows weak alignment between worker desires and technical capabilities.

Benchmark	Metric	Baseline	This Paper	Δ
WORKBank	Spearman Correlation (rho)	1.0	0.17	-0.83
WORKBank	Spearman Correlation (rho)	0.0	-0.28	-0.28

Experiment Figures

The Desire-Capability Landscape (Scatter plot of Worker Desire vs. Expert Capability)

The Human Agency Scale (HAS) definitions

Main Takeaways

Workers strongly prefer 'Equal Partnership' (H3) or higher agency for most tasks, rejecting AI-dominant execution (H1/H2) for critical work
Current startup investment (proxied by YC companies) is inefficiently distributed: 41% targets 'Red Light' (high capability, low desire) or 'Low Priority' zones
There is a shift in core competencies: skills like 'analyzing information' are becoming less valuable/wage-correlated, while 'interpersonal' and 'organizational' skills are gaining importance
Top reasons for desiring automation are 'freeing up time' (69.4%) and 'repetitiveness' (46.6%), whereas resistance stems from 'lack of trust' (45.0%) and 'loss of human touch' (16.3%)

📚 Prerequisite Knowledge

Prerequisites

Understanding of the O*NET database structure (occupations vs. tasks)
Basic familiarity with AI agents and Large Language Models (LLMs)
Knowledge of Likert scales for survey data

Key Terms

O*NET: The Occupational Information Network—a free online database that contains hundreds of occupational definitions to help students, job seekers, businesses, and workforce development professionals

HAS: Human Agency Scale—a novel 5-point scale introduced in this paper to quantify the degree of human involvement required for a task (H1=Fully Autonomous AI, H5=Human Essential)

SAE levels: Society of Automotive Engineers levels of driving automation (L0-L5); used here as a contrast to the proposed Human Agency Scale

Likert scale: A psychometric scale commonly used in questionnaire-based research (e.g., 'Strongly Agree' to 'Strongly Disagree')

Y Combinator (YC): A prominent startup accelerator; used in this paper as a proxy for current industry investment trends in AI

WORKBank: The dataset constructed in this paper, combining O*NET tasks with worker preferences and expert capability ratings

Inter-annotator agreement: A measure of how well two or more raters agree; here measured by Krippendorff's alpha
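
As a rough illustration of how Krippendorff's alpha behaves, the sketch below implements the standard formula alpha = 1 - D_o / D_e with the interval (squared-difference) distance. The summary does not say which distance metric the paper used, so the interval choice is an assumption, and the ratings are toy values.

```python
def krippendorff_alpha_interval(units):
    """Krippendorff's alpha with squared-difference distance.

    `units` is a list of lists; each inner list holds the ratings that
    different annotators gave to one item. Items with fewer than two
    ratings are not pairable and are dropped.
    """
    units = [u for u in units if len(u) >= 2]
    values = [v for u in units for v in u]
    n = len(values)

    # Observed disagreement: within-item ordered pairs, weighted by 1 / (m_u - 1).
    d_o = 0.0
    for u in units:
        pair_sum = sum(
            (a - b) ** 2
            for i, a in enumerate(u)
            for j, b in enumerate(u)
            if i != j
        )
        d_o += pair_sum / (len(u) - 1)
    d_o /= n

    # Expected disagreement: ordered pairs across all pooled ratings.
    d_e = sum(
        (a - b) ** 2
        for i, a in enumerate(values)
        for j, b in enumerate(values)
        if i != j
    ) / (n * (n - 1))

    if d_e == 0:
        raise ValueError("alpha is undefined when all ratings are identical")
    return 1.0 - d_o / d_e

# Perfect agreement gives 1.0; systematic disagreement goes negative.
perfect = krippendorff_alpha_interval([[1, 1], [2, 2], [3, 3]])
opposed = krippendorff_alpha_interval([[1, 2], [1, 2]])
print(perfect, opposed)
```

Values near 0.5, as reported for the expert capability ratings, sit between chance-level agreement (0) and perfect agreement (1).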