Narrowing Action Choices with AI Improves Human Sequential Decisions

📝 Paper Summary

Human-AI Collaboration, Sequential Decision Making, Decision Support Systems

This system improves human sequential decision-making by using an AI to restrict available actions to a high-quality subset, optimizing the subset size via a bandit algorithm to balance human and AI agency.

Core Problem

In sequential tasks, humans struggle to know when to cede agency to an AI predictor or exercise their own judgment, often leading to suboptimal reliance on decision support systems.

Why it matters:

Conventional decision support achieves complementarity only if human experts correctly weigh their own uncertainty against the AI's uncertainty, which they often fail to do.
Full automation (AI-only) fails to leverage salient, hard-to-quantify information that human experts possess (e.g., qualitative observations in healthcare or disaster response).
Existing advice-based systems (recommenders) allow humans to ignore advice completely, failing to prevent obvious errors.

Concrete Example: In a wildfire mitigation game, a human player might choose a burning tile that looks dangerous but is actually low-priority. A standard AI might just recommend one tile, which the human might reject. This system forces the human to pick from the top-3 AI-ranked tiles, preventing a bad choice while allowing the human to use intuition to pick the best among the good options.

Key Novelty

Adaptive Action Sets via Lipschitz Bandits

Instead of recommending a single action, the system provides an 'action set' (a subset of allowed actions) derived from an AI agent's rankings.
The size of this set is controlled by a continuous parameter epsilon, which smoothly adjusts the level of human agency (epsilon=0 is AI-only, epsilon=1 is full human control).
Uses a novel bandit algorithm that leverages the smoothness (Lipschitz continuity) of the reward function to efficiently find the optimal epsilon during interaction.

Architecture

A Structural Causal Model (SCM) visualization of the decision-making process.

Evaluation Highlights

Humans supported by the system (at optimal agency level) outperform humans playing alone by 29.65% in a wildfire mitigation game.
The supported humans outperform the standalone AI agent (Deep Q-Network) by 2.31%, demonstrating true human-AI complementarity.
The proposed Lipschitz bandit algorithm achieves lower simple regret than a uniform discretization baseline given the same exploration budget.

Breakthrough Assessment

8/10

Offers a theoretically grounded and empirically validated method for sequential human-AI complementarity, achieving gains over both human-only and AI-only baselines in a large-scale study.

⚙️ Technical Details

Problem Definition

Setting: Sequential decision making modeled as a Structural Causal Model (SCM) where a human takes actions from a constrained set provided by a policy.

Inputs: Environment state Z_t and a potentially lossy representation S_t observed by the system.

Outputs: An action set C_t (subset of all possible actions A) from which the human must choose an action A_t.

Pipeline Flow

State Observation (S_t, a possibly lossy view of Z_t) -> AI Agent Valuation (q(s,a))
Noise Injection (W) -> Action Scoring
Thresholding (epsilon) -> Action Set Construction (C_t)
Human Decision (A_t from C_t) -> Environment Transition

System Modules

AI Agent

Estimates the quality (Q-values) of all possible actions given the current state representation.

Model or implementation: Deep Q-Network (DQN)

Action Set Constructor

Filters actions to create the allowable subset C_t based on the agency parameter epsilon.

Model or implementation: Algorithmic Policy (Equation 3)
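The filtering step can be illustrated with a minimal sketch. It is not the paper's Equation 3: the thresholding rule (keep an action when its min-max normalized score plus half-normal noise is at least 1 - epsilon) and the function name build_action_set are assumptions chosen so that epsilon=0 keeps only the top-ranked action and epsilon=1 keeps every action.

```python
import numpy as np


def build_action_set(q_values, epsilon, noise_sigma=0.01, rng=None):
    """Return indices of allowed actions (hypothetical rule, not Equation 3)."""
    rng = np.random.default_rng() if rng is None else rng
    q = np.asarray(q_values, dtype=float)

    # Min-max normalize the Q-values to [0, 1]; ties everywhere map to 1.
    span = q.max() - q.min()
    scores = np.ones_like(q) if span == 0 else (q - q.min()) / span

    # Half-normal noise W = |N(0, sigma^2)|, added to each score (assumption).
    if noise_sigma > 0:
        noise = np.abs(rng.normal(0.0, noise_sigma, size=q.shape))
    else:
        noise = np.zeros_like(q)

    # Keep every action whose noisy score clears the agency threshold.
    allowed = scores + noise >= 1.0 - epsilon
    return np.flatnonzero(allowed).tolist()


if __name__ == "__main__":
    q = [1.0, 0.9, 0.2, 0.0]  # made-up Q-values for four actions
    for eps in (0.0, 0.15, 1.0):
        print(eps, build_action_set(q, eps, noise_sigma=0.0))
```

Because the noise is non-negative, the top-ranked action always clears the threshold under this rule, so the set is never empty.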

Lipschitz Bandit

Searches online for the value of epsilon that maximizes the expected cumulative reward of the human-AI team.

Model or implementation: Zooming Algorithm (Algorithm 1)
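For intuition, here is a sketch of a textbook zooming algorithm on the interval [0, 1]. It should not be read as the paper's Algorithm 1: restricting candidates to a finite grid, the confidence radius sqrt(2 log T / (n + 1)), rewards scaled to [0, 1], and recommending the most-pulled arm are all assumptions of this illustration, and toy_reward is a made-up stand-in for the human-AI team's score.

```python
import math


def zooming(pull, horizon, lipschitz=1.0, grid_size=21):
    """Textbook zooming algorithm on [0, 1] over a finite candidate grid.

    pull(eps) should return a reward in [0, 1]; lipschitz is an assumed bound L.
    Returns (recommended_eps, pull_counts).
    """
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    counts, sums = {}, {}  # statistics for active arms only

    def radius(x):
        # Confidence radius shrinks as an arm is pulled more often.
        return math.sqrt(2.0 * math.log(horizon) / (counts[x] + 1))

    def index(x):
        # Optimistic index: empirical mean plus twice the confidence radius.
        mean = sums[x] / counts[x] if counts[x] else 0.0
        return mean + 2.0 * radius(x)

    for _ in range(horizon):
        # Activation rule: activate one candidate not covered by any
        # confidence ball of an already active arm.
        for y in grid:
            if y not in counts and all(
                lipschitz * abs(y - x) > radius(x) for x in counts
            ):
                counts[y], sums[y] = 0, 0.0
                break
        # Selection rule: pull the active arm with the largest index.
        arm = max(counts, key=index)
        sums[arm] += pull(arm)
        counts[arm] += 1

    # Recommendation (an assumption of this sketch): the most-pulled arm.
    return max(counts, key=counts.get), counts


def toy_reward(eps):
    # Hypothetical noise-free reward that peaks at eps = 0.6.
    return 1.0 - abs(eps - 0.6)


if __name__ == "__main__":
    best_eps, pulls = zooming(toy_reward, horizon=300)
    print(best_eps, len(pulls), sum(pulls.values()))
```

The design point is that arms are activated only where the current confidence balls leave a gap, so exploration concentrates on regions that still look promising instead of being spread evenly over all values of epsilon.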

Novel Architectural Elements

Integration of a continuum-armed bandit (Lipschitz bandit) directly into the human-AI loop to optimize the size of the action set (level of agency) in real-time.
The specific construction of the decision support policy using half-normal noise to ensure the reward function is Lipschitz continuous w.r.t. the agency parameter.

Modeling

Base Model: Deep Q-Network (DQN)

Training Method: The AI agent (DQN) is pre-trained; the Bandit algorithm optimizes the interaction parameter epsilon online.

Key Hyperparameters:

noise_sigma: 0.01 (standard deviation of half-normal noise W)
discount_factor_gamma: close to 1 (about 0.99, inferred from the score-100 note rather than stated explicitly)
bandit_discretization_levels: 100 (for the uniform baseline)

Compute: Experiments run on a Debian machine with an AMD EPYC 7662 processor, 20 cores, 32GB memory.

Comparison to Prior Work

vs. Algorithmic Triage: Offers a continuous spectrum of agency (action sets) rather than a binary hand-off.
vs. Action Recommendation: Enforces the AI's filtering (hard constraints on bad actions) while retaining choice among good ones.
vs. Action Masking in RL: Optimizes the mask (action set) to maximize *human-AI team* reward rather than just agent reward.

Limitations

Requires an estimate of the Lipschitz constant L for the bandit algorithm.
Assumes the decision making process eventually terminates (episodic).
Requires a pre-trained AI agent that provides reasonable value estimates.
Human participants were recruited from Prolific, which may not represent expert decision makers in high-stakes domains.

Reproducibility

Code: https://github.com/Networks-Learning/narrowing-action-choices

Human subject data available. Study recruited 1,600 participants via Prolific.

📊 Experiments & Results

Evaluation Setup

Wildfire mitigation game where participants act as firefighters on a 10x10 grid to prevent fire spread.

Benchmarks:

Wildfire Mitigation Game (Sequential Decision Making / Resource Allocation) [New]

Metrics:

Discounted Cumulative Reward (Score)
Simple Regret (for bandit performance)
Statistical methodology: t-test reported for distribution differences (p-value < 0.01).
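The discounted cumulative reward metric listed above can be computed as follows. This is a generic sketch of the standard definition (the sum of gamma^t times the reward at step t); the default gamma of 0.99 is the approximate value inferred in this summary, not a confirmed setting, and the reward sequence is made up.

```python
def discounted_return(rewards, gamma=0.99):
    """Sum of gamma**t * rewards[t], accumulated backwards for stability."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


if __name__ == "__main__":
    # Hypothetical per-step rewards from one short episode.
    print(discounted_return([1.0, 0.0, 2.0]))
```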

Key Results

Comparative analysis of human performance when supported by the system vs. baselines.
Benchmark	Metric	Baseline	This Paper	Δ
Wildfire Game	Discounted Cumulative Reward	Human alone (absolute score not reported)	Supported human at optimal epsilon (absolute score not reported)	+29.65%
Wildfire Game	Discounted Cumulative Reward	Standalone AI agent, DQN (absolute score not reported)	Supported human at optimal epsilon (absolute score not reported)	+2.31%

Experiment Figures

Bar chart comparing average cumulative rewards across different agents/settings.

Simple regret curves for the bandit algorithm vs. uniform discretization as a function of exploration budget n.

Main Takeaways

Narrowing action choices lets humans add complementary value even when the AI agent alone is significantly stronger than the average human alone.
The optimal level of agency (epsilon) is neither full automation (0) nor full human control (1), but an intermediate value found efficiently by the algorithm.
The proposed Lipschitz bandit algorithm successfully identifies the optimal agency level with decreasing regret as samples increase, outperforming uniform discretization.

📚 Prerequisite Knowledge

Prerequisites

Sequential Decision Making / Markov Decision Processes (MDPs)
Multi-Armed Bandits (specifically Lipschitz/Continuum bandits)
Deep Q-Networks (DQN)
Structural Causal Models (SCM)

Key Terms

action set: A subset of all possible actions presented to the user, filtering out poor choices while retaining a range of good options; at epsilon=1 it contains every action.

Lipschitz continuity: A smoothness property of a function where the rate of change is bounded; here, it means small changes in the agency parameter epsilon lead to limited changes in expected reward.
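Written out, the property reads as below. The symbol \mu for the expected reward as a function of the agency parameter is notation assumed here for illustration; L is the Lipschitz constant that the bandit algorithm needs an estimate of.

```latex
\[
  \lvert \mu(\epsilon) - \mu(\epsilon') \rvert \;\le\; L \,\lvert \epsilon - \epsilon' \rvert
  \qquad \text{for all } \epsilon, \epsilon' \in [0, 1].
\]
```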

simple regret: The difference between the expected payoff of the optimal parameter choice and the expected payoff of the parameter the algorithm selects after n rounds.
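As a small worked example of this definition, the sketch below runs a uniform discretization strategy (pull every grid value of epsilon equally often, then recommend the one with the best empirical mean) and measures its simple regret. The function names, the toy reward curve, and the 11-level grid are hypothetical; only the idea of a uniform discretization baseline comes from the summary.

```python
def uniform_discretization(pull, budget, levels=100):
    """Split the budget evenly over a uniform grid on [0, 1]; return the best arm."""
    if levels < 2 or budget < levels:
        raise ValueError("need levels >= 2 and at least one pull per level")
    arms = [i / (levels - 1) for i in range(levels)]
    pulls_per_arm = budget // levels
    means = [
        sum(pull(arm) for _ in range(pulls_per_arm)) / pulls_per_arm
        for arm in arms
    ]
    best_index = max(range(levels), key=means.__getitem__)
    return arms[best_index]


def simple_regret(expected_reward, chosen, grid=1001):
    """Best expected reward on a fine grid minus that of the chosen parameter."""
    best = max(expected_reward(i / (grid - 1)) for i in range(grid))
    return best - expected_reward(chosen)


def toy_expected_reward(eps):
    # Hypothetical smooth reward curve that peaks at eps = 0.5.
    return 1.0 - (eps - 0.5) ** 2


if __name__ == "__main__":
    chosen = uniform_discretization(toy_expected_reward, budget=11, levels=11)
    print(chosen, simple_regret(toy_expected_reward, chosen))
```

With noisy rewards, pull would return a random sample rather than the expected value, and the regret would depend on how many pulls each grid point receives.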

zooming dimension: A measure of the difficulty of a bandit problem; it captures how many near-optimal arms need to be explored.

Deep Q-Network (DQN): A reinforcement learning algorithm that uses a neural network to estimate the value (Q-value) of taking a specific action in a specific state.

min-max normalization: Rescaling data (here, action scores) to a fixed range, typically [0, 1].

SFT: Supervised Fine-Tuning, i.e., training a model on labeled examples before applying reinforcement learning (mentioned in context of related work/baselines).