AI Agents for Inventory Control: Human-LLM-OR Complementarity

📝 Paper Summary

Agentic AI, Human-AI Collaboration, Operations Research (OR) integration

Combines traditional Operations Research heuristics, LLM reasoning, and human oversight into a hybrid inventory management pipeline, demonstrating that these components are complementary rather than substitutes.

Core Problem

Traditional inventory algorithms are brittle under demand shifts and cannot use textual context, LLMs lack the mathematical precision needed for stock calculations, and human decision-makers are inconsistent.

Why it matters:

Inventory control is fundamental to supply chains but struggles with non-stationary demand (trends, shocks) and unobservable contexts (news, seasonality)
Purely algorithmic solutions fail when historical data doesn't reflect the current environment
Human-AI teams often fail to outperform the better of the two acting alone; proving genuine complementarity in high-stakes operations is an open challenge

Concrete Example: An OR algorithm seeing a demand spike might assume it's noise and understock, while an LLM reading a product description knows 'swimwear' has seasonal demand. Conversely, an LLM might hallucinate the arithmetic of pipeline inventory, which the OR algorithm calculates perfectly.

Key Novelty

OR-Augmented LLM Agents with Human-in-the-Loop

Uses a standard OR heuristic (capped base-stock policy) to generate a mathematically grounded 'recommendation' that the LLM can adopt or override based on textual context
Implements a 'carry-over insight' memory mechanism where the LLM writes concise memos about structural changes (e.g., 'lead time is actually 3 weeks') to pass to future steps
Formalizes 'individual-level complementarity' to show that humans add value to the AI pipeline, rather than merely choosing when to use it

Architecture

Figure: The OR→LLM agent architecture, showing how inputs are processed and decisions are made in each period.

Evaluation Highlights

OR→LLM agent (Gemini 3 Flash) achieves 0.538 normalized profit on InventoryBench, a 21% relative improvement over the OR heuristic alone (0.445)
Human-in-the-loop (Mode B: OR→LLM→Human) significantly outperforms fully automated agents (OR→LLM) and Human-only baselines
Theoretical analysis estimates that at least 20.3% of individual participants experience strictly positive complementarity (performing better with AI than either they or the AI could alone)

Breakthrough Assessment

8/10

Strong empirical evidence of Human-AI complementarity in a complex domain, backed by a new benchmark (InventoryBench) and a theoretical framework for measuring individual-level gains.

⚙️ Technical Details

Problem Definition

Setting: Multi-period inventory control with lost sales, deterministic but unknown lead times, and non-stationary demand

Inputs: Current inventory I_t, demand history, arrival history, contextual text x_t (product info, calendar), and OR heuristic recommendations

Outputs: Order quantity q_t, chosen to maximize total profit over the horizon T
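The setting above can be made concrete with a small simulator. This is a minimal sketch, not the paper's environment: the profit structure (a per-unit margin `price` on sales minus a per-unit `holding` cost on leftover stock), the function name `simulate`, and the convention that an order placed in period t arrives `lead_time` periods later are all assumptions made for illustration.

```python
from collections import deque


def simulate(orders, demands, lead_time, price=4.0, holding=1.0, start_inventory=0):
    """Simulate a single-product lost-sales system and return total profit.

    Assumed (hypothetical) profit: price * units sold - holding * leftover units.
    """
    on_hand = start_inventory
    # pipeline[0] is the order that arrives in the current period.
    pipeline = deque([0] * lead_time)
    profit = 0.0
    for q, d in zip(orders, demands):
        pipeline.append(q)              # place this period's order
        on_hand += pipeline.popleft()   # receive whatever is due now
        sales = min(on_hand, d)         # unmet demand is lost, not backlogged
        on_hand -= sales
        profit += price * sales - holding * on_hand
    return profit


print(simulate(orders=[5, 5, 5], demands=[3, 3, 3], lead_time=1))  # 18.0
```

With a one-period lead time, nothing arrives in the first period, so all three units of demand there are lost; this is the kind of pipeline bookkeeping the summary says LLMs can get wrong.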

Pipeline Flow

OR Module (Calculates base stats & recommendation)
LLM Agent (Reads Context + OR Rec -> Decides Order)
Human (Optional: Reviews LLM reasoning -> Final Decision)
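The three stages above can be sketched as a single per-period function. This is an illustrative outline only: `decide_order` and its callable parameters are hypothetical names, and the lambdas stand in for the real OR module, LLM call, and web interface, none of which are specified in this summary.

```python
def decide_order(state, or_recommend, llm_decide, human_review=None):
    """One period of the OR -> LLM -> (optional) Human flow.

    or_recommend(state) -> int                     : OR module's suggested order
    llm_decide(state, or_rec) -> (int, str)        : LLM's order and its rationale
    human_review(llm_order, rationale) -> int      : optional final decision
    """
    or_rec = or_recommend(state)
    llm_order, rationale = llm_decide(state, or_rec)
    # The OR output is a soft recommendation: the LLM may adopt or override it,
    # and the human (if present) has the final say.
    final = llm_order if human_review is None else human_review(llm_order, rationale)
    return max(0, int(final))


# Stub components for illustration only.
state = {"on_hand": 10, "context": "swimwear, early June"}
order = decide_order(
    state,
    or_recommend=lambda s: 20,
    llm_decide=lambda s, rec: (rec + 5, "seasonal uplift expected"),
    human_review=lambda llm_order, why: llm_order - 2,
)
print(order)  # 23
```

Leaving `human_review` as `None` gives the fully automated OR→LLM variant.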

System Modules

OR Heuristic

Generate mathematically grounded inventory targets based on historical data

Model or implementation: Capped Base-Stock Policy (Xin, 2021)
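For readers unfamiliar with the policy, a capped base-stock rule orders up to a target level but never more than a fixed cap per period. The sketch below shows that rule in its textbook form; how the paper chooses the target level and the cap is not described in this summary, so `base_stock_level` and `order_cap` are treated as given inputs and the function name is hypothetical.

```python
def capped_base_stock_order(on_hand, pipeline, base_stock_level, order_cap):
    """Order up to base_stock_level, but never more than order_cap in one period."""
    # Inventory position counts stock on hand plus orders already in transit.
    inventory_position = on_hand + sum(pipeline)
    shortfall = max(0, base_stock_level - inventory_position)
    return min(shortfall, order_cap)


print(capped_base_stock_order(on_hand=10, pipeline=[5, 5], base_stock_level=50, order_cap=20))  # 20
print(capped_base_stock_order(on_hand=10, pipeline=[5, 5], base_stock_level=25, order_cap=20))  # 5
```

In the first call the shortfall is 30 units, so the cap binds; in the second the shortfall of 5 is below the cap and is ordered in full.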

LLM Agent

Synthesize OR output with world knowledge and context to handle anomalies

Model or implementation: Gemini 3 Flash / Grok 4.1 Fast / GPT-5 Mini (evaluated variants)

Human Decision Maker

Provide final judgment or strategic guidance

Model or implementation: Human (via Web Interface)

Novel Architectural Elements

OR→LLM pipeline where the algorithm acts as a 'calculator' tool providing a soft recommendation rather than a hard constraint
Carry-over insight mechanism allowing the LLM to maintain a compact, evolving mental model of environment parameters (like lead time shifts) across a 50-period horizon
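One way to picture the carry-over insight mechanism is as a short memo threaded through the per-period prompts. The sketch below is an assumption about the general shape, not the paper's prompt format: `build_prompt`, `run_horizon`, and the `llm_step` callable (which must return an order and an updated memo) are hypothetical names.

```python
def build_prompt(context, or_recommendation, memo):
    """Assemble the per-period prompt, including the memo written last period."""
    return (
        f"Context: {context}\n"
        f"OR recommendation: {or_recommendation}\n"
        f"Carry-over insight: {memo or '(none yet)'}\n"
        "Reply with an order quantity and an updated insight."
    )


def run_horizon(contexts, or_recommendations, llm_step):
    """Run all periods, passing the LLM-written memo from one period to the next."""
    memo = ""
    orders = []
    for context, rec in zip(contexts, or_recommendations):
        # llm_step is any callable: prompt -> (order, updated_memo).
        order, memo = llm_step(build_prompt(context, rec, memo))
        orders.append(order)
    return orders, memo
```

Because only the latest memo is passed forward, the prompt stays compact even over a 50-period horizon, at the cost of relying on the LLM to keep the memo accurate.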

Modeling

Base Model: Gemini 3 Flash, Grok 4.1 Fast, GPT-5 Mini (as evaluated in paper)

Comparison to Prior Work

vs. Standard OR: Uses LLM to handle non-stationarity and textual context which OR ignores
vs. RL for Inventory: Uses pre-trained LLM world knowledge instead of training from scratch; zero-shot/few-shot adaptation
vs. Duan et al. (2025) [cited in paper]: Uses LLM for daily decision-making reasoning, not just for parameter extraction (e.g., estimating holding cost)
vs. OptiGuide [not cited in paper]: Focuses on direct operational control with human oversight, rather than just interpreting optimization results

Limitations

LLMs can still struggle with precise inventory pipeline tracking (arithmetic errors)
LLMs are less calibrated to specific cost tradeoffs (under- vs. over-stocking) than OR algorithms
Study limited to single-product settings; does not address multi-echelon or joint inventory optimization

Reproducibility

Code: https://github.com/TianyiPeng/AI-human-inventory-game.git

📊 Experiments & Results

Evaluation Setup

Sequential decision-making over 50 periods (T=50) with varying demand patterns and lead times

Benchmarks:

InventoryBench (Multi-period inventory control) [New]

Metrics:

Normalized Profit (0-1 scale, where 1 corresponds to the profit achieved with perfect foresight)
Complementarity (Individual and Population level)
Statistical methodology: Pre-registered experiment. Statistical significance reported for human experiments.

Key Results

Performance of automated agents on InventoryBench (Gemini 3 Flash). The combination of OR and LLM outperforms either in isolation.

Benchmark	Metric	Baseline	This Paper	Δ
InventoryBench	Normalized Profit	0.445	0.538	+0.093
InventoryBench	Normalized Profit	0.334	0.538	+0.204

Human-in-the-loop experiment results (real-data instances). Human collaboration adds value beyond the best automated agent.

Benchmark	Metric	Baseline	This Paper	Δ
Real-data instances	Normalized Profit	0.534	0.584	+0.050
Real-data instances	Normalized Profit	0.540	0.584	+0.044
Real-data instances	Estimated Fraction of Positive Complementarity	0	0.203	+0.203
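The absolute gains in the table can be converted to relative gains with a few lines of arithmetic. The snippet below is only a worked check on the reported numbers; the helper name `relative_gain` is illustrative.

```python
def relative_gain(baseline, ours):
    """Relative improvement of `ours` over `baseline`, as a fraction."""
    return (ours - baseline) / baseline


# (baseline, this paper) pairs for the Normalized Profit rows, copied from the table.
rows = [(0.445, 0.538), (0.334, 0.538), (0.534, 0.584), (0.540, 0.584)]
for baseline, ours in rows:
    print(f"{baseline:.3f} -> {ours:.3f}: {ours - baseline:+.3f} absolute, "
          f"{relative_gain(baseline, ours):+.1%} relative")
```

The first row works out to roughly 21% relative improvement, matching the figure quoted in the evaluation highlights.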

Main Takeaways

Complementary Strengths: OR provides calculation and stability; LLMs provide context handling and shift detection; Humans provide safety and nuanced judgment.
LLMs alone struggle with cost-calibration (balancing over/underage costs) compared to OR, but excel at identifying supply disruptions (e.g., lost orders) that OR misses.
Human-AI collaboration (Mode B) works best when the human retains final decision authority but reviews LLM reasoning, outperforming both 'Human alone' and 'AI alone'.
The 'carry-over insight' memory mechanism is crucial for the LLM to adapt to structural changes (like lead time shifts) over the long horizon.

📚 Prerequisite Knowledge

Prerequisites

Basics of supply chain management (inventory, lead times)
Large Language Model prompting strategies (chain-of-thought)
Basic probability (distributions, means)

Key Terms

OR: Operations Research—a discipline using advanced analytical methods (like mathematical optimization) to make better decisions

base-stock policy: An inventory strategy where an order is placed in every period to bring the inventory position up to a target level S

critical fractile: The ratio p / (p + h), which gives the optimal probability of satisfying demand; it balances the cost of understocking (lost profit p) against the cost of overstocking (holding cost h)
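As a worked example, the critical fractile p / (p + h) can be turned into a stocking level by taking the matching quantile of the demand distribution. The sketch below assumes normally distributed single-period demand, which is an illustrative simplification and not the paper's lost-sales, multi-period setting; the function names are hypothetical.

```python
from statistics import NormalDist


def critical_fractile(p, h):
    """Target in-stock probability given underage cost p and overage cost h."""
    return p / (p + h)


def newsvendor_level(mean, std, p, h):
    """Stock level whose in-stock probability equals the critical fractile.

    Assumes demand ~ Normal(mean, std) for a single period (illustration only).
    """
    return NormalDist(mean, std).inv_cdf(critical_fractile(p, h))


print(critical_fractile(p=4, h=1))  # 0.8
print(round(newsvendor_level(mean=100, std=20, p=4, h=1), 1))
```

When p is much larger than h, the fractile approaches 1 and the target level moves well above mean demand; when p equals h, the fractile is 0.5 and the target equals the mean.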

lead time: The delay between placing an order and receiving the goods

lost sales: A setting where unsatisfied demand is lost forever (the customer goes elsewhere) rather than backlogged

carry-over insight: A memory mechanism where the agent writes a short text summary of learned environment dynamics (e.g., 'demand is trending up') to include in the next step's context