STRIDE: A Systematic Framework for Selecting AI Modalities -- Agentic AI, AI Assistants, or LLM Calls

📝 Paper Summary

Agentic Architecture Design, Task Complexity Analysis

STRIDE is a design-time framework that systematically analyzes task complexity and dynamism to recommend whether a problem requires a simple LLM call, a guided assistant, or a fully autonomous agent.

Core Problem

Organizations currently deploy expensive, risky autonomous agents indiscriminately for tasks that could be solved by simpler methods, leading to over-engineering and governance issues.

Why it matters:

Overusing agents wastes compute resources and increases latency for simple queries.
Unnecessary agent autonomy introduces security risks like data leaks and system instability (e.g., recursive loops).
There is a lack of principled, evidence-based frameworks for deciding 'necessity' at design time; most choices are intuition-driven.

Concrete Example: A task like 'Generate a random greeting message' has output variability due to model stochasticity, but does not require an agent. Current approaches might mistake this variability for complexity and deploy an agent, whereas STRIDE identifies it as model-induced and recommends a stateless LLM call.

Key Novelty

Systematic Task Reasoning Intelligence Deployment Evaluator (STRIDE)

A 'shift-left' decision framework that operates at design time rather than deployment time, preventing over-engineering before code is written.
Introduces a 'True Dynamism Score' that distinguishes between variability caused by the model (randomness), by tools (API volatility), and by the workflow itself (conditional branching); only workflow-driven variability justifies a full agent.
Calculates an Agentic Suitability Score (ASS) based on reasoning depth, tool needs, state requirements, and risk.

Architecture

The STRIDE workflow pipeline illustrating the process from input to recommendation.

Evaluation Highlights

Achieved 92% accuracy in modality selection across 30 real-world tasks in SRE and compliance domains.
Reduced unnecessary agent deployments by 45% compared to baseline intuition-driven choices.
Cut resource costs by 37% by routing simpler tasks to less expensive modalities.

Breakthrough Assessment

7/10

Significant practical contribution for enterprise AI adoption. While not a new model architecture, it provides a much-needed formal methodology for architectural decision-making, addressing the 'agent bloat' problem effectively.

⚙️ Technical Details

Problem Definition

Setting: Classification problem mapping a task description T to a modality M ∈ {LLM Call, AI Assistant, Agentic AI}

Inputs: Free-form task descriptions, input/output specifications, and potential tool dependencies

Outputs: Recommended AI Modality (LLM Call, Assistant, or Agent) and an Agentic Suitability Score
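
The problem definition above can be written down as a small set of types. This is a minimal sketch only: the class and field names (`TaskSpec`, `Recommendation`, `io_spec`, and so on) are hypothetical and chosen for illustration, not taken from the paper.

```python
from dataclasses import dataclass, field
from enum import Enum


class Modality(Enum):
    """The three candidate modalities M named in the problem setting."""
    LLM_CALL = "LLM Call"
    AI_ASSISTANT = "AI Assistant"
    AGENTIC_AI = "Agentic AI"


@dataclass
class TaskSpec:
    """Input T: free-form description plus optional I/O spec and tools (hypothetical fields)."""
    description: str
    io_spec: str = ""
    tools: list[str] = field(default_factory=list)


@dataclass
class Recommendation:
    """Output: the recommended modality and its Agentic Suitability Score."""
    modality: Modality
    suitability_score: float


# Example: the stateless greeting task from the summary, with an illustrative score.
greeting = TaskSpec(description="Generate a random greeting message")
rec = Recommendation(modality=Modality.LLM_CALL, suitability_score=0.0)
```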

Pipeline Flow

Input Task Description → Task Decomposition (LLM-based) → Subtask Scoring (Reasoning, Tools, Dynamism) → Aggregation → Modality Recommendation
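
The flow above is a straight composition of four stages, which can be sketched as follows. The stage functions here are toy stand-ins and assumptions for illustration: the paper uses an LLM for decomposition and its own scoring formulas, and the keyword heuristic, the mean aggregation, and the 0.3 / 0.7 cut-offs below are invented placeholders.

```python
from statistics import mean
from typing import Callable, Sequence


def run_stride_pipeline(
    task: str,
    decompose: Callable[[str], Sequence[str]],
    score: Callable[[str], float],
    aggregate: Callable[[Sequence[float]], float],
    recommend: Callable[[float], str],
) -> str:
    """Decompose -> score each subtask -> aggregate -> recommend a modality."""
    subtasks = decompose(task)
    scores = [score(s) for s in subtasks]
    return recommend(aggregate(scores))


# Toy stand-ins (assumptions, not the paper's components).
def toy_decompose(task: str) -> list[str]:
    return [part.strip() for part in task.split(" then ")]


def toy_score(subtask: str) -> float:
    # Pretend any subtask that mentions an API needs tool use.
    return 1.0 if "api" in subtask.lower() else 0.0


def toy_recommend(total: float) -> str:
    if total >= 0.7:
        return "Agentic AI"
    if total >= 0.3:
        return "AI Assistant"
    return "LLM Call"


result = run_stride_pipeline(
    "Call the billing API then summarize the response",
    toy_decompose, toy_score, mean, toy_recommend,
)
```

Passing the stages in as callables keeps the skeleton independent of how each stage is actually implemented.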

System Modules

Task Decomposer (Analysis)

Transforms free-form descriptions into a DAG of subtasks using a fine-tuned LLM

Model or implementation: Fine-tuned LLM (specific model not named)
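
To make the decomposer's output concrete, the sketch below represents a DAG of subtasks as a mapping from each subtask to its prerequisites and orders it with Python's standard-library `graphlib`. The subtask names are hypothetical (an invented incident-triage example); the paper's decomposer would produce the graph from an LLM, which is not shown here.

```python
from graphlib import CycleError, TopologicalSorter

# Each key is a subtask; its value is the set of subtasks it depends on.
# Hypothetical SRE-style example, not taken from the paper.
subtask_dag = {
    "fetch_alerts": set(),
    "fetch_deploy_log": set(),
    "correlate": {"fetch_alerts", "fetch_deploy_log"},
    "draft_summary": {"correlate"},
}

# static_order() yields every subtask after all of its prerequisites.
execution_order = list(TopologicalSorter(subtask_dag).static_order())


def is_valid_dag(graph: dict[str, set[str]]) -> bool:
    """Return False if the decomposition accidentally contains a cycle."""
    try:
        list(TopologicalSorter(graph).static_order())
        return True
    except CycleError:
        return False
```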

Scoring Engine (Analysis)

Computes the Agentic Suitability Score (ASS) for each subtask based on reasoning, tools, state, and risk

Model or implementation: Mathematical formula weighting R(s), T(s), S(s), and ρ(s)
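
One plausible reading of "a formula weighting R(s), T(s), S(s), and ρ(s)" is a linear weighted sum. The sketch below assumes that form, assumes each component is normalized to [0, 1], and reuses the example weights listed under Key Hyperparameters (0.4, 0.3, 0.2, 0.1); the exact formula in the paper may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubtaskFeatures:
    reasoning: float  # R(s): reasoning depth, assumed in [0, 1]
    tools: float      # T(s): tool needs, assumed in [0, 1]
    state: float      # S(s): state requirements, assumed in [0, 1]
    risk: float       # rho(s): risk, assumed in [0, 1]


# Example weights from the summary's hyperparameter list.
WEIGHTS = {"reasoning": 0.4, "tools": 0.3, "state": 0.2, "risk": 0.1}


def agentic_suitability(f: SubtaskFeatures, w: dict[str, float] = WEIGHTS) -> float:
    """Assumed linear form: w_r*R(s) + w_t*T(s) + w_s*S(s) + w_rho*rho(s)."""
    return (
        w["reasoning"] * f.reasoning
        + w["tools"] * f.tools
        + w["state"] * f.state
        + w["risk"] * f.risk
    )


# Illustrative subtask: deep reasoning, moderate tool use, no state, no risk.
score = agentic_suitability(SubtaskFeatures(1.0, 0.5, 0.0, 0.0))
```

Because the example weights sum to 1, the score stays in [0, 1] whenever the components do.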

Dynamism Analyzer (Analysis)

Calculates True Dynamism Score (TDS) to distinguish workflow variability from model/tool noise

Model or implementation: Rule-based logic: TDS(s) = W(s) - (V(s) + M(s))
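
The rule above is a plain subtraction and can be sketched directly. Reading W(s) as workflow variability, V(s) as tool volatility, and M(s) as model variability is an interpretation based on the Key Novelty description, and the numeric inputs below are invented for illustration.

```python
def true_dynamism(workflow: float, tool: float, model: float) -> float:
    """TDS(s) = W(s) - (V(s) + M(s)), as stated in the summary."""
    return workflow - (tool + model)


# The greeting task: output varies only because the model samples differently.
greeting_tds = true_dynamism(workflow=0.0, tool=0.0, model=0.8)

# A hypothetical incident-response subtask with real conditional branching.
incident_tds = true_dynamism(workflow=0.9, tool=0.2, model=0.1)

# Only a positive score signals variability that comes from the workflow itself.
greeting_is_dynamic = greeting_tds > 0
incident_is_dynamic = incident_tds > 0
```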

Self-Reflection Assessor (Analysis)

Determines if error recovery or meta-cognition is needed based on conditional branches and nondeterministic tools

Model or implementation: Decision rule: SR(s) = 1 if C(s) + N(s) + V(s) > θ
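
The decision rule translates into a one-line threshold check. In this sketch C(s) is read as a conditional-branch signal and N(s) as a nondeterministic-tool signal, following the module description; V(s) is passed through as given because the summary does not define it in this context, and the default threshold of 1.0 is an assumption.

```python
def needs_self_reflection(c: float, n: float, v: float, theta: float = 1.0) -> int:
    """SR(s) = 1 if C(s) + N(s) + V(s) > theta, else 0. theta=1.0 is an assumed default."""
    return 1 if (c + n + v) > theta else 0


# Hypothetical subtask with branching and a flaky tool: reflection is flagged.
flagged = needs_self_reflection(c=0.6, n=0.5, v=0.2)

# Hypothetical linear subtask with deterministic tools: no reflection needed.
not_flagged = needs_self_reflection(c=0.0, n=0.0, v=0.1)
```

Note that the inequality is strict, so a sum exactly equal to the threshold does not trigger reflection.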

Modality Classifier

Aggregates subtask scores and queries knowledge base to output final recommendation

Model or implementation: Classifier f(x_T, K)
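
The summary gives the classifier only as f(x_T, K), so the sketch below is one hypothetical instantiation rather than the paper's method. It assumes the aggregated input is the mean subtask ASS plus the largest subtask TDS, and that the knowledge base K supplies two thresholds; the names `assistant_threshold` and `agent_threshold` and their values are invented.

```python
from enum import Enum
from statistics import mean


class Modality(Enum):
    LLM_CALL = "LLM Call"
    AI_ASSISTANT = "AI Assistant"
    AGENTIC_AI = "Agentic AI"


# Stand-in for the knowledge base K (hypothetical keys and values).
KNOWLEDGE_BASE = {"assistant_threshold": 0.3, "agent_threshold": 0.7}


def classify(ass_scores: list[float], tds_scores: list[float],
             kb: dict[str, float] = KNOWLEDGE_BASE) -> Modality:
    """Aggregate subtask scores and map them to a modality."""
    ass = mean(ass_scores)
    # Require genuine workflow dynamism (TDS > 0) before recommending an agent.
    workflow_dynamic = max(tds_scores) > 0
    if ass >= kb["agent_threshold"] and workflow_dynamic:
        return Modality.AGENTIC_AI
    if ass >= kb["assistant_threshold"]:
        return Modality.AI_ASSISTANT
    return Modality.LLM_CALL
```

Gating the agent branch on a positive TDS mirrors the summary's point that model or tool noise alone should not push a task toward full autonomy.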

Novel Architectural Elements

Dynamism Attribution logic that mathematically subtracts model/tool noise from workflow complexity to prevent false positives for agent necessity
Integrated scoring pipeline combining static graph analysis (DAG) with dynamic requirement assessment (Reflection/State)

Modeling

Base Model: Fine-tuned LLM (specific architecture/size not reported in paper) used for decomposition

Training Method: Grid search optimization and Reinforcement Learning (RL)

Objective Functions:

Purpose: Calibrate weighting system for scoring.

Formally: Grid search on historical data followed by RL from deployment outcomes.
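
The grid-search half of this calibration can be sketched as an exhaustive sweep over weight vectors that sum to 1. Everything concrete here is an assumption: the labelled examples are invented, the label is simplified to a binary "needs an agent" flag, the score is assumed to be a linear weighted sum, and the 0.5 decision threshold is arbitrary. The RL stage is not shown.

```python
from itertools import product

# (R, T, S, rho) features with a binary "needs agent" label. Invented data.
EXAMPLES = [
    ((1.0, 0.0, 0.0, 0.0), True),
    ((0.0, 1.0, 0.0, 0.0), False),
    ((0.8, 0.2, 0.5, 0.0), True),
    ((0.1, 0.9, 0.0, 0.1), False),
]


def grid_search(examples, steps: int = 10, threshold: float = 0.5):
    """Try every weight vector on a 1/steps grid that sums to 1; keep the most accurate."""
    best_acc, best_w = -1.0, None
    for wr, wt, ws in product(range(steps + 1), repeat=3):
        wp = steps - wr - wt - ws  # the remainder goes to the risk weight
        if wp < 0:
            continue
        w = tuple(x / steps for x in (wr, wt, ws, wp))
        correct = sum(
            (sum(wi * fi for wi, fi in zip(w, feats)) >= threshold) == label
            for feats, label in examples
        )
        acc = correct / len(examples)
        if acc > best_acc:
            best_acc, best_w = acc, w
    return best_acc, best_w


best_accuracy, best_weights = grid_search(EXAMPLES)
```

Iterating over integer grid steps and dividing afterwards avoids accumulating floating-point error in the sum-to-one constraint.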

Key Hyperparameters:

reasoning_weight (wr): 0.4 (example for itinerary planning)
tool_weight (wt): 0.3 (example for API-heavy workflows)
state_weight (ws): 0.2 (example for multi-turn)
risk_weight (wρ): 0.1 (example for compliance)

Compute: Not reported in the paper

Comparison to Prior Work

vs. AgentBoard: STRIDE focuses on design-time necessity assessment rather than post-deployment performance evaluation
vs. CrewAI: STRIDE is a decision framework to decide *if* an agent is needed, whereas CrewAI is an implementation framework
vs. AutoGen [not cited in paper]: AutoGen focuses on multi-agent conversation patterns; STRIDE focuses on the prior question of whether an agent is needed at all, or whether an LLM call or assistant suffices

Limitations

Relies on the quality of the initial task description; ambiguous descriptions may lead to incorrect decomposition.
Weights for the scoring equation (wr, wt, etc.) require calibration via grid search/RL, which assumes availability of historical labeled data.
The specific LLM used for the decomposition step is not specified, hindering reproducibility.

Reproducibility

No replication artifacts mentioned in the paper (code, weights, or data not provided).

📊 Experiments & Results

Evaluation Setup

Validation on real-world enterprise tasks

Benchmarks:

30 Real-world Tasks (Enterprise Automation) [New]

Metrics:

Modality Selection Accuracy
Reduction in unnecessary agent deployments
Resource cost reduction
Expert alignment improvement

Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
30 Real-world Tasks	Modality Selection Accuracy	Not reported in the paper	92%	Not reported in the paper
30 Real-world Tasks	Unnecessary Agent Deployments (indexed, baseline = 100)	100	55	-45
30 Real-world Tasks	Resource Cost (indexed, baseline = 100)	100	63	-37
30 Real-world Tasks	Expert Alignment Improvement	Not reported in the paper	27%	Not reported in the paper

Main Takeaways

Current industry practice significantly over-deploys autonomous agents for tasks that do not require them.
Distinguishing 'workflow variability' (actual complexity) from 'model/tool variability' (noise) is critical for preventing over-engineering.
Structured task decomposition at design time effectively predicts the necessity of complex runtime architectures.

📚 Prerequisite Knowledge

Prerequisites

Understanding of LLM capabilities vs. Agentic workflows
Familiarity with Directed Acyclic Graphs (DAGs) for task decomposition
Basic knowledge of API orchestration and tool use

Key Terms

Agentic AI: Autonomous systems that decompose tasks, orchestrate tools, and adapt plans with minimal oversight (highest complexity)

AI Assistant: Systems handling guided multi-step workflows with short-term context and limited tool access, requiring human oversight

LLM Call: Stateless, single-turn inference without memory or tools

DAG: Directed Acyclic Graph—a structure used here to map subtasks and their dependencies

True Dynamism Score: A metric isolating workflow-driven variability (branching, environmental changes) from simple model randomness or tool volatility

Agentic Suitability Score: A quantitative score aggregating reasoning depth, tool needs, state requirements, and risk to determine the necessary level of autonomy

Shift-left: Moving decision-making earlier in the software development lifecycle (e.g., to the design phase) to prevent issues later

SRE: Site Reliability Engineering—a discipline incorporating aspects of software engineering and applying them to infrastructure and operations problems