SoK: Agentic Skills -- Beyond Tool Use in LLM Agents

📝 Paper Summary

Agentic reasoning, Procedural memory, Agent security

This paper formalizes agentic skills as reusable procedural memory modules with explicit applicability and termination logic, systematically mapping their lifecycle, design patterns, and security risks.

Core Problem

Agents re-derive execution strategies for recurring tasks from scratch, because procedural knowledge disappears when the context window clears.

Why it matters:

Repeating the same reasoning process for identical tasks wastes computational resources (tokens) and increases latency
Ad-hoc planning is less reliable than executing verified, curated procedures (skills)
Lack of standardized skill definitions creates security vulnerabilities, such as unmanaged supply-chain risks in agent marketplaces

Concrete Example: A coding agent that has successfully debugged a null-pointer exception 100 times will approach the 101st instance as a novel problem, re-generating the plan from scratch rather than retrieving a known debugging procedure.

Key Novelty

Formalization and Systematization of Agentic Skills

Redefines skills not as simple tools but as 4-tuple modules S=(C,π,T,R): Applicability conditions C (when to use), Policy π (how to act), Termination T (when to stop), and Interface R (how to call)
Establishes a 7-stage lifecycle model (Discovery to Update) and a taxonomy of 7 design patterns for how skills are packaged and executed in real systems

Architecture

The formal 4-component architecture of an Agentic Skill
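
The four components can be pictured as a small data structure. The following is a minimal Python sketch, not the paper's implementation: the class name `Skill`, its field names, the type aliases, and the example `debug_null_pointer` instance (loosely modeled on the null-pointer scenario above) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict

# Illustrative types only; the summary does not prescribe concrete representations.
Observation = Dict[str, Any]
Goal = str
Action = str


@dataclass(frozen=True)
class Skill:
    """Hypothetical container for the tuple S = (C, pi, T, R)."""
    applicable: Callable[[Observation, Goal], bool]  # C: when to use
    policy: Callable[[Observation], Action]          # pi: how to act
    terminated: Callable[[Observation], bool]        # T: when to stop
    interface: Dict[str, str]                        # R: how to call


# Hypothetical skill: every action name and observation key is made up.
debug_null_pointer = Skill(
    applicable=lambda obs, goal: "NullPointerException" in obs.get("error", ""),
    policy=lambda obs: (
        "add_null_check" if obs.get("trace_read") else "inspect_stack_trace"
    ),
    terminated=lambda obs: obs.get("tests_pass", False),
    interface={"name": "debug_null_pointer", "params": "error: str"},
)
```

Keeping `applicable` and `terminated` as separate fields, rather than burying them inside `policy`, is what lets a caller inspect or override them independently.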

Evaluation Highlights

Curated skills increase agent pass rates by 16.2 percentage points on average compared to agents without skills (SkillsBench)
Self-generated skills degrade performance by 1.3 percentage points, often encoding incorrect or overly specific heuristics
Identified nearly 1,200 malicious skills in the ClawHavoc campaign case study, demonstrating the scale of supply-chain risks

Breakthrough Assessment

9/10

A comprehensive foundational work (SoK) that establishes the formal definitions, taxonomies, and governance models necessary to move agents from ad-hoc planning to robust, reusable procedural memory.

⚙️ Technical Details

Problem Definition

Setting: Formalizing procedural knowledge for autonomous agents interacting with environments via observations O, actions A, and goals G

Inputs: Task context (observations and goals)

Outputs: Selection and execution of a reusable skill module S

Pipeline Flow

Discovery (Identifying patterns)
Refinement (Practicing/Optimizing)
Distillation (Packaging into S=(C,π,T,R))
Storage (Indexing)
Retrieval (Runtime selection)
Execution (Sandboxed running)
Evaluation (Monitoring/Update)
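
The later stages of this flow (Storage, Retrieval, Execution, Evaluation) can be sketched as a tiny in-memory library. This is a simplified illustration under stated assumptions: the `SkillLibrary` class, its method names, and the first-match retrieval rule are hypothetical, and the `execute` method does no real sandboxing.

```python
from typing import Callable, Dict, List, Optional

Context = Dict[str, str]


class SkillLibrary:
    """Hypothetical in-memory store covering four of the seven stages."""

    def __init__(self) -> None:
        self._skills: Dict[str, dict] = {}
        self.outcomes: Dict[str, List[bool]] = {}

    def store(
        self,
        name: str,
        applies: Callable[[Context], bool],
        run: Callable[[Context], bool],
    ) -> None:
        """Storage: index a distilled skill by name."""
        self._skills[name] = {"applies": applies, "run": run}
        self.outcomes.setdefault(name, [])

    def retrieve(self, ctx: Context) -> Optional[str]:
        """Retrieval: first stored skill whose predicate accepts the context."""
        for name, skill in self._skills.items():
            if skill["applies"](ctx):
                return name
        return None

    def execute(self, name: str, ctx: Context) -> bool:
        """Execution (unsandboxed here), logging success for later evaluation."""
        succeeded = self._skills[name]["run"](ctx)
        self.outcomes[name].append(succeeded)
        return succeeded


library = SkillLibrary()
library.store(
    "rename_csv_columns",  # hypothetical skill name
    applies=lambda ctx: ctx.get("file_type") == "csv",
    run=lambda ctx: True,  # stand-in for the real procedure
)
selected = library.retrieve({"file_type": "csv"})
```

Real systems would likely retrieve by embedding similarity or metadata rather than first match; the logged `outcomes` stand in for whatever signal drives the Evaluation and update stage.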

System Modules

Skill Tuple

The fundamental unit of reuse

Applicability Function (C) (Skill Component)

Gating mechanism to determine if the skill fits the context

Policy (π) (Skill Component)

Execution logic mapping states to actions

Termination Condition (T) (Skill Component)

Signals completion to the caller

Novel Architectural Elements

Formal separation of Applicability (C) and Termination (T) from the Policy (π) to enable external governance and composition
Explicit 'Callable Interface' (R) component allowing skills to be invoked programmatically like software libraries
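
One way to see why this separation helps: a caller can enforce the applicability gate, the termination check, and an external step budget without touching the policy. The sketch below is an assumption-laden illustration, not the paper's runtime. The function `run_skill`, its status strings, the `step` transition function, and the toy counter task are all hypothetical; the function signature loosely plays the role of the callable interface R.

```python
from typing import Callable, Dict, List, Tuple

State = Dict[str, int]


def run_skill(
    applicable: Callable[[State], bool],  # C
    policy: Callable[[State], str],       # pi
    terminated: Callable[[State], bool],  # T
    step: Callable[[State, str], State],  # environment transition (assumed)
    state: State,
    max_steps: int = 10,                  # external budget, not part of the skill
) -> Tuple[str, State, List[str]]:
    """Run a skill under checks that live outside its policy."""
    if not applicable(state):
        return "not_applicable", state, []
    actions: List[str] = []
    for _ in range(max_steps):
        if terminated(state):
            return "done", state, actions
        action = policy(state)
        actions.append(action)
        state = step(state, action)
    status = "done" if terminated(state) else "budget_exceeded"
    return status, state, actions


# Toy task: increment a counter until it reaches a target.
status, final_state, actions = run_skill(
    applicable=lambda s: s["count"] < s["target"],
    policy=lambda s: "increment",
    terminated=lambda s: s["count"] >= s["target"],
    step=lambda s, a: {**s, "count": s["count"] + 1},
    state={"count": 0, "target": 3},
)
```

Because the budget and both checks sit in the runner, a governance layer could tighten them (for example, lowering `max_steps`) without re-auditing the policy itself.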

Modeling

Base Model: Evaluated on various LLMs (specific model names for SkillsBench experiments not detailed in text excerpt)

Comparison to Prior Work

vs. Toolformer: Skills are multi-step procedures with internal logic (C, T) rather than atomic API calls
vs. Voyager: Generalizes the skill definition beyond Minecraft code to a universal tuple applicable to NL and hybrid policies
vs. ReAct: Skills are persistent and reusable across sessions, whereas ReAct traces are typically ephemeral
vs. Plans: Skills are executable and governable modules, while plans are static reasoning artifacts

Limitations

Self-generated skills often degrade performance due to overfitting or incorrect heuristics
Unsupervised skill discovery remains a key open challenge; most systems rely on curricula or demonstrations
Significant security risks exist in marketplace distribution (e.g., ClawHavoc case study)

Reproducibility

The paper is a Systematization of Knowledge (SoK). It references 'SkillsBench' [24] as a benchmark, but the specific URL for the benchmark code is not provided in the text snippet.

📊 Experiments & Results

Evaluation Setup

Validation via case studies (ClawHavoc) and benchmark performance (SkillsBench)

Benchmarks:

SkillsBench: agentic capability evaluation (pass rates) [New]
ClawHavoc Case Study: security/malware analysis in an agent marketplace [New]

Metrics:

Pass Rate (Success Rate)
Number of malicious skills detected
Statistical methodology: Not explicitly reported in the paper

Key Results

SkillsBench results demonstrate the value of curation over generation, and the security analysis of the ClawHavoc campaign reveals significant supply-chain vulnerabilities.

Benchmark	Condition	Metric	Baseline	Reported	Δ
SkillsBench	Curated skills	Pass rate change vs. no skills (percentage points)	0.0	+16.2	+16.2
SkillsBench	Self-generated skills	Pass rate change vs. no skills (percentage points)	0.0	-1.3	-1.3
ClawHavoc Case Study	Marketplace skills	Malicious skills identified	0	nearly 1,200	nearly +1,200
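
The SkillsBench deltas are differences in pass rate measured in percentage points, not relative improvements. As a sketch of the arithmetic, with the symbols p_skills and p_none introduced here for the pass rates (as fractions) with and without skills:

```latex
\Delta_{\text{pp}} = 100\,\bigl(p_{\text{skills}} - p_{\text{none}}\bigr),
\qquad
\text{relative gain} = \frac{p_{\text{skills}} - p_{\text{none}}}{p_{\text{none}}}
```

Since the underlying pass rates are not given in this summary, the +16.2 and -1.3 figures cannot be converted into relative gains here.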

Experiment Figures

The 7-stage Skill Lifecycle Model

Main Takeaways

Procedural memory (skills) acts as an efficiency multiplier, allowing smaller models with curated skills to potentially outperform larger models without them
Reliability of skills is heavily dependent on curation; self-generated skills currently lack robust verification and often harm performance
The 'Skill Supply Chain' is a major new attack surface, as evidenced by the mass infiltration of malicious skills in marketplaces

📚 Prerequisite Knowledge

Prerequisites

Large Language Model (LLM) Agents
Reinforcement Learning (RL) concepts (policies, options framework)
Software Engineering concepts (modularity, interfaces)
Cognitive Science (procedural vs. declarative memory)

Key Terms

Agentic Skill: A reusable, callable module encapsulating a sequence of actions or policies to achieve a class of goals, distinct from atomic tools or one-off plans

Procedural Memory: Memory of 'how' to do things (skills/procedures) rather than 'what' happened (episodic) or facts (semantic)

SoK: Systematization of Knowledge, a type of research paper that organizes, classifies, and analyzes existing work rather than proposing a single new method

ClawHavoc: A specific security campaign analyzed in the paper where malicious skills infiltrated an agent marketplace

Prompt Injection: A security attack where malicious instructions are hidden in input data to manipulate the model's behavior

Supply-chain risk: Vulnerabilities arising from using third-party skills or plugins whose internal logic or dependencies may be compromised

Applicability Condition: A logic gate or predicate determining if a specific skill is valid for the current observation and goal