Problem Definition
Setting: Generalist agents interacting with digital environments to solve software and web-based tasks
Inputs: Natural language task description and an initial environment state
Outputs: Sequence of actions (code execution, browsing, file editing) leading to a task solution
Pipeline Flow
- Agent Strategy (Step Function)
- Action Generation
- Runtime Execution (Docker)
- Observation Feedback (returned to the agent as input to its next step)
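The four stages above form a loop, which can be sketched roughly as follows. This is a minimal illustration, not the framework's actual code: `Action`, `Observation`, `State`, `toy_step`, `local_runtime`, and `run_agent` are all hypothetical names, the step function is a hard-coded stand-in for an LLM call, and the runtime executes Python in-process rather than inside Docker.

```python
import contextlib
import io
from dataclasses import dataclass, field


@dataclass
class Action:
    kind: str        # "run_code" or "finish" (hypothetical action kinds)
    code: str = ""


@dataclass
class Observation:
    output: str


@dataclass
class State:
    task: str
    history: list = field(default_factory=list)  # (Action, Observation) pairs


def toy_step(state: State) -> Action:
    """Stand-in for the agent strategy; a real agent would query an LLM here."""
    if not state.history:
        return Action("run_code", "print(6 * 7)")
    return Action("finish")


def local_runtime(action: Action) -> Observation:
    """Stand-in for the sandbox: runs the code in-process and captures stdout."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(action.code, {})
    return Observation(buffer.getvalue())


def run_agent(task: str, step_fn, runtime, max_steps: int = 10) -> State:
    """Strategy -> action -> execution -> observation, repeated until finish."""
    state = State(task)
    for _ in range(max_steps):
        action = step_fn(state)
        if action.kind == "finish":
            break
        observation = runtime(action)
        state.history.append((action, observation))
    return state


final_state = run_agent("compute 6 * 7", toy_step, local_runtime)
```

The point of the sketch is that the agent only ever sees the state and the observations appended to it, so swapping `local_runtime` for a containerized executor would not change the loop.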
System Modules
Agent
Perceives state and generates actions via a step function
Model or implementation: Various (e.g., CodeActAgent, BrowsingAgent, GPTSwarm)
Action Execution API
Executes each action inside the sandboxed Docker runtime and returns the resulting observation
Model or implementation: REST API Server inside Docker
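As a rough illustration of what talking to such a server could look like from the agent side, the sketch below builds (but does not send) an HTTP request carrying a serialized action, and decodes a JSON observation. The `/execute_action` path, the payload fields, and the helper names `build_action_request` and `parse_observation` are assumptions made for this example; the summary does not specify the real endpoint or schema.

```python
import json
import urllib.request


def build_action_request(base_url: str, action: dict) -> urllib.request.Request:
    """Wrap an action dict in a JSON POST request (endpoint path is hypothetical)."""
    url = base_url.rstrip("/") + "/execute_action"
    body = json.dumps(action).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_observation(response_body: bytes) -> str:
    """Extract the observation text from a JSON response (field name is assumed)."""
    payload = json.loads(response_body.decode("utf-8"))
    return payload.get("content", "")


# Example: a shell-command action, using an assumed payload layout.
request = build_action_request(
    "http://localhost:8000",
    {"action": "run", "args": {"command": "ls"}},
)
```

Sending the request would be a call to `urllib.request.urlopen(request)` against a running sandbox server, which is omitted here so the example stays self-contained.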
AgentSkills Library
Provides specialized utilities that are hard for an LLM to write correctly on the fly
Model or implementation: Python Package
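To make the idea of a prepackaged skill concrete, here is a small sketch of the kind of utility such a package might contain: a helper that shows a numbered window of lines around a target line, so the agent does not have to re-derive the clamping logic each time. The function name `format_window` and its output format are hypothetical and are not taken from the actual library.

```python
def format_window(text: str, center: int, radius: int = 2) -> str:
    """Return a numbered window of lines around `center` (1-indexed).

    Hypothetical skill: clamps the window to the file bounds and prefixes
    a header so the agent knows which slice it is looking at.
    """
    lines = text.splitlines()
    if not lines:
        return "(empty file)"
    # Clamp the requested line into the valid range before windowing.
    center = max(1, min(center, len(lines)))
    start = max(1, center - radius)
    end = min(len(lines), center + radius)
    header = f"[lines {start}-{end} of {len(lines)}]"
    body = [f"{n}|{lines[n - 1]}" for n in range(start, end + 1)]
    return "\n".join([header, *body])


sample = "a\nb\nc\nd\ne\nf"
window = format_window(sample, center=3, radius=1)
```

An agent could import a helper like this inside a code action instead of regenerating it, which is the motivation the section gives for shipping skills as an ordinary Python package.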
Novel Architectural Elements
- Event Stream State: Records the full interaction history, including multi-agent delegation metadata and LLM costs, in a single unified stream
- Standardized Action Primitives: Relies on general-purpose 'RunCode' and 'RunCommand' actions rather than rigid JSON tool definitions, so agents can write their own tools as code
- Multi-Agent Delegation: Implements `AgentDelegateAction`, which lets a generalist agent offload subtasks to specialized agents (e.g., a browsing agent)
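The event stream and delegation ideas above can be sketched together. This is an illustrative toy under stated assumptions, not the framework's implementation: `Event`, `EventStream`, `delegate`, and `fake_browsing_agent` are hypothetical names, and the cost values are made up for the example. Only the name `AgentDelegateAction` comes from the text, and it is represented here by a plain event rather than a real class.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional, Tuple


@dataclass
class Event:
    source: str                         # name of the agent that produced the event
    kind: str                           # "action" or "observation"
    content: str
    llm_cost: float = 0.0               # cost attributed to this event (assumed unit: USD)
    delegated_by: Optional[str] = None  # parent agent, if produced under delegation


@dataclass
class EventStream:
    events: list = field(default_factory=list)

    def add(self, event: Event) -> None:
        self.events.append(event)

    def total_cost(self) -> float:
        """Sum LLM cost across every agent that wrote to the stream."""
        return sum(e.llm_cost for e in self.events)

    def by_agent(self, name: str) -> list:
        return [e for e in self.events if e.source == name]


def delegate(stream: EventStream, parent: str, child: str, subtask: str,
             child_fn: Callable[[str], Tuple[str, float]]) -> str:
    """Record a delegation action, run the child, and log its result and cost."""
    stream.add(Event(parent, "action", f"delegate to {child}: {subtask}"))
    result, cost = child_fn(subtask)
    stream.add(Event(child, "observation", result, llm_cost=cost, delegated_by=parent))
    return result


def fake_browsing_agent(subtask: str) -> Tuple[str, float]:
    """Stand-in for a specialized agent; returns (result, made-up cost)."""
    return f"browsed: {subtask}", 0.5


stream = EventStream()
stream.add(Event("generalist", "action", "plan the task", llm_cost=0.25))
answer = delegate(stream, "generalist", "browser", "find the docs page", fake_browsing_agent)
```

Because the parent's delegation and the child's result land in the same stream, the total cost and the full history can be recovered from one place, which is the property the first bullet describes.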
Modeling
Base Model: Not tied to a single model; evaluated with several LLMs, including GPT-4o, Claude-3.5-Sonnet, and Llama-3