IronEngine: Towards General AI Assistant

📝 Paper Summary

Layered memory, Multi-agent, Multi-call tool use with flexible plan

IronEngine is a comprehensive local-first AI assistant framework that decouples planning from execution via a three-phase pipeline, utilizing heterogeneous models and hierarchical memory to solve fragmentation and reliability issues.

Core Problem

Current AI assistants suffer from fragmentation (disjoint tools), single-model bottlenecks (inefficient resource use), ephemeral memory (stateless sessions), and poor local deployment reliability.

Why it matters:

Users must switch between disjoint tools for different tasks (web, desktop, files) rather than using a unified interface
Single-model designs either waste compute by running large models on simple formatting or fail at complex planning when limited to small models
Lack of structured persistence forces users to re-teach preferences and workflows in every new session
Privacy-sensitive workloads require local execution, but most frameworks leave VRAM management for multiple models on consumer hardware unaddressed

Concrete Example: A small 3.8B parameter tool model might generate valid JSON but specify the wrong tool type (e.g., 'web_read' instead of 'web_search'). Standard systems fail outright, whereas a robust system should detect the semantic mismatch and redirect the request automatically.

Key Novelty

Unified Orchestration with Heterogeneous Model Allocation

Decouples cognitive roles into a three-phase pipeline (Discussion, Model Switch, Execution), assigning different model sizes to Planner (reasoning), Reviewer (quality gate), and Executor (tool use)
Implements a VRAM-aware model lifecycle that dynamically loads and unloads models on a single GPU to overcome hardware constraints
Features a dual-merge hierarchical memory system that combines fast hash-based deduplication with model-based daily consolidation for long-term retention

Evaluation Highlights

Alias normalization and automatic error correction reduce tool dispatch failures by an order of magnitude relative to direct model routing
Integrates 24 tool categories under one orchestration core, implemented as a 46,690-line codebase across 97 source files
Demonstrates desktop automation capability where standard accessibility approaches fail (e.g., WeChat) by falling back to visual analysis

Breakthrough Assessment

7/10

Strong systems engineering contribution addressing practical deployment issues (VRAM, fragmentation) often ignored in pure research. Novelty lies in the integration and lifecycle management rather than new model architectures.

⚙️ Technical Details

Problem Definition

Setting: General-purpose local AI assistant for desktop automation, web browsing, and file manipulation

Inputs: Natural language user requests (text/voice)

Outputs: Executed actions (GUI control, API calls, file ops) and natural language responses

Pipeline Flow

Discussion Phase (Planner + Reviewer loop)
Model Switch (VRAM management)
Execution Phase (Executor + Tools)
Memory Consolidation (Background)
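
The flow above can be sketched as a small control loop. This is a minimal sketch under stated assumptions, not the paper's implementation: the `plan`, `review`, `switch_models`, and `execute` callables, the 0.7 approval threshold, and the three-round limit are all hypothetical. The source only states that the Reviewer assigns a score between 0.0 and 1.0. Background memory consolidation is left out.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Illustrative assumptions, not values from the paper.
APPROVAL_THRESHOLD = 0.7  # hypothetical cut-off on the Reviewer's 0.0-1.0 score
MAX_ROUNDS = 3            # hypothetical limit on Planner/Reviewer iterations


@dataclass
class PipelineResult:
    plan: str
    score: float
    rounds: int
    outputs: List[str]


def run_pipeline(
    request: str,
    plan: Callable[[str, str], str],             # (request, feedback) -> plan text
    review: Callable[[str], Tuple[float, str]],  # plan -> (score, feedback)
    switch_models: Callable[[], None],           # unload discussion models, load executor
    execute: Callable[[str], List[str]],         # plan -> tool outputs
) -> PipelineResult:
    feedback = ""
    for round_no in range(1, MAX_ROUNDS + 1):
        # Discussion phase: the Planner drafts, the Reviewer scores.
        draft = plan(request, feedback)
        score, feedback = review(draft)
        if score >= APPROVAL_THRESHOLD:
            # Model switch, then execution phase.
            switch_models()
            return PipelineResult(draft, score, round_no, execute(draft))
    raise RuntimeError("plan not approved within the round limit")


# Stub models standing in for real Planner/Reviewer/Executor backends.
def demo_plan(request: str, feedback: str) -> str:
    return f"1. search for {request}" + (" 2. summarize" if feedback else "")


def demo_review(plan_text: str) -> Tuple[float, str]:
    if "summarize" in plan_text:
        return 0.9, ""
    return 0.4, "add a summary step"


result = run_pipeline(
    "GPU prices", demo_plan, demo_review, lambda: None, lambda p: [f"ran: {p}"]
)
print(result.rounds, result.score)  # 2 0.9
```

With the stubs, the first draft is rejected, the Reviewer's feedback is passed to the Planner, and the second draft is approved before the model switch takes place.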

System Modules

Planner (Discussion Phase)

Decomposes user request into a textual plan

Model or implementation: Large capable model (e.g., 14B+)

Reviewer (Discussion Phase)

Evaluates plan quality, safety, and feasibility; assigns numerical score (0.0-1.0)

Model or implementation: Distinct model (potentially smaller or fine-tuned for critique)

Executor (Execution Phase)

Translates approved plan into specific tool calls

Model or implementation: Tool-specialized model (often smaller/faster)

Tool Router (Execution Phase)

Dispatches tool calls to actual implementations with error correction

Model or implementation: Rule-based logic + heuristic matching
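
A rule-based router of this kind might look like the following. It is a minimal sketch, assuming a small hypothetical alias table and a single correction heuristic: the source reports 130+ alias variants but does not list them, and the rule that redirects a URL-less `web_read` call to `web_search` is an illustration of the 'web_read' vs. 'web_search' mismatch, not the paper's documented logic.

```python
from typing import Any, Dict, Tuple

# Hypothetical alias table; the actual mappings are not given in the source.
ALIASES: Dict[str, str] = {
    "google": "web_search",
    "search": "web_search",
    "browse": "web_read",
    "open_url": "web_read",
}
KNOWN_TOOLS = {"web_search", "web_read"}


def normalize(name: str) -> str:
    """Map a model-emitted tool name to a canonical identifier."""
    key = name.strip().lower().replace("-", "_").replace(" ", "_")
    return ALIASES.get(key, key)


def route(call: Dict[str, Any]) -> Tuple[str, Dict[str, Any]]:
    """Return (canonical_tool, args), correcting one kind of semantic mismatch."""
    tool = normalize(str(call.get("tool", "")))
    args = dict(call.get("args", {}))
    if tool not in KNOWN_TOOLS:
        raise ValueError(f"unknown tool: {call.get('tool')!r}")
    # Assumed heuristic: 'web_read' needs a URL. A bare query suggests the
    # model meant 'web_search', so redirect instead of failing.
    if tool == "web_read" and "url" not in args and "query" in args:
        tool = "web_search"
    return tool, args


print(route({"tool": "Google", "args": {"query": "local LLM"}}))
print(route({"tool": "web_read", "args": {"query": "weather today"}}))
```

Both example calls resolve to `web_search`: the first through alias normalization, the second through the redirect rule.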

Novel Architectural Elements

Three-phase pipeline (Discussion → Model Switch → Execution) explicitly separating planning quality from execution capability
VRAM-aware model lifecycle management enabling sequential loading of models whose combined footprint exceeds total GPU memory
Dual-merge memory lifecycle (Merge A: hash-deduplication, Merge B: model-based daily consolidation)
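
The dual-merge idea can be sketched as two methods on a small store. This is an illustration under assumptions, not the paper's design: the `MemoryStore` class, the case-folding and whitespace normalization, the per-day grouping, and the `summarize` callable (standing in for a model call) are all hypothetical.

```python
import hashlib
from collections import defaultdict
from dataclasses import dataclass, field
from datetime import date
from typing import Callable, Dict, List, Tuple


@dataclass
class MemoryStore:
    # Hypothetical schema: content hash -> (day, original text).
    entries: Dict[str, Tuple[date, str]] = field(default_factory=dict)

    @staticmethod
    def _digest(text: str) -> str:
        # Assumed normalization: case-fold and collapse whitespace before hashing.
        canonical = " ".join(text.lower().split())
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def merge_a(self, day: date, text: str) -> bool:
        """Merge A: hash-based deduplication. Returns True if the entry was new."""
        key = self._digest(text)
        if key in self.entries:
            return False
        self.entries[key] = (day, text)
        return True

    def merge_b(self, summarize: Callable[[List[str]], str]) -> Dict[date, str]:
        """Merge B: consolidate each day's entries with a model-backed summarizer."""
        by_day: Dict[date, List[str]] = defaultdict(list)
        for day, text in self.entries.values():
            by_day[day].append(text)
        return {day: summarize(texts) for day, texts in by_day.items()}


store = MemoryStore()
today = date(2025, 1, 1)
store.merge_a(today, "User prefers dark mode")
store.merge_a(today, "user  prefers DARK mode")  # duplicate after normalization
store.merge_a(today, "User uses vim")
daily = store.merge_b(lambda texts: " | ".join(texts))  # stub for a model call
```

Merge A is cheap enough to run on every write, while Merge B is the slower model-backed pass that the source describes as running daily.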

Modeling

Base Model: Heterogeneous mix (Ollama, LM Studio backends supported)

Training Method: Inference-time orchestration (system engineering focus)

Compute: Consumer-grade hardware (single 24GB GPU mentioned as target for swapping 14B/8B models)

Comparison to Prior Work

vs. OpenClaw: IronEngine uses a 3-phase pipeline with formal Reviewer quality gate vs. OpenClaw's message-routing gateway
vs. OpenClaw: IronEngine supports multi-model collaboration within a single task vs. single model per request
vs. AutoGPT: IronEngine implements VRAM-aware lifecycle management for local models [not cited in paper]
vs. MemGPT: IronEngine integrates user ratings and contradiction detection into memory retrieval vs. virtual context paging

Limitations

Reliability of tool use heavily dependent on the specific local models loaded
Desktop automation is fragile across different UI frameworks (Qt vs. Win32)
Fixed three-phase pipeline sacrifices topological flexibility of free-form multi-agent conversations
No dynamic spawning of new agent roles at runtime

Reproducibility

Code availability is ambiguous: the paper mentions 'technical report and relevant source code... are fully, automatically designed' and references 97 source files, but no specific GitHub URL appears in the text snippet. The NiusRobotLab YouTube channel is mentioned.

📊 Experiments & Results

Evaluation Setup

Systematic evaluation of tool routing and desktop automation reliability

Benchmarks:

File Operation Benchmarks (Local file manipulation) [New]
Desktop Automation Tasks (GUI interaction, e.g., WeChat) [New]

Metrics:

Tool dispatch success rate
Task completion rate
Statistical methodology: Not explicitly reported in the paper

Main Takeaways

Alias normalization and automatic error correction reduce tool dispatch failures by an order of magnitude compared to direct model routing
Separating Planner and Executor allows using smaller, faster models for tool calls without sacrificing planning quality
The system successfully manages 24 tool categories and 130+ alias variants
VRAM-aware scheduling enables running a 14B Planner and 8B Reviewer sequentially on a 24GB GPU where simultaneous loading would fail
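
The sequential-loading behavior in the last takeaway can be sketched with a budget-tracking scheduler. Everything specific here is an assumption: the `VramScheduler` class, the least-recently-used eviction policy, the `load` and `unload` callbacks (which would wrap whatever the backend provides), and the footprints of 16 GB and 10 GB, which are placeholder values chosen so that the two models do not fit together in 24 GB.

```python
from collections import OrderedDict
from typing import Callable, List


class VramScheduler:
    """Keeps resident models within a VRAM budget, evicting least recently used first."""

    def __init__(self, budget_gb: float,
                 load: Callable[[str], None], unload: Callable[[str], None]):
        self.budget_gb = budget_gb
        self._load, self._unload = load, unload
        self._resident: "OrderedDict[str, float]" = OrderedDict()  # name -> GB

    def used_gb(self) -> float:
        return sum(self._resident.values())

    def acquire(self, name: str, footprint_gb: float) -> None:
        if footprint_gb > self.budget_gb:
            raise MemoryError(f"{name} alone exceeds the {self.budget_gb} GB budget")
        if name in self._resident:
            self._resident.move_to_end(name)  # mark as most recently used
            return
        # Evict resident models until the new one fits.
        while self.used_gb() + footprint_gb > self.budget_gb:
            victim, _ = self._resident.popitem(last=False)
            self._unload(victim)
        self._load(name)
        self._resident[name] = footprint_gb


events: List[str] = []
scheduler = VramScheduler(
    budget_gb=24.0,
    load=lambda n: events.append(f"load {n}"),
    unload=lambda n: events.append(f"unload {n}"),
)
scheduler.acquire("planner-14b", 16.0)   # hypothetical footprint
scheduler.acquire("reviewer-8b", 10.0)   # hypothetical footprint; 16 + 10 > 24
print(events)  # ['load planner-14b', 'unload planner-14b', 'load reviewer-8b']
```

Because the two placeholder footprints sum to 26 GB, the Planner is unloaded before the Reviewer is loaded, which is the sequential pattern the takeaway describes.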

📚 Prerequisite Knowledge

Prerequisites

Understanding of LLM inference and context windows
Familiarity with agentic patterns (ReAct, RAG)
Basic knowledge of OS automation (accessibility APIs, CDP)

Key Terms

MCP: Model Context Protocol—a standard for connecting AI assistants to external data and tools

CDP: Chrome DevTools Protocol—allows tools to instrument, inspect, debug and profile Chromium, Chrome and other Blink-based browsers

VRAM: Video Random Access Memory—specialized memory on GPUs used to store model weights and context during inference

UIA: UI Automation—a Windows accessibility API allowing programs to inspect and drive other applications' user interfaces

RAG: Retrieval-Augmented Generation—fetching external data to ground LLM responses

Planner: The module responsible for decomposing high-level user goals into actionable steps

Reviewer: A distinct model role that evaluates the Planner's output for quality and safety before execution

Executor: The module or model that translates approved plans into specific tool invocations

Alias Normalization: Mapping various synonyms for a tool (e.g., 'google', 'browse') to a canonical internal identifier

Quantization: Reducing the precision of model weights (e.g., 4-bit vs 16-bit) to reduce memory usage and increase speed
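
The memory effect of quantization follows from simple arithmetic: weight storage is roughly the parameter count times the bits per weight, divided by 8 to get bytes. The helper below is a back-of-the-envelope sketch with a hypothetical name. It counts weights only and ignores the KV cache, activations, and runtime overhead, and it does not reflect which precision the paper actually used.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate weight storage in decimal gigabytes (weights only)."""
    # (params_billions * 1e9 weights) * (bits / 8 bytes per weight) / 1e9 bytes per GB
    return params_billions * bits_per_weight / 8


print(weight_memory_gb(14, 16))  # 28.0
print(weight_memory_gb(14, 4))   # 7.0
print(weight_memory_gb(8, 4))    # 4.0
```

By this estimate, a 14B model at 16-bit precision would need about 28 GB for weights alone, more than a 24 GB GPU holds, while the same model at 4-bit needs about 7 GB.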