BOLAA: Benchmarking and Orchestrating LLM-augmented Autonomous Agents

📝 Paper Summary

LLM-augmented Autonomous Agents (LAAs) · Multi-agent orchestration · Agent architecture comparison

BOLAA orchestrates multiple specialist agents (e.g., search-only and click-only) under a central controller to outperform single generalist agents on complex decision-making tasks.

Core Problem

Current investigations into LLM-augmented agents (LAAs) lack comprehensive comparisons of architectures (like ReAct vs. Planning) and struggle to scale single agents to complex open-domain tasks due to context limits and hallucination.

Why it matters:

Optimal agent architecture remains undetermined, with limited understanding of how different LLM backbones perform across different agent designs.
Single agents handling multiple action types (reasoning, searching, clicking) often fail in complex environments due to divided attention and context constraints.
Existing benchmarks often fail to jointly evaluate the interplay between agent architecture and the underlying LLM backbone.

Concrete Example: In a web navigation task, a single agent must decide whether to 'click' a button or 'search' for a query. A generalist agent might hallucinate a click action when it should search. BOLAA splits this into a 'search agent' and 'click agent', ensuring each focuses only on its specific action type.

Key Novelty

BOLAA (Multi-Agent Orchestration with Controller)

Decouples complex tasks into distinct labor agents (e.g., one for searching, one for clicking) managed by a central controller.
The controller selects the most relevant labor agent for the current state and manages communication, rather than a single LLM trying to handle all action types.
Provides a unified benchmark comparing 6 distinct agent architectures (ZeroShot, ReAct, PlanAct, etc.) across multiple open-source and proprietary LLMs.

Architecture

The BOLAA architecture diagram showing the Controller and Labor Agents Pool.

Evaluation Highlights

BOLAA achieves the highest rewards on WebShop decision-making tasks compared to 5 other architectures (ReAct, PlanAct, etc.), especially with high-performing LLMs.
BOLAA with a smaller 3B model (fastchat-t5-3b) performs comparably to single-agent architectures using much larger models, demonstrating the efficiency of specialized orchestration.
Llama-2-70b performs best under the BOLAA architecture, while Llama-2-13b favors PlanAct, showing that optimal architecture depends on model size.

Breakthrough Assessment

7/10

Provides a valuable, comprehensive benchmark of agent architectures whose relative merits are often taken for granted. The results for the proposed BOLAA architecture support the 'mixture of experts/agents' intuition for complex tasks.

⚙️ Technical Details

Problem Definition

Setting: Sequential decision-making in interactive environments (WebShop, HotPotQA) using LLM-based agents.

Inputs: Task instruction (e.g., 'Find a tripod under $130') and current environment observation (simplified HTML/text).

Outputs: Executable action (e.g., 'click[button]', 'search[query]', 'finish[answer]').
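
The action strings above share a `name[argument]` shape. A minimal sketch of how such output could be parsed follows, assuming only the three action names shown in the examples; `parse_action` and `Action` are hypothetical names for illustration, not taken from the paper's code.

```python
import re
from typing import NamedTuple, Optional

# Assumption: only the three action names from the examples above are valid.
ACTION_PATTERN = re.compile(
    r"^\s*(click|search|finish)\[(.*)\]\s*$", re.IGNORECASE | re.DOTALL
)


class Action(NamedTuple):
    name: str      # "click", "search", or "finish"
    argument: str  # text between the square brackets


def parse_action(text: str) -> Optional[Action]:
    """Return the parsed action, or None if the text is not a valid action."""
    match = ACTION_PATTERN.match(text)
    if match is None:
        return None
    return Action(name=match.group(1).lower(), argument=match.group(2).strip())


print(parse_action("search[tripod under $130]"))
print(parse_action("I think I should click something"))
```

Returning `None` for unparseable text gives the caller one place to handle malformed model output, for example by re-prompting.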

Pipeline Flow

Controller receives task/observation
Controller selects specialized Labor Agent (e.g., Click Agent or Search Agent)
Selected Labor Agent generates action
Controller parses and executes action in Environment
Environment returns new observation
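
The steps above can be sketched as a simple loop. This is an illustration under stated assumptions, not the paper's implementation: `run_episode`, `ToyEnv`, and the rule-based selector are hypothetical stand-ins, and in the real system both the controller and the labor agents would be LLM calls.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical type aliases; each callable stands in for an LLM prompt.
LaborAgent = Callable[[str, str], str]            # (task, observation) -> action
Selector = Callable[[str, str, List[str]], str]   # (task, observation, history) -> agent name


class ToyEnv:
    """Stand-in environment: a search must precede a successful click."""

    def reset(self) -> str:
        self.searched = False
        return "search page"

    def step(self, action: str) -> Tuple[str, bool]:
        if action.startswith("search["):
            self.searched = True
            return "results page", False
        if action.startswith("click[") and self.searched:
            return "purchase done", True
        return "nothing happened", False


def run_episode(task: str, env: ToyEnv, select_agent: Selector,
                agents: Dict[str, LaborAgent], max_steps: int = 10) -> List[str]:
    observation = env.reset()
    history: List[str] = []
    for _ in range(max_steps):
        agent_name = select_agent(task, observation, history)  # controller picks a labor agent
        action = agents[agent_name](task, observation)         # labor agent generates the action
        observation, done = env.step(action)                   # action runs in the environment
        history.append(action)
        if done:
            break
    return history


def rule_based_selector(task: str, observation: str, history: List[str]) -> str:
    # Assumption: a fixed rule replaces the LLM-based controller decision.
    return "search" if observation == "search page" else "click"


agents: Dict[str, LaborAgent] = {
    "search": lambda task, obs: f"search[{task}]",
    "click": lambda task, obs: "click[Buy Now]",
}
print(run_episode("tripod under $130", ToyEnv(), rule_based_selector, agents))
```

The `max_steps` cap is an added safeguard so that an agent that never finishes cannot loop forever.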

System Modules

Controller

Selects which labor agent to call based on current state and history; manages communication.

Model or implementation: LLM backbone (e.g., Llama-2-70b, GPT-3.5)

Labor Agents Pool

Contains specialized agents (e.g., SearchAgent, ClickAgent) that only generate specific types of actions.

Model or implementation: LLM backbone (can be same or different from Controller)

Novel Architectural Elements

Controller-Labor architecture where labor agents are restricted to specific action types (Search vs Click), unlike standard multi-agent systems where agents might be generic but have different personas.
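
One way to picture the action-type restriction is a wrapper that accepts output only when it matches the agent's assigned type. The sketch below is an assumption for illustration: the `LaborAgent` class and its reject-on-mismatch policy are hypothetical, and the paper may enforce the restriction purely through prompting.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class LaborAgent:
    """Hypothetical wrapper that confines a generator to one action type."""
    action_type: str                # e.g. "search" or "click"
    generate: Callable[[str], str]  # stand-in for an LLM call: observation -> text

    def act(self, observation: str) -> str:
        raw = self.generate(observation).strip()
        if raw.startswith(f"{self.action_type}[") and raw.endswith("]"):
            return raw
        # Assumption: out-of-scope output is rejected rather than repaired.
        raise ValueError(f"{self.action_type} agent produced invalid action: {raw!r}")


search_agent = LaborAgent("search", lambda obs: "search[camera tripod]")
confused_agent = LaborAgent("search", lambda obs: "click[Buy Now]")

print(search_agent.act("search page"))
try:
    confused_agent.act("search page")
except ValueError as err:
    print(err)
```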

Modeling

Base Model: Evaluated multiple backbones: Llama-2 (7b/13b/70b), Vicuna (3b/13b/33b), MPT (7b/30b), OpenAI models (GPT-3.5, text-davinci-003).

Training Method: Prompt engineering and in-context learning only. No model is fine-tuned as part of the paper's core contribution; it compares off-the-shelf pre-trained and instruction-tuned models.

Key Hyperparameters:

context_length: Varies by model (e.g., 4k for Llama-2, 16k for LongChat)
prompt_style: Zero-shot or Few-shot depending on architecture variant (ZS-LAA vs ReAct)
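
The difference between the zero-shot and few-shot prompt styles can be illustrated with a small prompt builder. This is a sketch only: `build_prompt` and the prompt wording are invented for illustration and are not the prompts used in the paper.

```python
from typing import List, Optional, Tuple


def build_prompt(task: str, observation: str,
                 examples: Optional[List[Tuple[str, str]]] = None) -> str:
    """Build a zero-shot prompt, or a few-shot one when examples are given."""
    # Assumption: the instruction text below is a placeholder, not the paper's wording.
    parts = ["You are a web shopping agent. Respond with exactly one action."]
    for example_observation, example_action in examples or []:
        # Each few-shot example pairs an observation with a demonstrated action.
        parts.append(f"Observation: {example_observation}\nAction: {example_action}")
    parts.append(f"Task: {task}\nObservation: {observation}\nAction:")
    return "\n\n".join(parts)


zero_shot = build_prompt("Find a tripod under $130", "search page")
few_shot = build_prompt(
    "Find a tripod under $130",
    "search page",
    examples=[("search page", "search[wireless headphones]")],
)
print(few_shot)
```

Few-shot prompts grow with every example, so the number of demonstrations that fit is bounded by the context length of the chosen model.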

Compute: Not reported in the paper

Comparison to Prior Work

vs. ReAct: BOLAA uses multiple specialized agents managed by a controller rather than one monolithic agent loop.
vs. ReWOO: BOLAA focuses on orchestrating specialized action-type agents (search vs click) rather than just separating planning from execution.
vs. AutoGPT [not cited in paper]: BOLAA structures the multi-agent interaction hierarchically (Controller -> Labor) specifically around action types, whereas AutoGPT often spawns sub-agents dynamically for sub-goals.

Limitations

Controller overhead: Requires an extra LLM call to select the agent, potentially increasing latency and cost.
Performance depends heavily on the base LLM's ability to follow controller instructions.
Only evaluated on two environments (WebShop and HotPotQA), limiting claims of generalizability.
No statistical significance tests reported for the performance differences.

Reproducibility

Code: https://github.com/JimSalesforce/BOLAA

The code is publicly available at the repository above. The paper uses standard open benchmarks (WebShop, HotPotQA) and open models (Llama-2, Vicuna), which supports reproducibility; results for the proprietary OpenAI models depend on continued API access.

📊 Experiments & Results

Evaluation Setup

Simulation environments for decision making and reasoning.

Benchmarks:

WebShop (Web navigation / E-commerce decision making)
HotPotQA (Multi-hop Question Answering with Wikipedia API)

Metrics:

Reward (WebShop: Attribute overlap ratio; HotPotQA: F1 score)
Recall (WebShop: Whether ground truth item was retrieved)
Statistical methodology: Not explicitly reported in the paper
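
To make the two reward definitions concrete, here is a simplified sketch. It assumes whitespace tokenization for F1 (the usual HotPotQA scripts also normalize punctuation and articles) and treats the WebShop reward as a plain attribute overlap ratio, which omits other terms the full WebShop reward may include. Both function names are hypothetical.

```python
from collections import Counter
from typing import Set


def token_f1(prediction: str, ground_truth: str) -> float:
    """Token-level F1 between a predicted and a reference answer."""
    pred_tokens = prediction.lower().split()
    gold_tokens = ground_truth.lower().split()
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def attribute_overlap(selected: Set[str], target: Set[str]) -> float:
    """Fraction of target attributes covered by the selected item (simplified)."""
    if not target:
        return 0.0
    return len(selected & target) / len(target)


print(round(token_f1("barack obama", "obama"), 3))  # 0.667
print(round(attribute_overlap({"black", "aluminum"},
                              {"black", "aluminum", "lightweight"}), 3))  # 0.667
```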

Key Results

Results on WebShop show BOLAA consistently outperforming single-agent architectures across various LLM backbones.

Benchmark	Metric	Baseline	This Paper	Δ
WebShop	Reward	0.55	0.62	+0.07
WebShop	Reward	0.48	0.51	+0.03

Experiment Figures

Comparison of solo agent architectures: ZeroShot (ZS-LAA), ZeroShotThink (ZST-LAA), and ReAct LAA.

Architecture of PlanAct LAA and PlanReAct LAA.

Main Takeaways

BOLAA (multi-agent orchestration) yields the best performance on WebShop compared to solo architectures like ReAct and PlanAct, particularly with powerful models like Llama-2-70b.
Smaller models (e.g., 3B parameters) utilizing the BOLAA architecture can rival the performance of larger models using single-agent architectures, suggesting orchestration is a compute-efficient strategy.
Agent architecture must be aligned with the LLM backbone; larger models (70b) benefit more from the complex coordination of BOLAA, while mid-sized models (13b) may prefer PlanAct.
Simply increasing context length (e.g., LongChat-16k) does not guarantee better performance; hallucination can increase with longer context if the agent architecture isn't robust.

📚 Prerequisite Knowledge

Prerequisites

Understanding of LLM prompting strategies (Zero-shot, Few-shot)
Familiarity with ReAct (Reasoning + Acting) pattern
Basic knowledge of web navigation and QA environments

Key Terms

LAA: LLM-augmented Autonomous Agent—an agent that uses an LLM as its core controller to generate actions.

BOLAA: The proposed multi-agent architecture where a controller orchestrates specialized labor agents (e.g., separate agents for clicking vs. searching).

ReAct: Reason+Act—an agent architecture that prompts the LLM to generate a thought/reasoning trace before emitting an action.

PlanAct: An architecture where the agent generates a high-level plan before starting the interaction loop.

WebShop: A simulated e-commerce environment for evaluating web agents on searching and purchasing items.

HotPotQA: A question-answering dataset requiring multi-hop reasoning across multiple documents.

CoT: Chain-of-Thought—prompting the model to generate intermediate reasoning steps.

Hallucination: When an agent generates actions or facts not supported by the environment or observation.