BOLAA orchestrates multiple specialist agents (e.g., search-only and click-only) under a central controller to outperform single generalist agents on complex decision-making tasks.
Core Problem
Current investigations into LLM-augmented agents (LAAs) lack comprehensive comparisons of architectures (like ReAct vs. Planning) and struggle to scale single agents to complex open-domain tasks due to context limits and hallucination.
Why it matters:
Optimal agent architecture remains undetermined, with limited understanding of how different LLM backbones perform across different agent designs.
Single agents handling multiple action types (reasoning, searching, clicking) often fail in complex environments due to divided attention and context constraints.
Existing benchmarks often fail to jointly evaluate the interplay between agent architecture and the underlying LLM backbone.
Concrete Example: In a web navigation task, a single agent must decide whether to 'click' a button or 'search' for a query. A generalist agent might hallucinate a click action when it should search. BOLAA splits this into a 'search agent' and a 'click agent', so that each focuses only on its own action type.
Key Novelty
BOLAA (Multi-Agent Orchestration with Controller)
Decouples complex tasks into distinct labor agents (e.g., one for searching, one for clicking) managed by a central controller.
The controller selects the most relevant labor agent for the current state and manages communication, rather than a single LLM trying to handle all action types.
Provides a unified benchmark comparing 6 distinct agent architectures (ZeroShot, ReAct, PlanAct, etc.) across multiple open-source and proprietary LLMs.
Architecture
The BOLAA architecture diagram showing the Controller and Labor Agents Pool.
Evaluation Highlights
BOLAA achieves the highest rewards on WebShop decision-making tasks compared to 5 other architectures (ReAct, PlanAct, etc.), especially with high-performing LLMs.
BOLAA with a smaller 3B model (fastchat-t5-3b) performs comparably to single-agent architectures using much larger models, demonstrating the efficiency of specialized orchestration.
Llama-2-70b performs best under the BOLAA architecture, while Llama-2-13b favors PlanAct, suggesting that the best architecture depends on model size.
Breakthrough Assessment
7/10
Provides a valuable, comprehensive benchmark of agent architectures whose relative merits are often taken for granted. The results for the proposed BOLAA architecture support the 'mixture of experts/agents' intuition for complex tasks.
⚙️ Technical Details
Problem Definition
Setting: Sequential decision-making in interactive environments (WebShop, HotPotQA) using LLM-based agents.
Inputs: Task instruction (e.g., 'Find a tripod under $130') and current environment observation (simplified HTML/text).
The controller parses the generated action and executes it in the environment.
The environment returns a new observation.
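The parse-execute-observe cycle above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `ToyEnv`, `scripted_policy`, and `run_episode` are hypothetical names, the policy is a hard-coded stub standing in for an LLM call, and the `search[...]` / `click[...]` action strings follow the WebShop convention.

```python
import re
from typing import Callable, List, Optional, Tuple

# Matches action strings such as "search[tripod]" or "click[Buy Now]".
ACTION_RE = re.compile(r"^(search|click)\[(.+)\]$")


def parse_action(text: str) -> Optional[Tuple[str, str]]:
    """Return (action_type, argument), or None if the text is not a valid action."""
    match = ACTION_RE.match(text.strip())
    return (match.group(1), match.group(2)) if match else None


class ToyEnv:
    """Hypothetical stand-in for an interactive environment (not WebShop itself)."""

    def __init__(self) -> None:
        self.observation = "Search page"
        self.done = False

    def step(self, action_type: str, argument: str) -> str:
        if action_type == "search":
            self.observation = f"Results for '{argument}': [Buy Now]"
        elif action_type == "click" and argument == "Buy Now":
            self.observation = "Order placed"
            self.done = True
        return self.observation


def run_episode(policy: Callable[[str, str], str], instruction: str,
                max_steps: int = 5) -> List[Tuple[str, str]]:
    """Loop: the policy emits text, it is parsed and executed, the env returns an observation."""
    env = ToyEnv()
    trajectory: List[Tuple[str, str]] = []
    for _ in range(max_steps):
        raw = policy(instruction, env.observation)
        parsed = parse_action(raw)
        if parsed is None:
            # Unparseable output is recorded and skipped rather than executed.
            trajectory.append((raw, "invalid action"))
            continue
        trajectory.append((raw, env.step(*parsed)))
        if env.done:
            break
    return trajectory


def scripted_policy(instruction: str, observation: str) -> str:
    """Stub used in place of an LLM call (assumption for illustration)."""
    if observation == "Search page":
        return "search[tripod under $130]"
    return "click[Buy Now]"


trajectory = run_episode(scripted_policy, "Find a tripod under $130")
```

Rejecting unparseable output before it reaches the environment is one simple way to keep hallucinated actions from being executed.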
System Modules
Controller
Selects which labor agent to call based on current state and history; manages communication.
Model or implementation: LLM backbone (e.g., Llama-2-70b, GPT-3.5)
Labor Agents Pool
Contains specialized agents (e.g., SearchAgent, ClickAgent) that only generate specific types of actions.
Model or implementation: LLM backbone (can be same or different from Controller)
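A rough sketch of how the two modules might fit together is given below. All class names, prompt strings, and the fallback rule are assumptions made for illustration; the LLMs are replaced by stub functions, and the paper's actual prompts and selection logic may differ.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

LLM = Callable[[str], str]  # hypothetical interface: prompt in, completion out


@dataclass
class LaborAgent:
    name: str
    action_type: str  # the only action type this agent may emit
    llm: LLM

    def act(self, instruction: str, observation: str) -> str:
        argument = self.llm(
            f"{instruction}\n{observation}\nArgument for {self.action_type}:"
        )
        # The wrapper, not the LLM, fixes the action type.
        return f"{self.action_type}[{argument.strip()}]"


@dataclass
class Controller:
    llm: LLM
    pool: Dict[str, LaborAgent]

    def select(self, observation: str, history: List[str]) -> LaborAgent:
        names = ", ".join(self.pool)
        choice = self.llm(
            f"Agents: {names}\nHistory: {history}\n"
            f"Observation: {observation}\nPick one agent:"
        ).strip()
        # Assumed fallback: use the first agent if the choice is not in the pool.
        return self.pool.get(choice, next(iter(self.pool.values())))


# Stub LLMs standing in for real model calls.
def controller_llm(prompt: str) -> str:
    return "search" if "Observation: Search page" in prompt else "click"


def search_llm(prompt: str) -> str:
    return "tripod under $130"


def click_llm(prompt: str) -> str:
    return " Buy Now\n"


pool = {
    "search": LaborAgent("search", "search", search_llm),
    "click": LaborAgent("click", "click", click_llm),
}
controller = Controller(controller_llm, pool)

instruction = "Find a tripod under $130"
first = controller.select("Search page", []).act(instruction, "Search page")
second = controller.select("Results: [Buy Now]", [first]).act(instruction, "Results: [Buy Now]")
```

Because each labor agent wraps its LLM output in a fixed action type, a search agent in this sketch cannot emit a click action, whatever its backbone generates.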
Novel Architectural Elements
Controller-Labor architecture where labor agents are restricted to specific action types (Search vs Click), unlike standard multi-agent systems where agents might be generic but have different personas.
Training Method: Prompt engineering / in-context learning. The paper's core contribution involves no fine-tuning; it compares pre-trained and instruction-tuned models as they are.
Key Hyperparameters:
context_length: Varies by model (e.g., 4k for Llama-2, 16k for LongChat)
prompt_style: Zero-shot or Few-shot depending on architecture variant (ZS-LAA vs ReAct)
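To illustrate the difference in prompt style, the sketch below builds a zero-shot prompt and a few-shot prompt with thought traces from the same helper. The template wording, the `build_prompt` function, and the example demonstration are assumptions for illustration, not the prompts used in the paper.

```python
from typing import Sequence, Tuple

Example = Tuple[str, str, str]  # (observation, thought, action)


def build_prompt(instruction: str, observation: str,
                 examples: Sequence[Example] = (),
                 with_thought: bool = False) -> str:
    """Assemble a prompt; no examples gives zero-shot, examples give few-shot."""
    parts = [f"Instruction: {instruction}"]
    for ex_obs, ex_thought, ex_action in examples:
        parts.append(f"Observation: {ex_obs}")
        if with_thought:
            parts.append(f"Thought: {ex_thought}")
        parts.append(f"Action: {ex_action}")
    parts.append(f"Observation: {observation}")
    # A ReAct-style prompt asks for a thought first; a plain one asks for the action.
    parts.append("Thought:" if with_thought else "Action:")
    return "\n".join(parts)


demo = [("Search page", "I should search for the item first.", "search[red mug]")]

zero_shot = build_prompt("Find a tripod under $130", "Search page")
react_few_shot = build_prompt("Find a tripod under $130", "Search page",
                              examples=demo, with_thought=True)
```

The few-shot variant is longer by construction, which matters for backbones with a small context window.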
Compute: Not reported in the paper
Comparison to Prior Work
vs. ReAct: BOLAA uses multiple specialized agents managed by a controller rather than one monolithic agent loop.
vs. ReWOO: BOLAA focuses on orchestrating specialized action-type agents (search vs click) rather than just separating planning from execution.
vs. AutoGPT [not cited in paper]: BOLAA structures the multi-agent interaction hierarchically (Controller -> Labor) specifically around action types, whereas AutoGPT often spawns sub-agents dynamically for sub-goals.
Limitations
Controller overhead: Requires an extra LLM call to select the agent, potentially increasing latency and cost.
Performance depends heavily on the base LLM's ability to follow controller instructions.
Only evaluated on two environments (WebShop and HotPotQA), limiting claims of generalizability.
No statistical significance tests reported for the performance differences.
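The controller overhead can be made concrete with a back-of-the-envelope count. The sketch assumes exactly one controller call plus one labor-agent call per step, against one call per step for a solo agent; the function name and the per-step counts are assumptions, not figures from the paper.

```python
def llm_calls_per_episode(steps: int, use_controller: bool) -> int:
    """Count LLM calls, assuming one labor call per step plus an optional controller call."""
    if steps < 0:
        raise ValueError("steps must be non-negative")
    calls_per_step = 2 if use_controller else 1
    return steps * calls_per_step


solo_calls = llm_calls_per_episode(10, use_controller=False)
orchestrated_calls = llm_calls_per_episode(10, use_controller=True)
```

Under these assumptions a 10-step episode needs 20 calls with a controller and 10 without, so any accuracy gain has to be weighed against roughly doubled call volume.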
Code is publicly available at https://github.com/JimSalesforce/BOLAA. The paper uses standard open benchmarks (WebShop, HotPotQA) and open models (Llama-2, Vicuna), which supports reproducibility.
📊 Experiments & Results
Evaluation Setup
Simulation environments for decision making and reasoning.
WebShop (simulated e-commerce environment for searching and purchasing items)
HotPotQA (multi-hop question answering with a Wikipedia API)
Metrics:
Reward (WebShop: Attribute overlap ratio; HotPotQA: F1 score)
Recall (WebShop: Whether ground truth item was retrieved)
Statistical methodology: Not explicitly reported in the paper
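The metrics above can be approximated with the simplified sketch below. It is an illustration under stated assumptions: the F1 here is a plain token-overlap score on lowercased whitespace tokens without the answer normalization used by official HotPotQA scoring, and `attribute_overlap` and `recall_hit` are hypothetical helpers that cover only what the metric descriptions state.

```python
from collections import Counter
from typing import Iterable, Set


def token_f1(prediction: str, truth: str) -> float:
    """Token-level F1 between a predicted answer and the gold answer."""
    pred_tokens = prediction.lower().split()
    gold_tokens = truth.lower().split()
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def attribute_overlap(selected: Set[str], target: Set[str]) -> float:
    """Fraction of target attributes that the selected item also has."""
    if not target:
        return 0.0
    return len(selected & target) / len(target)


def recall_hit(retrieved_ids: Iterable[str], ground_truth_id: str) -> bool:
    """Whether the ground-truth item appears among the retrieved items."""
    return ground_truth_id in set(retrieved_ids)


f1 = token_f1("New York City", "new york")
reward = attribute_overlap({"black", "aluminum"},
                           {"black", "aluminum", "lightweight", "tripod"})
hit = recall_hit(["B01", "B02", "B03"], "B02")
```

In the first call, precision is 2/3 and recall is 1, giving an F1 of 0.8; the second call matches two of four target attributes, giving 0.5.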
Key Results
Results on WebShop show BOLAA consistently outperforming single-agent architectures across various LLM backbones.
Benchmark | Metric | Baseline | This Paper | Δ
WebShop | Reward | 0.55 | 0.62 | +0.07
WebShop | Reward | 0.48 | 0.51 | +0.03
Experiment Figures
Comparison of solo agent architectures: ZeroShot (ZS-LAA), ZeroShotThink (ZST-LAA), and ReAct LAA.
Architecture of PlanAct LAA and PlanReAct LAA.
Main Takeaways
BOLAA (multi-agent orchestration) yields the best performance on WebShop compared to solo architectures like ReAct and PlanAct, particularly with powerful models like Llama-2-70b.
Smaller models (e.g., 3B parameters) utilizing the BOLAA architecture can rival the performance of larger models using single-agent architectures, suggesting orchestration is a compute-efficient strategy.
Agent architecture must be aligned with the LLM backbone; larger models (70b) benefit more from the complex coordination of BOLAA, while mid-sized models (13b) may prefer PlanAct.
Simply increasing context length (e.g., LongChat-16k) does not guarantee better performance; hallucination can increase with longer context if the agent architecture isn't robust.
📚 Prerequisite Knowledge
Prerequisites
Understanding of LLM prompting strategies (Zero-shot, Few-shot)
Familiarity with ReAct (Reasoning + Acting) pattern
Basic knowledge of web navigation and QA environments
Key Terms
LAA: LLM-augmented Autonomous Agent—an agent that uses an LLM as its core controller to generate actions.
BOLAA: The proposed multi-agent architecture where a controller orchestrates specialized labor agents (e.g., separate agents for clicking vs. searching).
ReAct: Reason+Act—an agent architecture that prompts the LLM to generate a thought/reasoning trace before emitting an action.
PlanAct: An architecture where the agent generates a high-level plan before starting the interaction loop.
WebShop: A simulated e-commerce environment for evaluating web agents on searching and purchasing items.
HotPotQA: A question-answering dataset requiring multi-hop reasoning across multiple documents.
CoT: Chain-of-Thought—prompting the model to generate intermediate reasoning steps.
Hallucination: When an agent generates actions or facts not supported by the environment or observation.