Adaptive Collaboration with Humans: Metacognitive Policy Optimization for Multi-Agent LLMs with Continual Learning

📝 Paper Summary

Human-Agent Collaboration · Multi-Agent Systems (MAS) · Continual Learning

HILA enables multi-agent systems to strategically defer to humans via a learned metacognitive policy, utilizing a dual-loop framework that optimizes deferral decisions with RL and capability growth with continual learning.

Core Problem

Autonomous multi-agent systems are 'closed-world,' bounded by their pre-training data, making them brittle on tasks requiring new knowledge or expertise not present in their training corpora.

Why it matters:

Purely autonomous agents cannot generate genuinely new knowledge, leading to collective failure on tasks requiring real-time info or domain expertise
Current human-in-the-loop methods rely on static heuristics (e.g., confidence thresholds) for deferral rather than learned policies
Existing feedback mechanisms treat human input as one-time fixes rather than supervised signals for long-term capability growth

Concrete Example: When a multi-agent system faces a problem requiring domain-specific expertise absent from its training data, internal collaboration merely recombines existing ignorance, leading to confident but wrong answers. HILA detects this uncertainty and triggers a 'Defer' action to a human expert, then learns from the expert's response.

Key Novelty

Dual-Loop Policy Optimization (DLPO) for Metacognitive Agents

Equips agents with a 'metacognitive policy' to decide between autonomous actions (Eval, Create) and strategic deferral (asking a human)
Separates optimization into two loops: an inner RL loop (GRPO) to learn *when* to ask, and an outer Continual Learning loop to learn *what* the expert demonstrated

Architecture

Overview of HILA and Dual-Loop Policy Optimization, illustrating the coupling of multi-agent collaboration with human interaction.

Breakthrough Assessment

8/10

Proposes a principled, mathematically grounded framework (Dual-Loop) for integrating human experts into MAS, moving beyond simple heuristics to learned metacognition and continual improvement.

⚙️ Technical Details

Problem Definition

Setting: Metacognitive Markov Decision Process (Meta-MDP)

Inputs: Shared cognitive state containing task context, self context (agent's own solution), and peer context (other agents' responses)

Outputs: Metacognitive action (Eval, Create, Defer) and resulting solution sequence

Pipeline Flow

Cognitive State Construction (Task + Self + Peer Context)
Metacognitive Policy (Selects Action: Eval, Create, or Defer)
Action Execution (Internal Generation or Human Call)
State Update
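
The four pipeline steps above can be sketched as a single update function. This is a minimal illustration, not the authors' implementation: the names `CognitiveState`, `Action`, and `step`, the callables `policy`, `generate`, and `ask_human`, and the majority-vote rule used for Eval are all assumptions introduced here.

```python
from collections import Counter
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List, Tuple


class Action(Enum):
    EVAL = "eval"      # adopt an existing peer solution
    CREATE = "create"  # generate a new solution
    DEFER = "defer"    # ask the human expert


@dataclass
class CognitiveState:
    task: str                                                 # task context
    self_solution: str                                        # self context
    peer_solutions: List[str] = field(default_factory=list)   # peer context


def step(
    state: CognitiveState,
    policy: Callable[[CognitiveState], Action],
    generate: Callable[[str], str],
    ask_human: Callable[[str], str],
) -> Tuple[Action, CognitiveState]:
    """Run one cycle: select an action, execute it, update the state."""
    action = policy(state)
    if action is Action.EVAL:
        if state.peer_solutions:
            # Hypothetical selection rule: take the most common peer answer.
            solution = Counter(state.peer_solutions).most_common(1)[0][0]
        else:
            solution = state.self_solution  # nothing to select from
    elif action is Action.CREATE:
        solution = generate(state.task)
    else:
        solution = ask_human(state.task)
    new_state = CognitiveState(state.task, solution, list(state.peer_solutions))
    return action, new_state
```

In this sketch the policy is just a callable; in HILA it is an LLM sharing weights with the generator, so `policy` and `generate` would wrap the same model.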

System Modules

Metacognitive Policy

Decides whether to exploit existing knowledge (Eval), explore new solutions (Create), or seek help (Defer)

Model or implementation: LLM-based Policy (Shared weights with generator)

Action Executor

Executes the chosen strategy: generates a solution, selects a peer solution, or invokes the expert

Model or implementation: LLM Generator or External Human Interface

Dual-Loop Optimizer

Updates model weights based on rewards (RL) and expert data (CL)

Model or implementation: Gradient-based Optimizer

Novel Architectural Elements

Metacognitive action space definition ({Eval, Create, Defer}) replacing standard token-level action space
Integration of structured cognitive signals (social consensus, monitoring, control) into the policy state via lightweight heuristics
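
To make "lightweight heuristics" concrete, here is one way a social-consensus signal could be computed from the self and peer answers. The summary does not give the actual features used, so the function name `consensus_signal`, the normalization, and the three returned fields are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Union


def consensus_signal(
    self_answer: str, peer_answers: List[str]
) -> Dict[str, Union[float, bool, int]]:
    """Summarize agreement among agents as simple features (assumed design)."""
    # Normalize whitespace and case so trivially different strings match.
    answers = [a.strip().lower() for a in [self_answer, *peer_answers]]
    counts = Counter(answers)
    top_answer, top_count = counts.most_common(1)[0]
    return {
        "agreement": top_count / len(answers),       # share holding the majority answer
        "self_in_majority": answers[0] == top_answer,
        "distinct_answers": len(counts),
    }
```

Features like these could be serialized into the policy's input alongside the task text; low agreement or a self answer outside the majority would be natural evidence in favor of Defer.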

Modeling

Base Model: Large Language Models (Specific architecture not explicitly named in text)

Training Method: Dual-Loop Policy Optimization (DLPO)

Objective Functions:

Purpose: Optimize the policy to balance success against cost.

Formally: Inner Loop uses GRPO with reward R(s,a) = R_gt(y) - Cost(a)

Purpose: Learn from expert demonstrations to improve underlying capability.

Formally: Outer Loop uses SFT loss L_SFT = -log P(y_human | s_t)

Purpose: Joint optimization.

Formally: L_total = L_Inner + lambda_sft * I(Defer) * L_SFT
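
The three objectives can be illustrated numerically with a small sketch. Several details are assumptions made here, not taken from the paper: R_gt is treated as 1 for a correct answer and 0 otherwise, the cost values in `COSTS` (including a zero cost for Eval) are placeholders, the advantage uses the population standard deviation, and `inner_loss` is passed in as a plain number where a real trainer would compute the GRPO policy loss.

```python
import math
from statistics import mean, pstdev
from typing import List

# Placeholder costs; the source only constrains C_defer > C_create >= 0.
COSTS = {"eval": 0.0, "create": 0.1, "defer": 0.3}


def reward(correct: bool, action: str) -> float:
    """R(s, a) = R_gt(y) - Cost(a), with R_gt assumed to be 0/1."""
    return (1.0 if correct else 0.0) - COSTS[action]


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """GRPO-style advantage: standardize each reward within its sampled group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def sft_loss(token_probs: List[float]) -> float:
    """L_SFT = -log P(y_human | s_t), summed over the expert's tokens."""
    return -sum(math.log(p) for p in token_probs)


def total_loss(
    inner_loss: float, action: str, token_probs: List[float], lambda_sft: float = 0.5
) -> float:
    """L_total = L_Inner + lambda_sft * I(Defer) * L_SFT."""
    if action != "defer":
        return inner_loss  # indicator is 0: no expert demonstration to learn from
    return inner_loss + lambda_sft * sft_loss(token_probs)
```

The indicator I(Defer) is what couples the loops: the supervised term contributes only on steps where the agent actually deferred and an expert response exists.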

Key Hyperparameters:

cost_structure: C_defer > C_create >= 0 (Explicitly stated constraint)
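
The cost ordering implies a break-even point for deferral, which a short calculation makes visible. Assuming a 0/1 correctness reward, the expected reward of an action is its success probability minus its cost, so deferring pays off only when the expert's accuracy advantage exceeds the extra cost C_defer - C_create. The functions below are an illustrative back-of-the-envelope check with hypothetical names; HILA learns this trade-off through RL and does not apply a fixed rule.

```python
def expected_reward(p_correct: float, cost: float) -> float:
    """E[R] = p_correct * 1 + (1 - p_correct) * 0 - cost, assuming 0/1 R_gt."""
    return p_correct - cost


def should_defer(p_auto: float, p_expert: float, c_create: float, c_defer: float) -> bool:
    """Compare expected rewards of Create and Defer under the stated cost ordering."""
    if not (c_defer > c_create >= 0):
        raise ValueError("expected C_defer > C_create >= 0")
    return expected_reward(p_expert, c_defer) > expected_reward(p_auto, c_create)
```

With example costs C_create = 0.1 and C_defer = 0.3 and a always-correct expert, an agent that is right 90% of the time should act on its own, while one that is right 40% of the time should defer.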

Compute: Not reported in the paper

Comparison to Prior Work

vs. Structured Debate: HILA integrates external human expertise dynamically rather than relying solely on internal consensus
vs. Traditional HITL: HILA learns a policy for *when* to ask (vs. heuristics) and uses feedback for continual learning (vs. one-time fix)

Limitations

Relies on the availability of a human expert or oracle for the 'Defer' action
Quantitative results and baseline comparisons are not included in the excerpt summarized here

Reproducibility

Code: https://github.com/USC-Melady/HILA.git

The excerpt summarized here does not specify model sizes, exact training data, or compute resources.

📊 Experiments & Results

Evaluation Setup

Evaluation on mathematical and problem-solving benchmarks (as mentioned in Abstract)

Benchmarks:

Mathematical reasoning (specific benchmarks not named in the excerpt)
General problem-solving (specific benchmarks not named in the excerpt)

Metrics:

Not reported in the paper
Statistical methodology: Not explicitly reported in the paper

Main Takeaways

The framework establishes a principled foundation for agentic systems that can continually improve through human collaboration.
The proposed Dual-Loop Policy Optimization allows agents to balance the cost of human intervention against the risk of autonomous failure.
By treating expert feedback as supervision, the system transforms from a closed-world operator to an open-ended learner.

📚 Prerequisite Knowledge

Prerequisites

Multi-Agent Systems (MAS)
Reinforcement Learning (RL)
Markov Decision Processes (MDP)
Continual Learning

Key Terms

HILA: Human-In-the-Loop Multi-Agent Collaboration—the proposed framework for adaptive human-agent interaction

DLPO: Dual-Loop Policy Optimization—a training method combining inner-loop RL for decision-making and outer-loop supervised learning for knowledge acquisition

Meta-MDP: Metacognitive Markov Decision Process—a formalization where actions represent high-level cognitive strategies (e.g., creating vs. deferring) rather than just token generation

GRPO: Group Relative Policy Optimization, an RL algorithm that scores each sampled output relative to the other outputs sampled for the same input rather than against a learned value function; used here to optimize the deferral policy

Metacognitive Policy: A high-level policy that reasons about the agent's own competence and peer agreement to decide whether to act autonomously or seek help

SFT: Supervised Fine-Tuning—used in the outer loop to train the model on expert demonstrations provided during deferral