Evaluating Personalized Tool-Augmented LLMs from the Perspectives of Personalization and Proactivity

📝 Paper Summary

Personalized LLM Agents with Memory
Tool Learning Benchmarks

ETAPP is a benchmark evaluating how LLM agents utilize tools to serve users by integrating hierarchical memory (profiles and tool preferences) and proactive reasoning, assessed via a key-point-based LLM judge.

Core Problem

Current benchmarks evaluate either text personalization or generic tool use in isolation, so they cannot assess whether agents proactively use tools based on user-specific history and preferences.

Why it matters:

Generic assistants often fail to anticipate unspoken user needs, requiring users to issue explicit, burdensome instructions for every step
Existing evaluations focus on text generation quality rather than the correctness and personalization of the tool interaction process itself
Standard LLM-as-a-judge methods lack reliability in assessing subtle traits like proactivity without explicit ground-truth guidance

Concrete Example: When a user asks for fruit recommendations, a standard agent simply lists popular fruits. A personalized agent checks the user's 'Mediterranean diet' preference and proactively calls 'get_user_recent_workout_records' to tailor suggestions to their fitness goals.

Key Novelty

Evaluation of Tool-augmented Agent from the Personalization and Proactivity Perspective (ETAPP)

Introduces 'Proactivity' as a core metric: measuring if the agent performs helpful, unrequested actions (e.g., checking a calendar before scheduling) based on user context
Implements a 'Key-point-based LLM evaluation' where the judge model is fed manually annotated constraints (key points) for each test case to reduce grading variance
Constructs a hierarchical memory system splitting user data into high-level 'User Profiles' and low-level 'Tool-utilizing Preferences' for precise context retrieval

Architecture

The inference and evaluation framework: the User Profile, Tool Preferences, and Query flow into the Model, which interacts with the Sandbox. An Evaluator LLM then judges the output using manually annotated Key Points.

Evaluation Highlights

DeepSeek-V3 (ReAct) achieves the highest scores in the Tool-Retrieval setting (3.82 Procedure, 3.54 Personalization, 1.65 Proactivity), slightly outperforming GPT-4o
Fine-tuning Qwen2.5-7B with ReAct data raises the in-domain Procedure score from 2.76 to 3.47 relative to the vanilla model (reported as +25.8%)
Key-point-based evaluation increases agreement with human raters: 89.6% of Proactivity scores fall within 1 point of the human rating, a higher share than in the control group

Breakthrough Assessment

7/10

Significant contribution to agentic personalization by formalizing 'Proactivity' metrics and improving LLM-based evaluation reliability. However, the scope is limited to a simulated sandbox rather than real-world APIs.

⚙️ Technical Details

Problem Definition

Setting: Personalized tool-use environment where an agent must select and invoke APIs to satisfy a user query Q, conditioned on user profile and history.

Inputs: User query Q, Available Tools T, High-level User Profile P_h, Tool-utilizing Preferences P_t, User State C (time/location), Interaction History (9 days)

Outputs: A sequence of tool actions (function calls) and a final natural language response
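
To make the inputs and outputs above concrete, here is a minimal sketch of how one test case could be represented. The class and field names are illustrative assumptions, not the benchmark's actual schema, and the example values (including the timestamp and the preference text) are placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    """One function call issued by the agent (hypothetical shape)."""
    name: str
    arguments: dict


@dataclass
class TaskInput:
    """Everything the agent is conditioned on for one test case."""
    query: str                         # Q
    tools: list[str]                   # T, names of the available APIs
    user_profile: str                  # P_h, high-level profile
    tool_preferences: dict[str, str]   # P_t, keyed by tool category (assumed)
    user_state: dict[str, str]         # C, e.g. time and location
    history: list[str] = field(default_factory=list)  # prior interactions


@dataclass
class TaskOutput:
    """Tool-call trajectory plus the final natural-language answer."""
    actions: list[ToolCall]
    response: str


# Placeholder values loosely based on the fruit-recommendation example.
example = TaskInput(
    query="Recommend some fruit for me.",
    tools=["get_user_recent_workout_records"],
    user_profile="Follows a Mediterranean diet.",
    tool_preferences={"health": "Tailor food advice to recent workouts."},
    user_state={"time": "2024-01-01 09:00", "location": "home"},
)
result = TaskOutput(
    actions=[ToolCall("get_user_recent_workout_records", {})],
    response="Given your recent workouts, here are some fruit suggestions.",
)
```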

Pipeline Flow

Memory Retrieval (Fetch User Profile + Tool Preferences)
Tool Retrieval (Select relevant APIs from 33 available)
Reasoning & Execution (LLM plans and calls tools in a loop)
Response Generation (Final answer integrating tool outputs)
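
The four stages above can be sketched as a single loop. This is only an illustration under stated assumptions: the LLM is replaced by a scripted stand-in, the retriever returns every tool without ranking, and all function names except `get_user_recent_workout_records` are hypothetical.

```python
from typing import Callable


def run_pipeline(
    query: str,
    memory: dict,
    tool_library: dict[str, Callable[..., str]],
    retrieve: Callable,
    agent: Callable,
    max_turns: int = 5,
):
    """Run the four stages once and return (actions, final_response)."""
    # 1. Memory retrieval: put profile and tool preferences into the context.
    context = (
        f"Profile: {memory['profile']}\n"
        f"Tool preferences: {memory['tool_preferences']}\n"
        f"Query: {query}"
    )
    # 2. Tool retrieval: narrow the library to candidate tool names.
    candidates = retrieve(query, tool_library)
    # 3. Reasoning and execution: alternate agent steps and tool calls.
    actions, observations = [], []
    for _ in range(max_turns):
        step = agent(context, candidates, observations)
        if step[0] == "final":
            # 4. Response generation: the agent ends the loop with an answer.
            return actions, step[1]
        _, name, arguments = step
        if name in candidates:
            observation = tool_library[name](**arguments)
        else:
            observation = f"error: tool '{name}' is not available"
        actions.append((name, arguments))
        observations.append(observation)
    return actions, "No final answer within the turn limit."


def retrieve_all(query, tool_library):
    # Placeholder retriever: returns every tool name, with no ranking.
    return list(tool_library)


def scripted_agent(context, candidates, observations):
    # Stand-in for an LLM: call one tool, then answer from its observation.
    if not observations:
        return ("call", "get_user_recent_workout_records", {})
    return ("final", f"Fruit suggestions tailored to: {observations[-1]}")


tool_library = {
    "get_user_recent_workout_records": lambda: "3 runs this week",  # canned data
    "get_weather": lambda: "sunny",  # hypothetical tool name
}
memory = {
    "profile": "Mediterranean diet",
    "tool_preferences": {"health": "match food advice to workouts"},
}
actions, response = run_pipeline(
    "Recommend some fruit.", memory, tool_library, retrieve_all, scripted_agent
)
```

Passing the retriever and agent in as callables keeps the loop itself independent of any particular model or retrieval method.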

System Modules

Memory Manager

Injects hierarchical user information into context

Model or implementation: N/A (Data Structure)

Tool Retriever

Selects relevant tools from the library

Model or implementation: Dense Retriever (implied)

Agent (LLM)

Generates thoughts and tool calls

Model or implementation: Evaluated Models (GPT-4o, DeepSeek-V3, Qwen2.5, etc.)

Sandbox Environment

Executes API calls and returns observations

Model or implementation: Python Sandbox
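
As a rough illustration of the Sandbox Environment module, the sketch below dispatches named calls to local Python functions and returns JSON observations. The `Sandbox` class, the observation format, and the `days` parameter are assumptions made for this example; the benchmark's own sandbox may be organized differently.

```python
import json


class Sandbox:
    """Simulated API layer: dispatches named calls to local Python functions."""

    def __init__(self):
        self._tools = {}

    def register(self, func):
        # Used as a decorator; the function name becomes the tool name.
        self._tools[func.__name__] = func
        return func

    def execute(self, name: str, arguments: dict) -> str:
        """Run one call and return a JSON observation string."""
        if name not in self._tools:
            return json.dumps({"status": "error", "message": f"unknown tool: {name}"})
        try:
            result = self._tools[name](**arguments)
        except TypeError as exc:  # typically wrong or missing arguments
            return json.dumps({"status": "error", "message": str(exc)})
        return json.dumps({"status": "ok", "result": result})


sandbox = Sandbox()


@sandbox.register
def get_user_recent_workout_records(days: int = 7) -> list:
    # Canned records stand in for a real fitness API.
    records = [{"day": 1, "activity": "run"}, {"day": 3, "activity": "swim"}]
    return [r for r in records if r["day"] <= days]


observation = sandbox.execute("get_user_recent_workout_records", {"days": 2})
```

Returning errors as observations, rather than raising, lets the agent see a failed call and retry within the same loop.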

Novel Architectural Elements

Hierarchical preference injection: separating 'User Profile' (static) from 'Tool-utilizing Preferences' (dynamic/category-specific) to optimize context usage
Key-point injection mechanism: Providing ground-truth 'required actions' to the evaluator LLM to calculate scores
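
A minimal sketch of key-point injection follows. The prompt wording, the per-metric grouping of key points, and the one-line-per-metric answer format are assumptions for illustration; only the three metric names and the 0-5 scale come from the summary.

```python
import re

METRICS = ("Procedure", "Personalization", "Proactivity")


def build_judge_prompt(query: str, trajectory: str, key_points: dict[str, list[str]]) -> str:
    """Assemble a judge prompt that lists the annotated key points per metric."""
    lines = [
        "Score the assistant on each metric from 0 to 5.",
        f"User query: {query}",
        f"Assistant trajectory and response:\n{trajectory}",
    ]
    for metric in METRICS:
        lines.append(f"Key points for {metric}:")
        lines.extend(f"- {point}" for point in key_points.get(metric, []))
    lines.append("Answer with one line per metric, for example 'Procedure: 3'.")
    return "\n".join(lines)


def parse_scores(judge_output: str) -> dict[str, int]:
    """Extract integer scores (0-5) from lines such as 'Proactivity: 2'."""
    scores = {}
    for metric in METRICS:
        match = re.search(rf"{metric}\s*:\s*([0-5])\b", judge_output)
        if match:
            scores[metric] = int(match.group(1))
    return scores


prompt = build_judge_prompt(
    query="Recommend some fruit for me.",
    trajectory="Action: get_user_recent_workout_records ...",
    key_points={
        "Personalization": ["Respects the Mediterranean diet preference"],
        "Proactivity": ["Calls get_user_recent_workout_records unprompted"],
    },
)
scores = parse_scores("Procedure: 4\nPersonalization: 3\nProactivity: 1")
```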

Modeling

Base Model: Evaluated: GPT-4o, DeepSeek-V3, Llama-3.1-70B-Instruct, Qwen2.5-72B-Instruct. Fine-tuned: Qwen2.5-7B-Instruct.

Training Method: Supervised Fine-Tuning (SFT)

Adaptation: Full fine-tuning (implied for the 7B model experiments)

Trainable Parameters: Not reported in the paper

Training Data:

200 annotated data points (from the 800 total)
Split into ReAct format and FC format for comparison
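
The difference between the two training formats can be sketched as follows. The exact templates used for fine-tuning are not given in this summary, so the `Thought / Action / Action Input` layout and the JSON call shape below are common conventions assumed for illustration.

```python
import json


def to_react(thought: str, tool: str, arguments: dict) -> str:
    """ReAct-style step: an explicit reasoning trace precedes the tool call."""
    return (
        f"Thought: {thought}\n"
        f"Action: {tool}\n"
        f"Action Input: {json.dumps(arguments)}"
    )


def to_fc(tool: str, arguments: dict) -> str:
    """FC-style step: a structured call only, with no reasoning text."""
    return json.dumps({"name": tool, "arguments": arguments})


react_step = to_react(
    "The user follows a fitness plan, so recent workouts are relevant.",
    "get_user_recent_workout_records",
    {},
)
fc_step = to_fc("get_user_recent_workout_records", {})
```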

Compute: Not reported in the paper

Comparison to Prior Work

vs. ToolLLM: ETAPP adds user-specific memory (profiles/history) and evaluates 'Proactivity' (unprompted helpfulness) rather than just success rate
vs. RoleLLM: ETAPP evaluates functional execution in a sandbox, not just conversational style matching
vs. BFCL: ETAPP requires the model to infer parameters from long-term memory/preferences rather than explicit instructions

Limitations

Evaluation relies on LLM-as-a-judge, which may still have biases despite the key-point method.
The sandbox environment is simulated and may not capture the full complexity or latency of real-world API interactions.
The dataset size (800 cases) is relatively small compared to general tool benchmarks.
Does not consider multimodal tasks (images, audio).

Reproducibility

Code: https://github.com/hypasd-art/ETAPP

Code and dataset are publicly available in the repository above, which contains the sandbox environment, the 800 test cases, and the evaluation scripts. The paper text does not explicitly link model weights for the fine-tuned Qwen2.5-7B.

📊 Experiments & Results

Evaluation Setup

Sandboxed tool-use environment with 33 APIs. Two modes: 'Tool-Given' (relevant tools are provided) and 'Tool-Retrieval' (the model must search for tools).

Benchmarks:

ETAPP (Personalized Tool Use & Planning) [New]

Metrics:

Procedure (PRC): Completeness and accuracy of solution (0-5)
Personalization (PSN): Incorporation of user preferences (0-5)
Proactivity (PTV): Anticipating needs beyond explicit instructions (0-5)
Statistical methodology: Bland-Altman analysis used to validate agreement between Human and LLM Judge.
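
For readers unfamiliar with Bland-Altman analysis, the sketch below computes the two quantities usually reported (mean difference and 95% limits of agreement) along with the share of scores within 1 point. The score lists are made-up toy data, not results from the paper, and the function names are illustrative.

```python
from statistics import mean, stdev


def bland_altman(human, judge):
    """Return (bias, lower_limit, upper_limit) for paired scores."""
    diffs = [j - h for h, j in zip(human, judge)]
    bias = mean(diffs)          # average judge-minus-human difference
    spread = stdev(diffs)       # sample standard deviation of differences
    return bias, bias - 1.96 * spread, bias + 1.96 * spread


def within_one_point(human, judge):
    """Fraction of pairs whose scores differ by at most 1 point."""
    hits = sum(abs(j - h) <= 1 for h, j in zip(human, judge))
    return hits / len(human)


# Toy 0-5 scores for six test cases (illustrative only).
human = [2, 1, 3, 0, 4, 2]
judge = [2, 2, 3, 2, 4, 1]

bias, lower, upper = bland_altman(human, judge)
agreement = within_one_point(human, judge)
```

A bias near zero with narrow limits suggests the LLM judge tracks human raters closely; the within-1-point share is a simpler summary of the same pairs.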

Key Results

Performance of various LLMs using ReAct in the Tool-Retrieval setting (the harder, more realistic setting). The 'This Paper' column is DeepSeek-V3 (ReAct).
Benchmark	Metric	Baseline	This Paper	Δ
ETAPP (Tool-Retrieval)	Procedure (PRC)	3.70	3.82	+0.12
ETAPP (Tool-Retrieval)	Personalization (PSN)	3.43	3.54	+0.11
ETAPP (Tool-Retrieval)	Proactivity (PTV)	1.56	1.65	+0.09
Impact of fine-tuning (FT) on Qwen2.5-7B-Instruct. 'ID' = In-Domain (seen user/instruction types), 'OOD' = Out-of-Domain. Baseline is the vanilla model; 'This Paper' is the fine-tuned model.
Benchmark	Metric	Baseline	This Paper	Δ
ETAPP (Subset)	Procedure (PRC) on ID Data	2.76	3.47	+0.71
ETAPP (Subset)	Proactivity (PTV) on ID Data	1.35	1.99	+0.64
ETAPP (Subset)	Procedure (PRC) on OOD Data	2.91	3.52	+0.61
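
The Δ column can be re-derived from the two score columns with a few lines of arithmetic. The sketch below does so and also computes relative change; because the table entries are rounded to two decimals, the relative figures are approximate. The row labels are shorthand introduced here.

```python
# (label, baseline, this_paper) taken from the table above.
rows = [
    ("Tool-Retrieval PRC", 3.70, 3.82),
    ("Tool-Retrieval PSN", 3.43, 3.54),
    ("Tool-Retrieval PTV", 1.56, 1.65),
    ("FT ID PRC", 2.76, 3.47),
    ("FT ID PTV", 1.35, 1.99),
    ("FT OOD PRC", 2.91, 3.52),
]


def summarize(rows):
    """Return {label: (absolute_delta, relative_percent)} rounded for display."""
    summary = {}
    for label, baseline, ours in rows:
        delta = ours - baseline
        summary[label] = (round(delta, 2), round(100 * delta / baseline, 1))
    return summary


summary = summarize(rows)
for label, (delta, percent) in summary.items():
    print(f"{label}: {delta:+.2f} ({percent:+.1f}%)")
```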

Experiment Figures

Performance comparison of FC (Function Calling), ReAct, and E-ReAct (Enhanced ReAct) across the three metrics.

Bland-Altman plot measuring agreement between Human and LLM evaluation with Key Points.

Main Takeaways

ReAct prompting consistently outperforms Function Calling (FC) in personalization and proactivity, as the reasoning trace helps the model justify tool choices based on user history.
Reasoning models like DeepSeek-R1 and QwQ performed surprisingly poorly (e.g., DeepSeek-R1 scored 0.93 PRC in the Tool-Retrieval setting), likely because they answer directly without invoking tools or overthink without adhering to tool protocols.
Fine-tuning effectively teaches the 'tool invocation process' (Procedure) even for OOD data, but 'Proactivity' is harder to generalize to new scenarios.
Inputting only relevant 'Needed' preferences performs competitively with inputting 'All' preferences (PRC 4.16 vs 4.09), validating the efficiency of the hierarchical memory design.
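
The 'Needed' versus 'All' comparison in the last takeaway can be pictured as a filter over tool-utilizing preferences. How the benchmark decides which preferences are needed is not described in this summary; the sketch assumes preferences are keyed by tool category and that a category is needed when one of the tools in play belongs to it. All names and sample data are hypothetical.

```python
def select_preferences(tool_preferences, tool_categories, tools_in_play, mode="needed"):
    """Return the preference entries to place in the agent's context.

    tool_preferences: {category: preference text}
    tool_categories:  {tool name: category}
    tools_in_play:    tool names given to (or retrieved for) the agent
    """
    if mode == "all":
        return dict(tool_preferences)
    needed = {tool_categories[t] for t in tools_in_play if t in tool_categories}
    return {cat: pref for cat, pref in tool_preferences.items() if cat in needed}


tool_preferences = {
    "health": "Match food advice to recent workouts.",
    "shopping": "Prefer organic products.",
    "travel": "Prefer morning departures.",
}
tool_categories = {
    "get_user_recent_workout_records": "health",
    "search_products": "shopping",  # hypothetical tool name
}

needed = select_preferences(
    tool_preferences, tool_categories, ["get_user_recent_workout_records"]
)
everything = select_preferences(
    tool_preferences, tool_categories, ["get_user_recent_workout_records"], mode="all"
)
```

Under this assumption the 'Needed' context carries one preference instead of three, which is the kind of context saving the hierarchical design is meant to provide.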

📚 Prerequisite Knowledge

Prerequisites

Understanding of Tool Learning / Tool-Augmented LLMs
Familiarity with ReAct (Reasoning + Acting) prompting
Basics of LLM-as-a-judge evaluation frameworks

Key Terms

Proactivity: The capability of an agent to anticipate and suggest actions beyond the user's explicit request to help complete tasks more comprehensively (e.g., checking health status before recommending food).

Key-point-based Evaluation: An evaluation method where the LLM judge is provided with specific, manually written criteria (key points) that the model output must satisfy, rather than grading openly.

ReAct: Reasoning and Acting, a prompting strategy where the model generates a thought/reasoning trace before taking an action (calling a tool).

FC: Function Calling, a mode where the model outputs a structured API call directly without an explicit reasoning trace.

Sandbox: A controlled, simulated environment (virtual API system) that isolates experiments from external variables like internet connection or real-world state changes.

ID vs OOD: In-Domain (seen during training/fine-tuning) versus Out-of-Domain (unseen users or instructions).