Yang Xu, Yunlong Feng, Honglin Mu, Yutai Hou, Yitong Li, Xinghao Wang, Wanjun Zhong, Zhongyang Li, Dandan Tu, Qingfu Zhu, Min Zhang, Wanxiang Che
Harbin Institute of Technology
Huawei Technologies Co., Ltd.
arXiv (2024)
Memory, Agent, Benchmark
📝 Paper Summary
Memory recall, Tool-use post-training
A context compression framework for tool-using models that keeps key terms (such as tool and parameter names) as raw text and compresses the remaining documentation block by block into soft tokens, reaching high compression ratios without losing functional precision.
Core Problem
Tool documentation is lengthy and static, consuming valuable context window space and slowing decoding. Existing soft compression methods cause 'key information loss' (e.g., hallucinating parameter names) and lack flexibility for variable documentation lengths.
Why it matters:
Lengthy documentation for multiple tools can easily exceed the input window (thousands of tokens), making complex tool use prohibitively expensive
Standard compression treats all text equally, but in tool use, a misspelled parameter name causes immediate execution failure, unlike general text summarization where approximation is acceptable
Existing methods typically compress to fixed lengths, which is inefficient for short tools and lossy for long ones
Concrete Example: When compressing documentation for an API, a standard soft compressor might summarize the functionality well but hallucinate the parameter name 'user_id' as 'uid'. The generated function call then fails at execution time.
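The failure mode in the concrete example can be reproduced with a few lines of Python. This is only an illustration: `get_user_profile` and `try_call` are hypothetical names invented here, not part of the paper or of any benchmark API.

```python
def get_user_profile(user_id: int) -> dict:
    """Hypothetical tool whose documented parameter name is `user_id`."""
    return {"user_id": user_id, "name": "example"}


def try_call(tool, **kwargs):
    """Execute a generated call and report whether it ran successfully."""
    try:
        return True, tool(**kwargs)
    except TypeError as exc:
        # Python raises TypeError for an unknown keyword argument.
        return False, str(exc)


# Call generated from exact documentation: the parameter name is preserved.
ok_exact, result = try_call(get_user_profile, user_id=42)

# Call generated after lossy compression: `user_id` was hallucinated as `uid`.
ok_hallucinated, error = try_call(get_user_profile, uid=42)

print(ok_exact, ok_hallucinated)
print(error)
```

The second call fails even though `uid` is a reasonable paraphrase of `user_id`, which is why approximate recall of names is not enough for tool use.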
Key Novelty
Selective and Block-wise Context Compression
Selective Compression: Identifies 'key information' (tool/parameter names) and preserves them as raw text tokens, while compressing descriptive text into soft vectors
Block Compression: Splits documentation into chunks based on a target compression ratio rather than a fixed target length, allowing flexible handling of diverse document sizes
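A minimal sketch of ratio-based chunking follows, assuming a fixed block size and a fixed number of soft tokens per block. The block size of 64, the ratio of 16, and the function names are illustrative assumptions, not values or identifiers taken from the paper.

```python
def chunk_blocks(tokens, block_size):
    """Split a token list into consecutive blocks of at most `block_size` tokens."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [tokens[i:i + block_size] for i in range(0, len(tokens), block_size)]


def soft_tokens_per_block(block_size, ratio):
    """Number of soft tokens each block is compressed into for a target ratio."""
    return max(1, block_size // ratio)


BLOCK_SIZE, RATIO = 64, 16  # illustrative values only

short_doc = [f"t{i}" for i in range(64)]   # stand-in for a short tool description
long_doc = [f"t{i}" for i in range(512)]   # stand-in for a long tool description

for doc in (short_doc, long_doc):
    blocks = chunk_blocks(doc, BLOCK_SIZE)
    total_soft = len(blocks) * soft_tokens_per_block(BLOCK_SIZE, RATIO)
    print(len(doc), "tokens ->", total_soft, "soft tokens")
```

Under these assumptions the number of soft tokens grows with the document, so a short and a long document are both compressed at the same ratio, whereas a fixed-length summary would over-allocate for the first and under-allocate for the second.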
Architecture
Figure (not reproduced): overview of the concise and precise context compression framework, showing how Selective Compression and Block Compression interact.
Evaluation Highlights
Achieves performance comparable to the upper-bound baseline (uncompressed full context) at compression ratios of up to 16x on API-Bank and APIBench
Selective compression significantly mitigates key information loss compared to standard soft compression
Block compression introduces no additional performance loss compared to overall compression while enabling length flexibility
Breakthrough Assessment
7/10
Addresses a critical bottleneck in agentic AI (context length) with a practical, domain-aware solution. The 16x compression ratio with negligible loss is significant, though the method relies on known soft-prompting techniques.
⚙️ Technical Details
Problem Definition
Setting: Compress a tool documentation token sequence T into a shorter representation C containing soft tokens, such that a decoder can accurately generate function calls
Inputs: Tool documentation T (functionality descriptions, parameters)
Outputs: Structured function call defined by the documentation
Pipeline Flow
Documentation Splitter (Separates Key Info vs. Compressible Text)
Block Chunker (Chunks compressible text based on ratio r)
Compressor (Encodes chunks into soft tokens)
Sequence Assembler (Interleaves Raw Key Tokens + Soft Tokens)
Decoder (Generates function call)
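The splitter and assembler stages of the flow above can be sketched as follows. Several things here are assumptions made for illustration: the key names are supplied explicitly (for example from a tool schema), matching is done with a regular expression, whitespace splitting stands in for a real tokenizer, and the string `<soft>` stands in for a learned soft-token embedding. The actual compressor and decoder are language models and are not shown.

```python
import re
from math import ceil


def split_documentation(doc, key_names):
    """Split `doc` into ("key", text) and ("plain", text) segments."""
    if not key_names:
        return [("plain", doc.strip())]
    # Longest names first so that a longer name wins over one of its prefixes.
    ordered = sorted(key_names, key=len, reverse=True)
    pattern = r"\b(" + "|".join(re.escape(k) for k in ordered) + r")\b"
    segments = []
    for piece in re.split(pattern, doc):
        if not piece.strip():
            continue
        kind = "key" if piece in key_names else "plain"
        segments.append((kind, piece.strip()))
    return segments


def assemble(segments, ratio):
    """Interleave raw key tokens with placeholders for soft tokens."""
    sequence = []
    for kind, text in segments:
        if kind == "key":
            sequence.append(text)  # preserved verbatim
        else:
            n_soft = max(1, ceil(len(text.split()) / ratio))
            sequence.extend(["<soft>"] * n_soft)
    return sequence


doc = ("get_weather: returns the current weather for a city. "
       "Parameter city_name is the name of the city to look up.")
keys = {"get_weather", "city_name"}

sequence = assemble(split_documentation(doc, keys), ratio=4)
print(sequence)
```

In the printed sequence the two names appear exactly as written in the documentation, while the surrounding descriptions are reduced to a handful of placeholders.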
System Modules
Documentation Splitter
Identifies key information (tool/parameter names) to preserve as raw text
Model or implementation: Rule-based/heuristic (implied by definition of keys)
Compressor
Compresses plain text blocks into soft summary tokens
Model or implementation: Pre-trained Language Model (weights shared with Decoder)
Decoder
Generates the structured function call using the compressed context
Model or implementation: Pre-trained Language Model (weights shared with Compressor)
Novel Architectural Elements
Interleaved context representation: The context seen by the decoder is a hybrid sequence of learnable soft embeddings (for descriptions) and discrete raw tokens (for API names)
Parallel block compression: Compressing documentation chunks independently and concatenating results to handle variable lengths
Modeling
Base Model: Pre-trained Language Model (specific family not named in text, likely LLaMA based on citations)
Training Method: Continual pre-training and fine-tuning pipeline
Objective Functions:
Purpose: Optimize the model to generate correct function calls given the compressed context.
Primary loss: Standard language modeling (cross-entropy) loss on the decoder output.
Auxiliary loss: Reconstruction loss to recover the raw text T from the soft tokens C (following Ge et al., 2023).
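One way to write the two objectives down is given below. The notation is an assumption made for illustration: $q$ denotes the user request, $y$ the target function call, $\theta$ the shared model parameters, and $\lambda$ a weighting factor. The summary does not say whether the two losses are summed or applied in separate training stages, so the combined form on the last line is a sketch, not the paper's stated formulation.

```latex
\begin{align*}
\mathcal{L}_{\text{LM}}  &= -\sum_{t=1}^{|y|} \log p_\theta\left(y_t \mid C, q, y_{<t}\right) \\
\mathcal{L}_{\text{rec}} &= -\sum_{i=1}^{|T|} \log p_\theta\left(T_i \mid C, T_{<i}\right) \\
\mathcal{L}              &= \mathcal{L}_{\text{LM}} + \lambda \, \mathcal{L}_{\text{rec}}
\end{align*}
```

Here $T$ and $C$ are the documentation tokens and compressed representation from the problem definition; the reconstruction term rewards soft tokens that retain enough information to regenerate $T$.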
Training Data:
Pre-training data randomly chunked as key blocks and plain blocks to simulate the compression structure
Key Hyperparameters:
compression_ratio: Up to 16x
Compute: Not reported in the paper
Comparison to Prior Work
vs. Ge et al. / Chevalier et al.: Proposed method adds 'Selective Compression' (keeping raw tokens for sensitive entities) to prevent hallucination of API names, which pure soft compression struggles with.
vs. LLMLingua: Uses soft summary tokens (embeddings) rather than discrete token pruning (hard deletion).
vs. Standard Soft Compression: Uses 'Block Compression' to support variable compression ratios rather than fixed-length summaries.
Limitations
The last chunk in block compression may not be full, slightly affecting the exact compression ratio target.
Requires re-training/fine-tuning the model to act as both compressor and decoder; cannot be applied zero-shot to black-box APIs.
Requires pre-processing to identify 'key information', which implies a need for parsers or heuristics specific to the documentation format.
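The first limitation can be made concrete with a short calculation. The sketch assumes that every block, including a partially filled last one, is compressed into the same number of soft tokens; the summary does not confirm how the last block is handled, and the block size and ratio used are illustrative.

```python
from math import ceil


def effective_ratio(n_tokens, block_size, target_ratio):
    """Achieved compression ratio when each block yields a fixed soft-token budget."""
    soft_per_block = max(1, block_size // target_ratio)
    n_blocks = ceil(n_tokens / block_size)
    return n_tokens / (n_blocks * soft_per_block)


# Document length is a multiple of the block size: the target ratio is met exactly.
exact = effective_ratio(512, block_size=64, target_ratio=16)

# Eight extra tokens open a ninth, mostly empty block, lowering the achieved ratio.
partial = effective_ratio(520, block_size=64, target_ratio=16)

print(exact, round(partial, 2))
```

The gap shrinks as documents get longer relative to the block size, which is consistent with the limitation being described as slight.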
Reproducibility
No code URL provided in the text. The method relies on splitting documentation into 'key' and 'plain' parts, but the exact heuristics for identifying key parts (names/parameters) are not detailed in the snippet.
📊 Experiments & Results
Evaluation Setup
Tool-use evaluation where models must generate correct function calls based on compressed documentation
Benchmarks:
API-Bank (Tool-using / Function calling)
APIBench (Tool-using / Function calling)
Metrics:
Performance (Metric not specified in snippet, likely Accuracy or Success Rate)
Statistical methodology: Not explicitly reported in the paper
Key Results
Benchmark | Metric | Baseline | This Paper | Δ
API-Bank | Relative Performance to Upper Bound | 100 | 100 | 0
APIBench | Relative Performance to Upper Bound | 100 | 100 | 0
Main Takeaways
High Compression Ratio: The method supports up to 16x compression of tool documentation while maintaining performance comparable to using the full, uncompressed documentation.
Importance of Selective Compression: Retaining key information (names/parameters) as raw text is crucial; without it, compression loss leads to failed function calls.
Efficiency of Block Compression: Splitting documentation into blocks allows for controllable compression ratios and handles variable length documents without performance degradation compared to holistic compression.
📚 Prerequisite Knowledge
Prerequisites
Soft prompts / Soft tokens (continuous vector representations of text)
Function calling / Tool use in LLMs
Transformer-based language modeling
Key Terms
soft context compression: Encoding a long text sequence into a smaller sequence of continuous vector embeddings (soft tokens) rather than discrete text tokens
key information: Specific terms in tool documentation that must be exact for execution, specifically defined in this paper as names of tools and parameters
selective compression: A strategy where key information is kept as original text tokens, while other context is compressed into soft tokens
block compression: Dividing the input text into chunks and compressing each chunk independently to a fixed number of soft tokens, allowing for a consistent compression ratio across variable-length documents
compressor: The language model component responsible for encoding raw text chunks into soft token embeddings
decoder: The language model component that uses the compressed soft tokens to generate the final response (function call)