Concise and Precise Context Compression for Tool-Using Language Models

📝 Paper Summary

Memory recall Tool-use post-training

A context compression framework for tool-using models that combines selective retention of key terms (such as tool and parameter names) with block-wise soft compression to achieve high compression ratios without losing functional precision.

Core Problem

Tool documentation is lengthy and static, consuming valuable context window space and slowing decoding. Existing soft compression methods cause 'key information loss' (e.g., hallucinating parameter names) and lack flexibility for variable documentation lengths.

Why it matters:

Lengthy documentation for multiple tools can easily exceed input windows (thousands of tokens), making complex tool-use prohibitive
Standard compression treats all text equally, but in tool use, a misspelled parameter name causes immediate execution failure, unlike general text summarization where approximation is acceptable
Existing methods typically compress to fixed lengths, which is inefficient for short tools and lossy for long ones

Concrete Example: When compressing documentation for an API, a standard soft compressor might summarize the functionality well but hallucinate the parameter name 'user_id' as 'uid'. Consequently, the generated function call fails during execution.

Key Novelty

Selective and Block-wise Context Compression

Selective Compression: Identifies 'key information' (tool/parameter names) and preserves them as raw text tokens, while compressing descriptive text into soft vectors
Block Compression: Splits documentation into chunks based on a target compression ratio rather than a fixed target length, allowing flexible handling of diverse document sizes
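
The selective step can be sketched as follows. This is a minimal illustration and not the paper's implementation: the function name `split_key_and_plain` and the idea of passing key terms as a ready-made list are assumptions, since the summary does not say how key information is identified.

```python
import re

def split_key_and_plain(doc, key_terms):
    """Split doc into ("key", text) and ("plain", text) segments, in order.

    Hypothetical helper: key_terms is assumed to be known up front
    (for example, taken from a parsed API schema).
    """
    if not key_terms:
        return [("plain", doc)] if doc else []
    # Try longer terms first so "user_id_list" is not cut at "user_id".
    ordered = sorted(set(key_terms), key=len, reverse=True)
    pattern = re.compile("(" + "|".join(re.escape(t) for t in ordered) + ")")
    key_set = set(key_terms)
    segments = []
    for piece in pattern.split(doc):
        if not piece:
            continue  # re.split yields empty strings at the boundaries
        kind = "key" if piece in key_set else "plain"
        segments.append((kind, piece))
    return segments

doc = "get_user: fetch a profile. user_id: numeric id of the account."
for kind, text in split_key_and_plain(doc, ["get_user", "user_id"]):
    print(kind, repr(text))
```

Segments tagged "key" would be passed to the decoder as raw tokens, and only the "plain" segments would go through the soft compressor. Concatenating the segment texts in order reproduces the original documentation, so nothing is dropped by the split itself.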

Architecture

The concise and precise context compression framework, illustrating the interaction between Selective Compression and Block Compression.

Evaluation Highlights

Achieves performance comparable to the upper-bound baseline (uncompressed full context) at compression ratios of up to 16x on API-Bank and APIBench
Selective compression significantly mitigates key information loss compared to standard soft compression
Block compression introduces no additional performance loss compared to overall compression while enabling length flexibility

Breakthrough Assessment

7/10

Addresses a critical bottleneck in agentic AI (context length) with a practical, domain-aware solution. The 16x compression ratio with negligible loss is significant, though the method relies on known soft-prompting techniques.

⚙️ Technical Details

Problem Definition

Setting: Compress a tool documentation token sequence T into a shorter representation C containing soft tokens, such that a decoder can accurately generate function calls

Inputs: Tool documentation T (functionality descriptions, parameters)

Outputs: Structured function call defined by the documentation

Pipeline Flow

Documentation Splitter (Separates Key Info vs. Compressible Text)
Block Chunker (Chunks compressible text based on ratio r)
Compressor (Encodes chunks into soft tokens)
Sequence Assembler (Interleaves Raw Key Tokens + Soft Tokens)
Decoder (Generates function call)
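
The flow from chunking to sequence assembly can be sketched with placeholders standing in for soft tokens. This is an assumption-laden sketch: the names `chunk` and `assemble`, the `block_len` parameter, and the `<soft_k>` placeholder strings are all hypothetical, and a real compressor would emit embedding vectors rather than strings.

```python
def chunk(tokens, block_len):
    """Cut a token list into consecutive blocks; the last may be shorter."""
    return [tokens[i:i + block_len] for i in range(0, len(tokens), block_len)]

def assemble(segments, block_len=8, ratio=4):
    """Build the decoder-side sequence from (kind, tokens) segments.

    Key segments are copied through as raw tokens. Each plain block is
    replaced by block_len // ratio placeholders (assumed fixed per block).
    """
    if block_len % ratio != 0:
        raise ValueError("block_len must be a multiple of ratio")
    soft_per_block = block_len // ratio
    sequence, soft_id = [], 0
    for kind, tokens in segments:
        if kind == "key":
            sequence.extend(tokens)
            continue
        for _block in chunk(tokens, block_len):
            # Stand-in for the compressor: emit placeholder soft tokens.
            for _ in range(soft_per_block):
                sequence.append(f"<soft_{soft_id}>")
                soft_id += 1
    return sequence

segments = [
    ("key", ["get_user"]),
    ("plain", ["w"] * 10),   # 10 tokens -> 2 blocks -> 4 soft tokens
    ("key", ["user_id"]),
    ("plain", ["w"] * 8),    # 8 tokens -> 1 block -> 2 soft tokens
]
print(assemble(segments))
```

In this toy input, 20 original tokens become a sequence of 8 items, and the two names survive verbatim at their original positions relative to the compressed descriptions.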

System Modules

Documentation Splitter

Identifies key information (tool/parameter names) to preserve as raw text

Model or implementation: Rule-based/heuristic (implied by definition of keys)

Compressor

Compresses plain text blocks into soft summary tokens

Model or implementation: Pre-trained Language Model (weights shared with Decoder)

Decoder

Generates the structured function call using the compressed context

Model or implementation: Pre-trained Language Model (weights shared with Compressor)

Novel Architectural Elements

Interleaved context representation: The context seen by the decoder is a hybrid sequence of learnable soft embeddings (for descriptions) and discrete raw tokens (for API names)
Parallel block compression: Compressing documentation chunks independently and concatenating results to handle variable lengths
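
The two elements above can be illustrated on toy vectors. Mean pooling is used here purely as a stand-in for the learned compressor (the paper's compressor is a language model, not an average), and the names `mean_pool`, `compress_blocks`, and `build_decoder_input` are hypothetical. The point of the sketch is structural: each block is compressed without looking at the others, and the results are placed in one sequence alongside untouched key embeddings.

```python
def mean_pool(vectors):
    """Average a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

def compress_blocks(blocks):
    """Compress each block on its own; one soft vector per block."""
    return [mean_pool(block) for block in blocks]

def build_decoder_input(segments):
    """Interleave raw key embeddings with per-block soft vectors."""
    rows = []
    for kind, payload in segments:
        if kind == "key":
            rows.extend(payload)                   # raw embeddings, unchanged
        else:
            rows.extend(compress_blocks(payload))  # payload: list of blocks
    return rows

key = [[9.0, 9.0]]
block_a = [[1.0, 3.0], [3.0, 5.0]]
block_b = [[0.0, 0.0], [2.0, 2.0]]
print(build_decoder_input([("key", key), ("plain", [block_a, block_b])]))
```

Because `compress_blocks` handles each block separately, changing one block cannot alter the soft vector of another, which is what allows blocks to be processed in parallel and documents of any length to be handled by adding more blocks.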

Modeling

Base Model: Pre-trained Language Model (specific family not named in text, likely LLaMA based on citations)

Training Method: Continual pre-training and fine-tuning pipeline

Objective Functions:

Purpose: Optimize the model to generate correct function calls given compressed context.

Formally: Standard language modeling (cross-entropy) loss on the decoder output.

Purpose: (Optional Variant) Ensure compressed tokens retain semantic meaning.

Formally: Auxiliary reconstruction loss to recover raw text T from soft tokens C (following Ge et al., 2023).
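
One plausible way to write the two objectives is given below. The notation is an assumption introduced here for illustration: x denotes the user query, y the target function call, and θ the shared compressor/decoder parameters; none of these symbols appear in the summary, which only names T and C. How the two terms are weighted when the optional variant is used is not stated, so no combined loss is shown.

```latex
% Language modeling loss on the function call y, conditioned on the
% compressed context C and the query x (x, y, \theta are assumed notation).
\mathcal{L}_{\mathrm{LM}}
  = -\sum_{t=1}^{|y|} \log p_\theta\left(y_t \mid C,\, x,\, y_{<t}\right)

% Optional auxiliary reconstruction loss: recover the raw documentation T
% token by token from the compressed context C.
\mathcal{L}_{\mathrm{AE}}
  = -\sum_{i=1}^{|T|} \log p_\theta\left(T_i \mid C,\, T_{<i}\right)
```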

Training Data:

Pre-training data is randomly chunked into key blocks and plain blocks to simulate the compression structure

Key Hyperparameters:

compression_ratio: Up to 16x

Compute: Not reported in the paper

Comparison to Prior Work

vs. Ge et al. / Chevalier et al.: The proposed method adds 'Selective Compression' (keeping raw tokens for sensitive entities) to prevent hallucination of API names, which pure soft compression struggles with.
vs. LLMLingua: Uses soft summary tokens (embeddings) rather than discrete token pruning (hard deletion).
vs. Standard Soft Compression: Uses 'Block Compression' to support variable compression ratios rather than fixed-length summaries.

Limitations

The last chunk in block compression may not be full, slightly affecting the exact compression ratio target.
Requires re-training/fine-tuning the model to act as both compressor and decoder; cannot be applied zero-shot to black-box APIs.
Requires pre-processing to identify 'key information', which implies a need for parsers or heuristics specific to the documentation format.
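
The first limitation is easy to quantify with a little arithmetic. The sketch below assumes each block, full or not, is compressed to the same number of soft tokens (`block_len // ratio`); the function names and the `block_len` parameter are hypothetical and chosen only for illustration.

```python
from math import ceil

def soft_token_count(n_plain, block_len, ratio):
    """Soft tokens produced for n_plain compressible tokens."""
    if n_plain <= 0:
        raise ValueError("n_plain must be positive")
    if block_len % ratio != 0:
        raise ValueError("block_len must be a multiple of ratio")
    blocks = ceil(n_plain / block_len)       # the last block may be partial
    return blocks * (block_len // ratio)     # fixed soft tokens per block

def effective_ratio(n_plain, block_len, ratio):
    """Achieved compression ratio on the compressible text."""
    return n_plain / soft_token_count(n_plain, block_len, ratio)

print(effective_ratio(128, 64, 16))  # two full blocks
print(effective_ratio(100, 64, 16))  # second block only partly filled
```

With a target of 16x and 64-token blocks, 128 tokens fill two blocks exactly and reach 16.0, while 100 tokens still occupy two blocks and reach only 12.5. The gap shrinks as documents get longer, because only the final block can be partial.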

Reproducibility

No code URL provided in the text. The method relies on splitting documentation into 'key' and 'plain' parts, but the exact heuristics for identifying key parts (names/parameters) are not detailed in the snippet.

📊 Experiments & Results

Evaluation Setup

Tool-use evaluation where models must generate correct function calls based on compressed documentation

Benchmarks:

API-Bank (Tool-using / Function calling)
APIBench (Tool-using / Function calling)

Metrics:

Performance (Metric not specified in snippet, likely Accuracy or Success Rate)
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
API-Bank	Relative Performance to Upper Bound	100	100	0
APIBench	Relative Performance to Upper Bound	100	100	0

Main Takeaways

High Compression Ratio: The method supports up to 16x compression of tool documentation while maintaining performance comparable to using the full, uncompressed documentation.
Importance of Selective Compression: Retaining key information (names/parameters) as raw text is crucial; without it, compression loss leads to failed function calls.
Efficiency of Block Compression: Splitting documentation into blocks allows for controllable compression ratios and handles variable-length documents without performance degradation compared to compressing the whole document at once.

📚 Prerequisite Knowledge

Prerequisites

Soft prompts / Soft tokens (continuous vector representations of text)
Function calling / Tool use in LLMs
Transformer-based language modeling

Key Terms

soft context compression: Encoding a long text sequence into a smaller sequence of continuous vector embeddings (soft tokens) rather than discrete text tokens

key information: Specific terms in tool documentation that must be exact for execution, specifically defined in this paper as names of tools and parameters

selective compression: A strategy where key information is kept as original text tokens, while other context is compressed into soft tokens

block compression: Dividing the input text into chunks and compressing each chunk independently to a fixed number of soft tokens, allowing for a consistent compression ratio across variable-length documents

compressor: The language model component responsible for encoding raw text chunks into soft token embeddings

decoder: The language model component that uses the compressed soft tokens to generate the final response (function call)