Don't Fine-Tune, Decode: Syntax Error-Free Tool Use via Constrained Decoding

📝 Paper Summary

Multi-call tool use with fixed plan, Constrained Decoding

ToolDec eliminates syntax errors in LLM tool use by enforcing grammar constraints via a Finite State Machine during the decoding process, enabling generalist models to match specialist performance.

Core Problem

Generalist LLMs frequently fail to follow complex syntax constraints when calling tools (e.g., JSON schemas), leading to high error rates even after instruction tuning.

Why it matters:

Syntax errors prevent models from successfully executing tools, rendering reasoning capabilities useless.
Existing solutions like fine-tuning or extensive prompting are computationally expensive and struggle to generalize to new tools without retraining.
Generalist models often achieve 0% accuracy on tool benchmarks purely due to syntactic formatting failures.

Concrete Example: When using an unknown tool in ToolEval, Mistral-Instruct-7B has a syntax error rate over 90%, resulting in 0% accuracy because it fails to format arguments correctly (e.g., generating invalid JSON).

Key Novelty

ToolDec (Finite State Machine Constrained Decoding)

Constructs a Finite State Machine (FSM) from tool documentation (e.g., JSON schemas) that explicitly defines all valid token sequences.
During inference, the decoding algorithm masks out all tokens that would violate the tool's syntax, forcing the model to generate only syntactically valid calls.
Offloads syntax enforcement to the decoding algorithm, allowing the removal of complex syntax instructions from the prompt (prompt compression).
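
The masking step described above can be written as a renormalized softmax over the allowed tokens. The notation here is an assumption introduced for illustration and is not taken from the paper: z_t(v) is the model's logit for token v at step t, s_t is the FSM state reached after the tokens generated so far, and A(s_t) is the (non-empty) set of tokens with a valid transition out of s_t.

```latex
\tilde{z}_t(v) =
\begin{cases}
z_t(v) & \text{if } v \in \mathcal{A}(s_t) \\
-\infty & \text{otherwise}
\end{cases}
\qquad
\tilde{p}(v \mid x_{<t}) =
\frac{\exp \tilde{z}_t(v)}{\sum_{u \in \mathcal{A}(s_t)} \exp z_t(u)}
```

Because every token outside A(s_t) receives probability zero, any sampling or search strategy applied to the masked distribution can only emit sequences the FSM accepts.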

Architecture

A composite view of how ToolDec constructs an FSM from a schema and uses it during inference. Figure 3 shows the FSM structure derived from JSON. Figure 5 shows the decoding process.

Evaluation Highlights

Raises Mistral-Instruct-7B's win rate on ToolEval (I2-Category) from 0% to 52.2%, matching the performance of the fine-tuned specialist ToolLLM.
Achieves 0 syntax errors across all tested models and benchmarks, compared to >90% error rates for some baselines.
Roughly halves prompt length on ToolEval (456.9 to 210.3 tokens per tool) by removing syntax examples, while maintaining or improving performance.

Breakthrough Assessment

8/10

Simple yet highly effective intervention. It solves the specific sub-problem of syntax errors completely (0% error rate) without training, allowing generalist models to compete with specialists.

⚙️ Technical Details

Problem Definition

Setting: Constrained text generation where the output must conform to a specific formal grammar (e.g., JSON schema for a tool call).

Inputs: Natural language instruction and tool descriptions.

Outputs: A tool call (name and arguments) that syntactically adheres to the tool's definition.

Pipeline Flow

FSM Construction: Tool Schema → Finite State Machine
Prompt Compression: Tool Docs → Compressed Prompt (Semantic only)
Inference: LLM + Compressed Prompt + FSM → Syntactically Correct Output

System Modules

FSM Constructor (Preprocessing)

Recursively converts machine-readable tool documentation (e.g., OpenAPI) into a Finite State Machine.

Model or implementation: Algorithmic (Recursive construction)
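
A minimal sketch of what a recursive schema-to-FSM construction could look like. This is not the paper's implementation: the schema format, the FSM class, and the supported types are all hypothetical, and the automaton works on characters rather than model tokens to keep the example short. It also assumes a fixed key order, that every property is required, and exact `": "` and `", "` spacing.

```python
import string
from collections import defaultdict

SAFE_CHARS = string.ascii_letters + string.digits + " "  # assumed string alphabet


class FSM:
    """Deterministic character-level automaton; state 0 is the start state."""

    def __init__(self):
        self.trans = defaultdict(dict)  # state -> {char: next state}
        self.num_states = 1
        self.accepting = set()

    def new_state(self):
        self.num_states += 1
        return self.num_states - 1

    def add_literal(self, state, text, end=None):
        """Spell out text from state, reusing shared prefixes; return the final state."""
        for i, ch in enumerate(text):
            nxt = self.trans[state].get(ch)
            if nxt is None:
                is_last = i == len(text) - 1
                nxt = end if (is_last and end is not None) else self.new_state()
                self.trans[state][ch] = nxt
            state = nxt
        return state

    def accepts(self, text):
        state = 0
        for ch in text:
            state = self.trans[state].get(ch)
            if state is None:
                return False
        return state in self.accepting


def build(fsm, schema, state):
    """Recursively add transitions for schema starting at state; return the end state."""
    kind = schema["type"]
    if kind == "integer":                      # -?[0-9]+
        digits = fsm.new_state()
        sign = fsm.add_literal(state, "-")
        for d in string.digits:
            fsm.trans[state][d] = digits
            fsm.trans[sign][d] = digits
            fsm.trans[digits][d] = digits
        return digits
    if kind == "enum":                         # one quoted option, all ending in one state
        end = fsm.new_state()
        for option in schema["values"]:
            fsm.add_literal(state, f'"{option}"', end=end)
        return end
    if kind == "string":                       # quoted run of SAFE_CHARS
        body = fsm.add_literal(state, '"')
        for ch in SAFE_CHARS:
            fsm.trans[body][ch] = body
        return fsm.add_literal(body, '"')
    if kind == "object":                       # recurse into each property in order
        state = fsm.add_literal(state, "{")
        for i, (key, sub) in enumerate(schema["properties"].items()):
            if i:
                state = fsm.add_literal(state, ", ")
            state = fsm.add_literal(state, f'"{key}": ')
            state = build(fsm, sub, state)
        return fsm.add_literal(state, "}")
    raise ValueError(f"unsupported type: {kind}")


def compile_schema(schema):
    fsm = FSM()
    fsm.accepting.add(build(fsm, schema, 0))
    return fsm


schema = {"type": "object", "properties": {
    "unit": {"type": "enum", "values": ["km", "mi"]},
    "count": {"type": "integer"},
}}
fsm = compile_schema(schema)
print(fsm.accepts('{"unit": "km", "count": -12}'))  # True
print(fsm.accepts('{"unit": "kg", "count": 3}'))    # False: "kg" is not an allowed value
```

A token-level decoder would additionally need to map each vocabulary token onto a path through these character transitions.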

Prompt Compressor (Preprocessing)

Rewrites tool documentation to remove syntax constraints, keeping only semantic descriptions.

Model or implementation: LLM (for rewriting)
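
To make "semantic only" concrete, here is a rule-based stand-in for the compressor. The summary says the actual rewriting is done by an LLM, so this is a deliberately simplified substitute: the documentation layout and the `compress_tool_doc` helper are hypothetical, and the function keeps only names and descriptions while dropping types, enums, and required flags.

```python
import json


def compress_tool_doc(doc):
    """Keep only semantic fields (names and descriptions); drop syntax details."""
    lines = [f"{doc['name']}: {doc['description']}"]
    for param in doc["parameters"]:
        lines.append(f"- {param['name']}: {param['description']}")
    return "\n".join(lines)


# Hypothetical tool documentation, invented for this example.
doc = {
    "name": "get_weather",
    "description": "Look up the current weather for a city.",
    "parameters": [
        {"name": "city", "type": "string", "required": True,
         "description": "City to look up."},
        {"name": "unit", "type": "string", "enum": ["celsius", "fahrenheit"],
         "required": False, "description": "Temperature unit."},
    ],
}

full = json.dumps(doc)
compressed = compress_tool_doc(doc)
print(compressed)
print(len(full), "characters before,", len(compressed), "after")
```

The dropped details (types, allowed values, required fields) are exactly the ones the FSM enforces at decoding time, which is why they no longer need to appear in the prompt.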

FSM-Guided Decoder

Masks invalid tokens at each decoding step based on the current FSM state.

Model or implementation: Base LLM (e.g., Mistral, Llama) with modified sampling
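
The masking step can be illustrated with a toy example. Everything here is an assumption made for illustration: the call format `{"n": <digits>}`, the small vocabulary, the fixed scores standing in for a model, and the helper names. The scores are chosen so that unconstrained greedy decoding would pick the invalid token `hello` at every step.

```python
NEG_INF = float("-inf")
EOS = "<eos>"


def make_fsm():
    """Character-level DFA for the toy call format {"n": <digits>}."""
    trans, state = {}, 0
    for ch in '{"n": ':                      # fixed prefix, one state per character
        trans[(state, ch)] = state + 1
        state += 1
    first_digit, more_digits, done = state, state + 1, state + 2
    for d in "0123456789":
        trans[(first_digit, d)] = more_digits
        trans[(more_digits, d)] = more_digits
    trans[(more_digits, "}")] = done
    return trans, {done}


def advance(trans, state, token):
    """Walk a multi-character token through the DFA; None means it is invalid."""
    for ch in token:
        state = trans.get((state, ch))
        if state is None:
            return None
    return state


def mask_logits(logits, vocab, trans, accepting, state):
    """Set the score of every token that would break the grammar to -inf."""
    masked = []
    for token, score in zip(vocab, logits):
        if token == EOS:
            ok = state in accepting          # stopping is only valid in an accepting state
        else:
            ok = advance(trans, state, token) is not None
        masked.append(score if ok else NEG_INF)
    return masked


def constrained_greedy(score_fn, vocab, trans, accepting, max_steps=20):
    state, out = 0, []
    for _ in range(max_steps):
        masked = mask_logits(score_fn(out), vocab, trans, accepting, state)
        best = max(range(len(vocab)), key=masked.__getitem__)
        if masked[best] == NEG_INF or vocab[best] == EOS:
            break                            # dead end, or a valid stop
        out.append(vocab[best])
        state = advance(trans, state, vocab[best])
    return "".join(out)


VOCAB = ["hello", "[", '{"', "{", "}", "n", '":', " ", "42", "4", EOS]
SCORES = [5.0, 4.0, 3.0, 2.5, 2.2, 2.0, 1.9, 1.8, 1.7, 1.0, 0.1]


def toy_model(prefix_tokens):
    return SCORES                            # stand-in for real LLM logits


trans, accepting = make_fsm()
result = constrained_greedy(toy_model, VOCAB, trans, accepting)
print(result)  # {"n": 42}
```

In a real decoder the same mask would be applied to the model's logit vector before sampling, and the set of allowed tokens per state could be precomputed and cached.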

Novel Architectural Elements

Syntactic FSM integrated directly into the LLM decoding loop to filter logits based on API schemas.
Trie-based structure for guiding tool selection among multiple available tools.
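
One way to picture the trie used for tool selection is as a nested dictionary keyed by the tokens of each tool name. The tool names, their tokenization, and the `END` marker below are hypothetical and only serve to show how the set of allowed next tokens falls out of the structure.

```python
END = "<end-of-name>"  # hypothetical marker: a complete tool name ends here


def build_trie(tokenized_names):
    """Insert each tool name, given as a list of tokens, into a nested-dict trie."""
    root = {}
    for tokens in tokenized_names:
        node = root
        for tok in tokens:
            node = node.setdefault(tok, {})
        node[END] = {}
    return root


def allowed_next(trie, prefix):
    """Return the tokens that can follow prefix while still spelling a known tool."""
    node = trie
    for tok in prefix:
        node = node[tok]  # raises KeyError if prefix is not part of any tool name
    return set(node)


# Assumed tokenization of three made-up tool names.
tools = [["get", "_", "weather"], ["get", "_", "time"], ["search"]]
trie = build_trie(tools)

print(sorted(allowed_next(trie, [])))            # ['get', 'search']
print(sorted(allowed_next(trie, ["get", "_"])))  # ['time', 'weather']
```

During decoding, the tokens returned by `allowed_next` would be the only ones left unmasked while the model is writing a tool name.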

Modeling

Base Model: Evaluated on Mistral-Instruct-7B, LLaMA-7B, ToolLLM (LLaMA-7B based), ToolkenGPT (LLaMA-33B based), RestGPT

Training Method: Inference-only decoding intervention (no parameter updates)

Adaptation: None (applied at inference time)

Trainable Parameters: 0

Compute: Negligible overhead (<0.1%) compared to LLM GPU cost; FSM construction is cached.

Comparison to Prior Work

vs. ToolLLM/ToolkenGPT: ToolDec requires no training/fine-tuning and guarantees 0 syntax errors, whereas fine-tuning only reduces them.
vs. Prompt Engineering: ToolDec allows removing syntax instructions from prompts (saving tokens), whereas prompting requires verbose examples.
vs. General Constrained Decoding (e.g., outputting valid JSON): ToolDec enforces specific schema constraints (parameter types, required fields) derived from documentation, not just generic JSON syntax.
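
The gap between generic JSON validity and schema validity can be seen with a small check. The parameter specification and the `schema_violations` helper are hypothetical and cover only two types, so this is an illustration of the distinction rather than a full validator.

```python
import json

TYPES = {"string": str, "integer": int}


def schema_violations(call_text, params):
    """List the ways a tool call's arguments violate a simple parameter spec."""
    try:
        args = json.loads(call_text)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    if not isinstance(args, dict):
        return ["not a JSON object"]
    problems = []
    for name, spec in params.items():
        if name not in args:
            if spec.get("required"):
                problems.append(f"missing required field: {name}")
            continue
        value = args[name]
        # bool is a subclass of int in Python, so reject it explicitly
        if isinstance(value, bool) or not isinstance(value, TYPES[spec["type"]]):
            problems.append(f"wrong type for {name}")
    for name in args:
        if name not in params:
            problems.append(f"unknown field: {name}")
    return problems


params = {
    "city": {"type": "string", "required": True},
    "days": {"type": "integer", "required": False},
}

print(schema_violations('{"city": "Paris", "days": 3}', params))  # []
print(schema_violations('{"days": "three"}', params))
# ['missing required field: city', 'wrong type for days']
```

The second call parses as JSON, so a decoder constrained only to generic JSON syntax could still produce it; a schema-derived FSM rules it out during generation instead of detecting it afterwards.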

Limitations

Depends on the availability of machine-readable tool documentation (e.g., OpenAPI).
Constructing an FSM for extremely complex or non-regular grammars may be non-trivial, although the paper's construction covers 16k+ REST APIs.
Does not improve the semantic reasoning capability of the model (choosing the right tool), only the syntactic correctness of the call.

Reproducibility

Code: https://github.com/chenhongqiao/tooldec

📊 Experiments & Results

Evaluation Setup

Tool use evaluation across diverse benchmarks requiring API calls.

Benchmarks:

ToolEval: Complex tool use with REST APIs (I2-Category and I3-Instruction subsets)
FuncQA: Numerical reasoning using arithmetic tools
KAMEL: Knowledge relation QA treated as API calls
RestBench: Real-world scenarios (TMDB, Spotify) requiring multiple API calls

Metrics:

Win Rate (vs ChatGPT)
Pass Rate
Syntax Error Rate
Tokens per Tool (tok/tool)

Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
ToolDec enables generalist models (Mistral) to go from 0% to competitive performance on ToolEval by fixing syntax errors.
ToolEval (I2-Cat)	Win Rate	0.0	52.2	+52.2
ToolEval (I2-Cat)	Pass Rate	0.0	56.4	+56.4
ToolDec improves specialist models (ToolLLM) by ensuring 0% syntax errors.
ToolEval (I2-Cat)	Win Rate	41.6	51.3	+9.7
ToolEval (I2-Cat)	Syntax Error Rate	21.6	0.0	-21.6
Efficiency improvements via prompt compression enabled by ToolDec.
ToolEval	tok/tool	456.9	210.3	-246.6
Performance on FuncQA numerical reasoning.
FuncQA	Accuracy	13.2	43.1	+29.9

Experiment Figures

Comparison of Win Rate and Syntax Error between Mistral-Instruct, ToolLLM, and ToolDec variants on ToolEval.

Main Takeaways

ToolDec completely eliminates syntax errors (0% rate) across all tested benchmarks and models.
Generalist models (like Mistral) effectively become tool-use specialists without any fine-tuning when equipped with ToolDec.
Prompt length drops substantially (456.9 to 210.3 tokens per tool on ToolEval) because syntax constraints no longer need to be described in text.
The approach is model-agnostic and computationally efficient (<0.1% overhead).

📚 Prerequisite Knowledge

Prerequisites

Understanding of Large Language Models (LLMs) and token sampling
Basic knowledge of Finite State Machines (FSMs) and Regular Expressions
Familiarity with REST APIs and JSON format

Key Terms

FSM: Finite State Machine—a computational model used here to map valid transitions between tokens to enforce grammar rules.

Trie: A tree data structure used to efficiently store and retrieve keys in a dataset of strings; used here to manage valid tool names.

JSON Schema: A declarative language that allows you to annotate and validate JSON documents.

Constrained Decoding: A technique where the set of possible next tokens is restricted to only those that satisfy certain constraints.

ToolEval: A benchmark dataset involving 16,000+ real-world REST APIs for evaluating tool-use capabilities.

ToolLLM: A LLaMA-7B model fine-tuned specifically for tool use on RapidAPI data.

ReAct: Reasoning and Acting—a paradigm where models generate reasoning traces and task-specific actions in an interleaved manner.

DFSDT: Depth-First Search-based Decision Tree, a search strategy used by ToolLLM to explore multiple reasoning paths.

ToolkenGPT: A method that learns specific embeddings (toolkens) for tools to facilitate their usage.

Mistral-Instruct: A generalist instruction-tuned language model based on the Mistral architecture.