Teaching Code LLMs to Use Autocompletion Tools in Repository-Level Code Generation

📝 Paper Summary

Multi-call tool use with flexible plan, Code Generation

ToolGen teaches Code LLMs to invoke static analysis-based autocompletion tools during generation by fine-tuning on functions augmented with special trigger tokens, resolving repository-level dependency errors.

Core Problem

Code LLMs generating repository-level code lack awareness of project-specific dependencies (user-defined attributes/functions), leading to undefined-variable and no-member errors.

Why it matters:

Over 70% of functions in real-world repositories are not standalone, making standard Code LLMs ineffective for practical software engineering.
Existing tool-use methods (like ToolFormer) struggle with repository dependencies because they rely on generic APIs rather than context-aware program analysis.
Dependency errors (e.g., hallucinating non-existent class members) significantly impede the usability of generated code.

Concrete Example: When generating a method using 'self.', a standard Code LLM might hallucinate an attribute '_updates' (causing a no-member error). In contrast, ToolGen invokes a static analysis tool (Jedi) at 'self.', which inspects the class definition and correctly suggests the existing attribute '_registered_updates'.
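To make the example concrete, here is a minimal sketch of the kind of lookup an autocompletion tool performs at `self.`. It uses only Python's standard `ast` module as a simplified stand-in for Jedi; it is not Jedi's API and is far less capable. The `Tracker` class and the `self_attributes` helper are hypothetical names introduced for illustration.

```python
import ast

# Hypothetical repository class; only the attribute names echo the example above.
SOURCE = '''
class Tracker:
    def __init__(self):
        self._registered_updates = []
        self.name = "tracker"

    def flush(self):
        self._registered_updates.clear()
'''


def self_attributes(source, class_name):
    """Collect members a completion at `self.` could offer for one class."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ClassDef) and node.name == class_name:
            # Attributes assigned through `self.<name> = ...`
            for sub in ast.walk(node):
                if (isinstance(sub, ast.Attribute)
                        and isinstance(sub.ctx, ast.Store)
                        and isinstance(sub.value, ast.Name)
                        and sub.value.id == "self"):
                    names.add(sub.attr)
            # Methods defined directly in the class body
            for item in node.body:
                if isinstance(item, ast.FunctionDef):
                    names.add(item.name)
    return sorted(names)


candidates = self_attributes(SOURCE, "Tracker")
print(candidates)
```

In this toy class the hallucinated `_updates` is absent from the candidate list while `_registered_updates` is present, which is the distinction the tool call is meant to surface.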

Key Novelty

ToolGen: Repository-Aware Tool Integration via Trigger Insertion

Fine-tunes Code LLMs to predict a special <COMP> token at positions where repository dependencies (such as class attributes) are accessed; predicting the token triggers an external autocompletion tool.
Integrates standard IDE-style static analysis tools (e.g., Jedi) directly into the LLM decoding loop to fetch valid identifiers from the project context.
Selects the best suggestion from the tool's list using a constrained greedy search by the LLM, bridging the gap between generative capability and strict repository constraints.

Architecture

Overview of the ToolGen approach, split into Offline (Trigger Insertion & Fine-tuning) and Online (Tool-integrated Code Generation) phases.

Evaluation Highlights

Dependency Coverage (covering real repo dependencies) improved by 31.4% to 39.1% in relative terms across CodeGPT, CodeT5, and CodeLlama compared to base models.
Static Validity Rate (passing dependency checks) increased by 44.9% to 57.7% in relative terms on the 12,406-function benchmark.
Achieved 40.0% (CodeT5) and 25.0% (CodeLlama) relative improvement in Pass@1 on CoderEval tasks involving repository dependencies.

Breakthrough Assessment

7/10

Addresses repository-level hallucinations effectively by combining LLMs with traditional static analysis, though its scope is limited to identifier completion.

⚙️ Technical Details

Problem Definition

Setting: Repository-level function generation where the model must generate code strictly adhering to existing project dependencies.

Inputs: Natural language description D and repository context R.

Outputs: A complete function F that is syntactically correct and respects dependencies in R.

Pipeline Flow

Trigger Insertion (Offline): Parse repo -> AST -> Insert <COMP> tokens before valid identifiers
Fine-tuning (Offline): Train LLM on augmented code to predict <COMP>
Inference (Online): LLM generates tokens -> If <COMP> predicted -> Call Tool -> Select Suggestion -> Append -> Continue
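The online phase above can be sketched as a small decoding loop. This is an illustrative sketch under stated assumptions, not the paper's implementation: `next_token`, `complete`, and `select` are hypothetical callables standing in for the fine-tuned LLM, the autocompletion tool, and the suggestion selector, and the behaviour when the tool returns no suggestion (skip the trigger and keep decoding) is an assumption.

```python
COMP = "<COMP>"
EOS = "<EOS>"  # hypothetical end-of-sequence marker for this sketch


def generate(next_token, complete, select, max_steps=50):
    """Decode token by token, calling the tool whenever <COMP> is predicted."""
    tokens = []
    for _ in range(max_steps):
        tok = next_token(tokens)
        if tok == EOS:
            break
        if tok == COMP:
            # The trigger itself is not kept; it is replaced by the chosen suggestion.
            candidates = complete(tokens)
            if candidates:
                tokens.append(select(tokens, candidates))
            continue  # assumption: with no suggestions, decoding simply goes on
        tokens.append(tok)
    return "".join(tokens)


# Toy stand-ins: a scripted "model", a fixed candidate list, and a trivial selector.
script = iter(["return ", "self.", COMP, ".clear()", EOS])
result = generate(
    next_token=lambda tokens: next(script),
    complete=lambda tokens: ["_registered_updates", "name"],
    select=lambda tokens, candidates: candidates[0],
)
print(result)
```

The loop keeps the tool call inside generation rather than as a post-processing step, so every token after the inserted identifier is conditioned on a name that exists in the repository.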

System Modules

Trigger Insertion

Pre-process training data by inserting <COMP> tokens at positions where static analysis tools can provide valid suggestions (identifiers, not keywords).

Model or implementation: Algorithm 1 (AST Traversal + Jedi)
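A rough sketch of the trigger-insertion idea, using only the standard `ast` module: it places <COMP> before every attribute name reached through a dot. It deliberately skips the Jedi check that the summary attributes to Algorithm 1 (confirming that the tool can suggest the identifier at that position), so it marks more positions than the described method would. The `insert_triggers` helper is a hypothetical name, and the sketch assumes ASCII source because `ast` column offsets are UTF-8 byte offsets.

```python
import ast

COMP = "<COMP>"


def insert_triggers(source):
    """Insert <COMP> in front of each attribute name in `source`."""
    lines = source.splitlines()
    positions = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Attribute):
            # The attribute name occupies the last len(attr) columns of the node span.
            row = node.end_lineno - 1
            col = node.end_col_offset - len(node.attr)
            positions.add((row, col))
    # Apply insertions from the end backwards so earlier offsets stay valid.
    for row, col in sorted(positions, reverse=True):
        lines[row] = lines[row][:col] + COMP + lines[row][col:]
    return "\n".join(lines)


code = "def flush(self):\n    self._registered_updates.clear()\n"
augmented = insert_triggers(code)
print(augmented)
```

The augmented text is what a model would be fine-tuned on, so that it learns to emit the trigger at the same kind of position during generation.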

Fine-tuned Code LLM

Generate code and predict when to invoke the tool via <COMP> token.

Model or implementation: CodeGPT / CodeT5 / CodeLlama (fine-tuned)

Autocompletion Tool

Provide a list of valid identifier completions based on repository context.

Model or implementation: Jedi (Static Analysis)

Suggestion Selector

Select the best candidate from the tool's output using the LLM's probability distribution.

Model or implementation: Constrained Greedy Search
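One way to picture constrained greedy selection is as ordinary greedy decoding in which only continuations that keep the output inside the candidate list are allowed. The sketch below works on single characters for readability, whereas a real model would score subword tokens; `score_next`, `toy_scores`, and the `<END>` marker are hypothetical, and the details may differ from the paper's procedure.

```python
END = "<END>"  # hypothetical marker meaning "the identifier stops here"


def constrained_greedy_select(candidates, score_next):
    """Greedily extend a prefix, one character at a time, within `candidates`."""
    if not candidates:
        raise ValueError("no candidates to choose from")
    prefix = ""
    alive = sorted(set(candidates))
    while True:
        # Only characters that continue some surviving candidate are allowed.
        allowed = {c[len(prefix)] for c in alive if len(c) > len(prefix)}
        if prefix in alive:
            allowed.add(END)  # the prefix is itself a complete candidate
        scores = score_next(prefix)
        best = max(sorted(allowed), key=lambda t: scores.get(t, float("-inf")))
        if best == END:
            return prefix
        prefix += best
        alive = [c for c in alive if c.startswith(prefix)]


def toy_scores(prefix):
    """Stand-in for LLM log-probabilities over the next character."""
    table = {
        "": {"_": -0.1, "f": -3.0, "n": -2.5},
        "_": {"r": -0.2, "_": -2.0},
    }
    return table.get(prefix, {})


chosen = constrained_greedy_select(
    ["__init__", "_registered_updates", "flush", "name"], toy_scores
)
print(chosen)
```

Because the allowed set is rebuilt from the surviving candidates at every step, the result is always one of the tool's suggestions, while the choice among them is driven by the model's own scores.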

Novel Architectural Elements

Integration of static analysis tool invocation directly into the LLM decoding loop via a learned special token <COMP>.
Constrained greedy search mechanism to select the best tool output using the LLM's own likelihoods.

Modeling

Base Model: CodeGPT, CodeT5, and CodeLlama (7B)

Training Method: Supervised Fine-Tuning (SFT) with LoRA

Objective Functions:

Purpose: Minimize the difference between predicted and actual tokens in the augmented dataset.

Formally: Standard Cross-Entropy Loss over the sequence.

Adaptation: LoRA (Low-Rank Adaptation) used for CodeLlama-7B; Full fine-tuning for smaller models.
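For reference, the objective and the adaptation described above can be written in standard notation. The symbols are assumptions introduced here rather than the paper's: $x_1, \dots, x_T$ is the token sequence of an augmented function (including <COMP> tokens), $c$ is the model's input prompt, $\theta$ the trainable parameters, $W_0$ a frozen pre-trained weight matrix, and $B$, $A$ the trainable low-rank LoRA factors.

```latex
% Token-level cross-entropy over the augmented sequence
\mathcal{L}(\theta) = - \sum_{t=1}^{T} \log p_\theta\left(x_t \mid x_{<t},\, c\right)

% LoRA reparameterisation of a weight matrix (W_0 stays frozen)
W = W_0 + BA, \qquad
B \in \mathbb{R}^{d \times r},\;
A \in \mathbb{R}^{r \times k},\;
r \ll \min(d, k)
```

Because <COMP> is part of the target sequence, the same loss that teaches the model to write code also teaches it when to emit the trigger.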

Training Data:

Augmented Dataset: 249,298 Python functions from 12,231 repositories.
Functions augmented with <COMP> token using Jedi analysis on ASTs.

Key Hyperparameters:

learning_rate: Not reported in the paper
batch_size: Not reported in the paper
epochs: Not reported in the paper

Compute: Inference latency 0.63 to 2.34 seconds per function (depending on model size).

Comparison to Prior Work

vs. ToolFormer: ToolGen handles tools that return a list of candidates rather than a single value, and focuses on repo-level dependencies rather than standalone calculations.
vs. ToolCoder: ToolGen uses static analysis (Jedi) for local repo context, whereas ToolCoder uses IR for global API search; ToolGen handles multi-candidate selection.
vs. Repilot: ToolGen generates entire functions (creative) rather than fixing single-hunk bugs (repair); Repilot triggers tools unnecessarily often if applied to generation, while ToolGen learns *when* to trigger.
vs. Monitor-Guided Decoding [not cited in paper]: ToolGen embeds the tool trigger into the vocabulary/training, whereas monitor-guided approaches often use external heuristics to interrupt generation.

Limitations

Dependency on the quality and speed of the external autocompletion tool (Jedi).
Focuses only on identifier completion within function bodies, not on other code elements or scopes.
Requires parsing the entire repository context, which can be computationally expensive for very large repos.
Evaluation limited to Python language.

Reproducibility

Code availability is not explicitly provided in the paper text (no URL in abstract or intro). The benchmark dataset is constructed from public repositories and CoderEval.

📊 Experiments & Results

Evaluation Setup

Repository-level function generation using Python repositories.

Benchmarks:

Custom Benchmark (Function Generation) [New]
CoderEval (Repository-level Code Generation (Pragmatic/Context-aware))

Metrics:

Dependency Coverage (New)
Static Validity Rate (New)
BLEU-4
CodeBLEU
Pass@1 (CoderEval)
Statistical methodology: Not explicitly reported in the paper
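The two new metrics can be sketched as follows, reading Dependency Coverage as the share of ground-truth repository dependencies that also appear in the generated code, and Static Validity Rate as the share of generated functions that pass a static dependency check. The pooling across functions (one global ratio rather than a per-function average) and both function names are assumptions; the paper's exact aggregation is not given in this summary.

```python
def dependency_coverage(pairs):
    """Percentage of ground-truth dependencies covered, pooled over all functions.

    `pairs` is a list of (truth_deps, generated_deps) sets, one pair per function.
    """
    covered = sum(len(set(truth) & set(generated)) for truth, generated in pairs)
    total = sum(len(set(truth)) for truth, _ in pairs)
    if total == 0:
        raise ValueError("no ground-truth dependencies to cover")
    return 100.0 * covered / total


def static_validity_rate(passed_flags):
    """Percentage of generated functions that pass the static check."""
    if not passed_flags:
        raise ValueError("no functions to evaluate")
    return 100.0 * sum(bool(flag) for flag in passed_flags) / len(passed_flags)


# Toy data: two functions for coverage, four check outcomes for validity.
coverage = dependency_coverage([
    ({"self._registered_updates", "helper"}, {"self._registered_updates", "other"}),
    ({"Config.load"}, {"Config.load"}),
])
validity = static_validity_rate([True, False, True, True])
print(coverage, validity)
```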

Key Results

Results on the custom benchmark of 12,406 functions show significant improvements in repository-specific metrics (Dependency Coverage and Static Validity) across all three models.
Benchmark	Metric	Baseline	This Paper	Δ (absolute)
Custom Benchmark (12k functions)	Dependency Coverage	39.8	55.4	+15.6
Custom Benchmark (12k functions)	Static Validity Rate	41.6	65.6	+24.0
Custom Benchmark (12k functions)	Dependency Coverage	45.9	60.3	+14.4
Results on CoderEval (176 tasks) measuring functional correctness via test cases.
Benchmark	Metric	Baseline	This Paper	Δ (absolute)
CoderEval	Pass@1	20.5	28.7	+8.2
CoderEval	Pass@1	36.4	45.5	+9.1
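The table reports absolute differences, while the percentages in Evaluation Highlights appear to be relative to the baseline. A short calculation over the table rows shows how the two relate; the `relative_gain` helper is a hypothetical name, and the rows are copied from the table without model labels because the table does not give them.

```python
def relative_gain(baseline, ours):
    """Improvement over the baseline, as a percentage of the baseline."""
    return (ours - baseline) / baseline * 100.0


# (metric, baseline, this paper), copied from the table above
ROWS = [
    ("Dependency Coverage", 39.8, 55.4),
    ("Static Validity Rate", 41.6, 65.6),
    ("Dependency Coverage", 45.9, 60.3),
    ("Pass@1", 20.5, 28.7),
    ("Pass@1", 36.4, 45.5),
]

gains = [round(relative_gain(base, ours), 1) for _, base, ours in ROWS]
for (metric, base, ours), gain in zip(ROWS, gains):
    print(f"{metric}: {base} -> {ours} (+{ours - base:.1f} absolute, +{gain}% relative)")
```

The computed values land on, or within about 0.1 of, the relative figures quoted in Evaluation Highlights; the small gap is consistent with the table entries being rounded.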

Experiment Figures

Example of an augmented function used for training.

Main Takeaways

ToolGen consistently outperforms base models in handling repository dependencies without sacrificing general code quality (BLEU scores remain competitive).
The approach is model-agnostic, showing gains across decoder-only (CodeGPT, CodeLlama) and encoder-decoder (CodeT5) architectures.
Inference latency is manageable (0.63s - 2.34s per function), making it practical for real-time coding assistants.
The use of <COMP> tokens effectively teaches the model *where* dependencies are likely to occur, reducing the number of tool invocations compared to triggering at every token.

📚 Prerequisite Knowledge

Prerequisites

Transformer architecture (Decoder-only vs Encoder-Decoder)
Static Program Analysis (ASTs)
Language Modeling (Next token prediction)

Key Terms

Repository-level dependencies: User-defined functions, classes, and variables that exist elsewhere in the project and must be referenced correctly.

Dependency Coverage: A metric quantifying the proportion of repository-level dependencies in ground-truth code that are successfully covered by the generated code.

Static Validity Rate: A metric measuring the percentage of generated functions that pass a static dependency error check (e.g., no undefined variables).

AST: Abstract Syntax Tree—a tree representation of the abstract syntactic structure of source code.

Jedi: A static analysis tool for Python that provides autocompletion suggestions based on project context.

LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique that freezes pre-trained weights and injects trainable rank-decomposition matrices.

Pass@1: The probability that the top-1 generated code solution passes the unit tests.