Jellyfish: Instruction-Tuning Local Large Language Models for Data Preprocessing

📝 Paper Summary

Data Preprocessing (DP) with LLMs
Instruction Tuning for Structured Data

Jellyfish instruction-tunes small, local LLMs (7B–13B) specifically for data preprocessing tasks using knowledge injection and reasoning distillation, offering a secure alternative to cloud-based GPT APIs.

Core Problem

Existing LLM-based data preprocessing solutions rely on external APIs (like GPT-4), creating data privacy risks and high costs, while non-LLM methods lack generalization across different tasks.

Why it matters:

Data breach concerns prevent sensitive industries (finance, healthcare) from using powerful API-based LLMs for cleaning data
Specialized domains often require custom fine-tuning which is impossible or prohibitively expensive with closed-source APIs
Previous non-LLM methods are task-specific, requiring separate models for error detection, matching, and imputation

Concrete Example: In Entity Matching, a standard model might fail to match 'Sequoia American Amber Ale' and 'Aarhus Cains Triple A' because it lacks reasoning about brewery variations. Jellyfish incorporates reasoning data to explicitly explain why product names differ, improving decision accuracy.

Key Novelty

Universal Local DP Solver with Reasoning Distillation

Instruction-tunes open local models (Mistral, Llama 3) on a unified collection of four distinct preprocessing tasks (Error Detection, Imputation, Schema/Entity Matching)
Injects domain knowledge (e.g., 'missing values are not matches') directly into prompts during tuning to prevent common failures
Distills reasoning capabilities from a larger teacher model (Mixtral-8x7B) so the smaller local model can explain its preprocessing decisions

Architecture

Overview of the instruction tuning pipeline: preparing data from raw datasets, creating instruction prompts with knowledge injection and reasoning, tuning base LLMs, and inference on seen/unseen tasks.

Evaluation Highlights

Jellyfish-13B outperforms previous best non-LLM methods on all 19 tested datasets (seen and unseen), establishing a new state-of-the-art for local models
Jellyfish-13B averages 86.02 across seen tasks, surpassing GPT-3.5 (84.17) and coming close to GPT-4 on specific tasks (no directly comparable GPT-4 average is given)
Zero-shot generalization to unseen tasks (Column Type Annotation) improves by +25.6 points (Jellyfish-13B vs base model), showing transfer learning capability

Breakthrough Assessment

8/10

Strong practical contribution: demonstrates that small, secure local models can beat GPT-3.5 and specialized non-LLM baselines on dirty data tasks. The reasoning distillation and knowledge injection are effective adaptations.

⚙️ Technical Details

Problem Definition

Setting: Given a dirty relational record or pair of records/attributes, predict a cleaning action (error presence, missing value, match status) or extraction result.

Inputs: Prompt containing system message, task description, injected knowledge, and serialized instance content (e.g., 'Product A: [...] Product B: [...]')

Outputs: Natural language response containing the decision (Yes/No/Value) and optionally a reasoning explanation

Pipeline Flow

Data Serialization (Table Row → Text Prompt)
Knowledge Injection (Add rules/constraints to Prompt)
Jellyfish Model Inference (Text Prompt → Cleaning Decision + Reasoning)
Output Parsing (Extract final answer)
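
The four steps above can be sketched end to end as follows. This is a minimal illustration, not the paper's implementation: the system message, the rule text, the prompt layout, the `style` attribute, and the `stub_model` function are all hypothetical, and the parser assumes the response ends with its Yes/No decision.

```python
import re
from typing import Callable, Dict, List, Optional

# Hypothetical wording; the paper's actual system message and rules may differ.
SYSTEM_MESSAGE = "You are an assistant for data preprocessing."
EM_KNOWLEDGE = ["Missing values (N/A) are not evidence of a match."]

Record = Dict[str, Optional[str]]


def serialize_record(record: Record) -> str:
    """Step 1: table row -> 'col: val' text, writing N/A for missing cells."""
    return ", ".join(
        f"{col}: {'N/A' if val in (None, '') else val}" for col, val in record.items()
    )


def build_prompt(a: Record, b: Record, knowledge: List[str]) -> str:
    """Step 2: combine system message, task description, injected rules, instance."""
    lines = [
        SYSTEM_MESSAGE,
        "Decide whether Product A and Product B are the same real-world entity.",
    ]
    lines += [f"Note: {rule}" for rule in knowledge]
    lines += [
        f"Product A: [{serialize_record(a)}]",
        f"Product B: [{serialize_record(b)}]",
        "Answer Yes or No.",
    ]
    return "\n".join(lines)


def parse_decision(response: str) -> Optional[bool]:
    """Step 4: return the last standalone Yes/No in the response, or None."""
    hits = re.findall(r"\b(yes|no)\b", response, flags=re.IGNORECASE)
    if not hits:
        return None
    return hits[-1].lower() == "yes"


def run_pipeline(a: Record, b: Record, model: Callable[[str], str]) -> Optional[bool]:
    """Steps 1-4 chained; `model` is any callable mapping prompt text to reply text."""
    return parse_decision(model(build_prompt(a, b, EM_KNOWLEDGE)))


def stub_model(prompt: str) -> str:
    """Step 3 stand-in: a canned reasoning-then-answer reply, not a real model."""
    return "The product names share little overlap beyond generic terms. No"
```

Passing the model as a plain callable keeps the sketch independent of any serving stack; a locally hosted model could be wrapped in a function with the same signature.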

System Modules

Serializer

Convert structured data records into natural language prompts

Model or implementation: Rule-based template

Jellyfish Model

Generate cleaning decision and reasoning

Model or implementation: Jellyfish-7B (Mistral), 8B (Llama 3), or 13B (OpenOrca-Platypus2)

Novel Architectural Elements

Knowledge Injection mechanism: Explicitly incorporating domain rules (e.g., error types, matching constraints) into the instruction prompt during both tuning and inference to guide the model

Modeling

Base Model: Mistral-7B-Instruct-v0.2, Llama-3-8B-Instruct, OpenOrca-Platypus2-13B

Training Method: Supervised Fine-Tuning (Instruction Tuning) with LoRA

Adaptation: LoRA (rank=32, alpha=32)

Trainable Parameters: LoRA adapters targeting q_proj, k_proj, v_proj, o_proj
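
To give a sense of scale for this adapter setup, the sketch below counts the LoRA parameters added at rank 32 on the four attention projections. The layer shapes are an assumption based on the public Mistral-7B configuration (hidden size 4096, 8 key/value heads of dimension 128, 32 decoder layers); they are not stated in this summary, and the other base models would give different totals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Projection:
    name: str
    in_features: int
    out_features: int


# Assumed Mistral-7B attention shapes (not taken from the paper summary).
HIDDEN = 4096
KV_DIM = 8 * 128  # grouped-query attention: 8 key/value heads of size 128
NUM_LAYERS = 32

PROJECTIONS = [
    Projection("q_proj", HIDDEN, HIDDEN),
    Projection("k_proj", HIDDEN, KV_DIM),
    Projection("v_proj", HIDDEN, KV_DIM),
    Projection("o_proj", HIDDEN, HIDDEN),
]


def lora_params(proj: Projection, rank: int) -> int:
    """LoRA adds A (rank x in_features) and B (out_features x rank) per layer."""
    return rank * (proj.in_features + proj.out_features)


def total_lora_params(rank: int = 32, num_layers: int = NUM_LAYERS) -> int:
    """Sum adapter parameters over the four projections and all layers."""
    per_layer = sum(lora_params(p, rank) for p in PROJECTIONS)
    return per_layer * num_layers


def lora_scaling(rank: int = 32, alpha: int = 32) -> float:
    """Standard LoRA scales the low-rank update by alpha / rank."""
    return alpha / rank


print(total_lora_params())  # 27262976 under the assumed shapes
print(lora_scaling())       # 1.0
```

Under these assumptions the adapters hold roughly 27M trainable parameters, well under 1% of a 7B-parameter model, and rank = alpha gives a scaling factor of 1.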

Training Data:

Selected subsets from standard DP benchmarks (Adult, Hospital, Buy, Restaurant, MIMIC-III, Magellan repo)
Total pool constrained to <115k instances to match baselines
Reasoning data generated by Mixtral-8x7B-Instruct-v0.1 (distillation)

Key Hyperparameters:

learning_rate: 3e-5
batch_size: 2 (per device)
num_epochs: 5
lora_rank: 32
lora_alpha: 32
gradient_accumulation_steps: 2

Compute: Tuning: 3-5 hours on 8x A100 80GB. Inference: 0.07-0.15s per instance on single A100.
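
The figures above imply some simple arithmetic, sketched below under two assumptions that the summary does not confirm: that all 8 GPUs are used data-parallel during tuning, and that inference processes instances one at a time on a single GPU with no batching.

```python
def effective_batch_size(per_device: int, num_devices: int, grad_accum_steps: int) -> int:
    """Instances contributing to one optimizer step under data parallelism."""
    return per_device * num_devices * grad_accum_steps


def inference_hours(num_instances: int, seconds_per_instance: float) -> float:
    """Wall-clock hours for sequential, unbatched inference."""
    return num_instances * seconds_per_instance / 3600.0


# batch_size 2 per device, 8 GPUs (assumed all data-parallel), 2 accumulation steps
print(effective_batch_size(2, 8, 2))  # 32

# A hypothetical table with one million instances at the reported latency range
low = inference_hours(1_000_000, 0.07)
high = inference_hours(1_000_000, 0.15)
print(f"{low:.1f} to {high:.1f} hours")  # 19.4 to 41.7 hours
```

The second estimate illustrates why per-instance processing can become costly on very large tables unless requests are batched or spread across several GPUs.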

Comparison to Prior Work

vs. HoloDetect/Raha: Jellyfish is a single unified model for multiple tasks vs. task-specific pipelines
vs. Ditto: Jellyfish instruction-tunes a generative LLM, whereas Ditto fine-tunes a BERT-based classifier (a contrast the paper does not state explicitly)
vs. GPT-4: Jellyfish runs locally for privacy vs. API dependency
vs. Table-GPT: Jellyfish explicitly incorporates reasoning traces in training data

Limitations

Instance-based processing may be inefficient for very large datasets compared to batch/table-based non-LLM methods
Reasoning data degraded performance for the 13B model (OpenOrca-Platypus2), forcing a trade-off between reasoning capability and raw DP accuracy
Potential for error propagation: if Error Detection fails, subsequent Imputation might introduce new noise

Reproducibility

Code: https://huggingface.co/NECOUDBFM/Jellyfish

Publicly available at https://huggingface.co/NECOUDBFM/Jellyfish. Code and datasets are released, hyperparameters are provided in the Appendix, and the base models have open weights.

📊 Experiments & Results

Evaluation Setup

Zero-shot evaluation on seen/unseen datasets for local models; Few-shot for API baselines.

Benchmarks:

Magellan Repository (Entity Matching)
Adult / Hospital (Error Detection)
Buy / Restaurant (Data Imputation)
MIMIC-III / Synthea (Schema Matching)

Metrics:

F1 score (ED, SM, EM, AVE)
Accuracy (DI)
Micro-F1 (CTA)
Statistical methodology: Not explicitly reported in the paper
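
For reference, the three metrics can be computed as in the sketch below. It uses the standard definitions and treats CTA as single-label multi-class classification, which is an assumption about the evaluation setup rather than something this summary states; the paper's own evaluation scripts may differ in details such as how unparseable answers are counted.

```python
from typing import Hashable, Sequence


def binary_f1(y_true: Sequence[bool], y_pred: Sequence[bool]) -> float:
    """F1 on the positive class, as used for Yes/No tasks such as ED or EM."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def accuracy(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Fraction of exact matches, as used for imputed values (DI)."""
    if not y_true:
        return 0.0
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def micro_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Micro-F1: pool TP/FP/FN over all classes before computing F1."""
    labels = set(y_true) | set(y_pred)
    tp = fp = fn = 0
    for label in labels:
        for t, p in zip(y_true, y_pred):
            if p == label and t == label:
                tp += 1
            elif p == label:
                fp += 1
            elif t == label:
                fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


print(binary_f1([True, True, False, False], [True, False, True, False]))  # 0.5
print(micro_f1(["city", "price", "city", "date"],
               ["city", "city", "city", "date"]))  # 0.75
```

In the single-label case every wrong prediction counts once as a false positive and once as a false negative, so micro-F1 coincides with accuracy.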

Key Results

Jellyfish models consistently outperform task-specific non-LLM baselines and older GPT models on seen datasets, with the 13B model often rivaling GPT-4.

Benchmark	Metric	Baseline	This Paper	Δ
Adult (Error Detection)	F1	99.10	99.33	+0.23
Buy (Data Imputation)	Accuracy	96.50	100.00	+3.50
MIMIC-III (Schema Matching)	F1	20.00	40.00	+20.00
Fodors-Zagats (Entity Matching)	F1	100.00	100.00	0.00

Generalization to unseen tasks (CTA and AVE) shows Jellyfish models significantly outperforming their base models and competitive baselines.

Benchmark	Metric	Baseline	This Paper	Δ
SOTAB (Column Type Annotation)	Micro-F1	23.49	83.00	+59.51
AE-110k (Attribute Value Extraction)	F1	55.77	59.55	+3.78

Experiment Figures

Impact of single-task data size on performance across all tasks.

Impact of reasoning data size on DP performance for 7B, 8B, and 13B models.

Main Takeaways

Jellyfish-13B is a superior universal solver, beating specialized non-LLM methods on all 19 datasets tested
Instruction tuning on data preprocessing tasks transfers well: massive gains on unseen tasks like Column Type Annotation (+59.5 points for 7B model)
Knowledge injection is critical: disabling it drops 13B model performance significantly (e.g., from 99.33 to 72.00 on Adult ED)
Reasoning data helps the smaller models (7B/8B) but can hurt the larger, architecturally different 13B model, suggesting a reasoning 'tax' for some models when the distilled logic conflicts with the base model's priors

📚 Prerequisite Knowledge

Prerequisites

Basic understanding of data cleaning tasks (Error Detection, Entity Matching)
Familiarity with Instruction Tuning and LoRA
Knowledge of LLM prompting strategies (few-shot, reasoning)

Key Terms

DP: Data Preprocessing—tasks like cleaning, integrating, and transforming raw data into a usable format

ED: Error Detection—identifying incorrect values in a dataset

DI: Data Imputation—filling in missing values in a dataset

SM: Schema Matching—identifying whether two database columns (attributes) refer to the same concept

EM: Entity Matching—identifying whether two records refer to the same real-world entity

CTA: Column Type Annotation—inferring the semantic type (e.g., 'city', 'price') of a table column

AVE: Attribute Value Extraction—extracting specific attribute values from unstructured text descriptions

Knowledge Injection: Adding explicit rules or domain constraints (e.g., 'treat N/A as non-match') into the prompt during training/inference

Instance Serialization: Converting structured table rows into a string format (e.g., 'col: val') for LLM input

LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique that freezes the main model weights and trains small adapter matrices

Prefix Caching: A vLLM optimization that reuses the computation of the common prompt prefix across a batch of requests to speed up inference
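
The benefit of prefix caching depends on how much of each prompt is shared. The sketch below estimates that share for a batch of prompts, using whitespace-separated words as a stand-in for model tokens. It is only an illustration of the idea: it does not model vLLM's internal cache management, and the prompt strings are made up.

```python
from typing import List, Sequence


def shared_prefix_len(token_lists: Sequence[List[str]]) -> int:
    """Number of leading tokens that are identical across all prompts."""
    if not token_lists:
        return 0
    count = 0
    for column in zip(*token_lists):
        if any(tok != column[0] for tok in column):
            break
        count += 1
    return count


def reusable_fraction(prompts: Sequence[str]) -> float:
    """Share of all prompt tokens that repeat an already-seen common prefix.

    The first request still has to compute the prefix; every later request
    could in principle reuse it.
    """
    token_lists = [p.split() for p in prompts]
    total = sum(len(tokens) for tokens in token_lists)
    if total == 0:
        return 0.0
    prefix = shared_prefix_len(token_lists)
    return prefix * (len(token_lists) - 1) / total


prompts = [
    "Task: match products. A: x B: y",
    "Task: match products. A: p B: q",
]
print(shared_prefix_len([p.split() for p in prompts]))  # 4
print(round(reusable_fraction(prompts), 3))             # 0.286
```

Because the system message, task description, and injected knowledge are the same for every instance of a task, placing them before the serialized instance maximizes the shared prefix.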