Tab-Cleaner: Weakly supervised tabular data cleaning via pre-training for E-commerce catalog

📝 Paper Summary

Data Cleaning · Table Representation Learning · E-commerce

Tab-Cleaner detects errors in wide, text-rich product catalogs by pre-training a hierarchical Transformer that efficiently models interactions between products, attributes, and long text descriptions without full quadratic attention.

Core Problem

Standard language models fail to capture tabular structures (row/column correlations) and cannot handle the extremely long sequences formed by concatenating rich textual attributes in wide product catalogs.

Why it matters:

Product catalogs are self-reported by retailers, inevitably containing noisy facts that harm downstream applications like search and recommendation.
Attributes are strongly correlated (e.g., 'Flavor' vs 'Ingredient'), meaning simple per-column anomaly detection fails.
Existing table models truncate inputs (e.g., at 512 tokens), causing information loss for wide tables with long descriptions.

Concrete Example: A 'Tortilla Chips' product listing has 'Spicy Queso' as flavor but 'Cheddar' in another field, or a non-edible 'Sippy Cup' listing erroneously includes a 'Flavor' attribute. Standard models treating this as a flat text sequence either truncate the context needed to spot the contradiction or fail to model the 'flavor' column's specific dependency on 'ingredients'.

Key Novelty

Hierarchical Attention for Wide Text-Rich Tables

Decomposes table modeling into two levels: first encoding individual cells (attributes) using a local+global attention window, then encoding rows (products) by attending only to cell representations ([COL] tokens).
Introduces specific pre-training objectives for tables: correcting swapped cells (structure-aware) and predicting product categories (row-aware), alongside standard masked language modeling.
Enables conditional encoding where feature attributes (short specs) explicitly attend to context attributes (long descriptions) to verify consistency.

Architecture

Figure: The hierarchical attention mechanism. (a) Cell encoding: tokens attend locally within the cell. (b) Row encoding: [CLS] attends to [COL] tokens. (c) Combined view.

Evaluation Highlights

+16% PR AUC improvement on attribute applicability classification over state-of-the-art baselines (DistilBERT, Longformer) on Amazon Product Catalog data.
+11% PR AUC improvement on attribute value validation tasks compared to baselines.
Reduces pre-training time by ~64% compared to Longformer on wide tables (21.63 vs 60.56 hours/epoch) due to sparse hierarchical attention.
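
The ~64% figure follows directly from the two per-epoch times quoted above. A quick check of the arithmetic (the variable names are illustrative only):

```python
# Hours per pre-training epoch on the wide table, as quoted in the summary.
tab_cleaner_hours = 21.63
longformer_hours = 60.56

# Fraction of Longformer's time that is saved, and the equivalent speedup factor.
reduction = 1 - tab_cleaner_hours / longformer_hours
speedup = longformer_hours / tab_cleaner_hours

print(f"relative reduction: {reduction:.1%}")
print(f"speedup factor: {speedup:.2f}x")
```

In other words, one Longformer epoch takes roughly as long as 2.8 Tab-Cleaner epochs on this table.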

Breakthrough Assessment

7/10

Strong practical contribution for the specific domain of text-rich tabular data. The hierarchical attention mechanism effectively addresses the length limitation of Transformers for wide tables, showing significant empirical gains.

⚙️ Technical Details

Problem Definition

Setting: Given a product catalog table T where rows are products and columns are attributes, identify incorrect cells T_ij (row i, column j), i.e., inapplicable attributes or incorrect values.

Inputs: A flattened sequence of tokens representing a row, with special tokens [CLS], [COL], and column names prepended to values.

Outputs: Binary classification probability indicating whether a specific target attribute (cell) is erroneous.
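
The input format described above can be sketched as follows. This is a minimal illustration, not the paper's preprocessing: whitespace splitting stands in for a real subword tokenizer, and `flatten_row`, `column_ids`, and the 0/1/2 coding of `type_ids` are hypothetical names and conventions chosen for the example.

```python
def flatten_row(row):
    """Flatten {column_name: value} into one token sequence.

    Assumed layout: [CLS], then for every cell: [COL], header tokens, value tokens.
    Also returns a per-token column index and a per-token type id.
    """
    tokens = ["[CLS]"]
    column_ids = [0]  # index 0 is reserved for [CLS]
    type_ids = [0]    # 0 = special token, 1 = header, 2 = value (hypothetical coding)
    for col_index, (name, value) in enumerate(row.items(), start=1):
        pieces = [("[COL]", 0)]
        pieces += [(tok, 1) for tok in name.split()]
        pieces += [(tok, 2) for tok in str(value).split()]
        for tok, type_id in pieces:
            tokens.append(tok)
            column_ids.append(col_index)
            type_ids.append(type_id)
    return tokens, column_ids, type_ids


row = {"title": "Spicy Queso Tortilla Chips", "flavor": "Cheddar"}
tokens, column_ids, type_ids = flatten_row(row)
print(tokens)
print(column_ids)
```

The column index and type id are kept alongside the tokens because the embedding layer described under System Modules consumes both.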

Pipeline Flow

Input Flattening (add [COL] and headers)
Hierarchical Encoder (Cell-level encoding -> Row-level encoding)
Pre-training (MLM + Cell Corruption + Category Prediction)
Fine-tuning (Binary Classification on specific attributes)

System Modules

Input Embedding Layer

Combine token, position, column-index, and header/value-type embeddings into a single vector.

Model or implementation: Custom Embedding Layer
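
One plausible way to combine the four embeddings is BERT-style element-wise summation. The summary only says they are combined "into a single vector", so the summation, the toy dimension of 8, and the table sizes below are assumptions made for illustration.

```python
import random


def make_table(num_rows, dim, rng):
    """Randomly initialised lookup table standing in for learned weights."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(num_rows)]


def embed(token_ids, column_ids, type_ids, tables):
    """Sum token, position, column-index, and header/value-type embeddings per token."""
    token_tab, pos_tab, col_tab, type_tab = tables
    vectors = []
    for position, (t, c, s) in enumerate(zip(token_ids, column_ids, type_ids)):
        parts = (token_tab[t], pos_tab[position], col_tab[c], type_tab[s])
        # zip(*parts) walks the four vectors dimension by dimension.
        vectors.append([sum(values) for values in zip(*parts)])
    return vectors


rng = random.Random(0)
DIM = 8  # toy size; the summary lists hidden_dim 768 for the real model
tables = (
    make_table(100, DIM, rng),  # token vocabulary (toy size)
    make_table(16, DIM, rng),   # positions
    make_table(4, DIM, rng),    # column indices
    make_table(3, DIM, rng),    # types: special / header / value
)
vectors = embed(token_ids=[1, 2, 40, 41], column_ids=[0, 1, 1, 1],
                type_ids=[0, 0, 1, 2], tables=tables)
print(len(vectors), len(vectors[0]))  # 4 8
```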

Hierarchical Attention Encoder

Encode the table row efficiently. Level 1: Cells attend to their own tokens (local+global window). Level 2: [CLS] attends to [COL] tokens to capture row-level interactions.

Model or implementation: Modified DistilBERT Encoder
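
To make the two levels concrete, the sketch below builds a boolean attention mask for a tiny flattened row. It is a simplification under stated assumptions: both levels are collapsed into a single mask, every token in a cell sees its whole cell (the local+global window for long cells is ignored), and `hierarchical_mask` is a hypothetical helper, not code from the paper.

```python
def hierarchical_mask(tokens):
    """Return allowed[i][j]: True when token i may attend to token j.

    Level 1: a token attends only to tokens of its own cell.
    Level 2: [CLS] attends to itself and to every [COL] token.
    """
    n = len(tokens)
    cell_of, cell = [], -1
    for tok in tokens:
        if tok == "[COL]":
            cell += 1          # each [COL] opens a new cell
        cell_of.append(cell)   # [CLS] precedes the first [COL], so it gets -1
    allowed = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if tokens[i] == "[CLS]":
                allowed[i][j] = (i == j) or tokens[j] == "[COL]"
            else:
                allowed[i][j] = cell_of[i] == cell_of[j]
    return allowed


tokens = ["[CLS]", "[COL]", "flavor", "cheddar", "[COL]", "title", "chips"]
mask = hierarchical_mask(tokens)
kept = sum(sum(row) for row in mask)
print(f"{kept} of {len(tokens) ** 2} token pairs are attended")
```

Even on this seven-token row fewer than half of the pairs remain, and the gap widens as cells get longer, because cost grows with the sum of squared cell lengths rather than the square of the row length.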

Classification Head (Fine-tuning)

Predict if an attribute is valid.

Model or implementation: 2-layer MLP with ReLU
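
A 2-layer MLP with ReLU can be written in a few lines. The sigmoid on the output is an assumption (the summary only states that the output is a binary classification probability), and the weights and the 3-dimensional input are made-up toy values.

```python
import math


def mlp_head(x, w1, b1, w2, b2):
    """Linear -> ReLU -> Linear -> sigmoid, returning the probability of an error."""
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(w1, b1)]
    logit = sum(w * h for w, h in zip(w2, hidden)) + b2
    return 1.0 / (1.0 + math.exp(-logit))


# Toy values only; in the model, x would be the encoder output for the target attribute.
x = [0.5, -1.0, 2.0]
w1 = [[0.2, 0.0, 0.1], [-0.3, 0.4, 0.0]]
b1 = [0.0, 0.1]
w2 = [1.0, -2.0]
b2 = -0.3
p_error = mlp_head(x, w1, b1, w2, b2)
print(p_error)
```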

Novel Architectural Elements

Hierarchical Attention Mechanism: Two-level attention (intra-cell then inter-cell) to enforce tabular structure and reduce complexity.
Local + Global Dilated Attention for Context Attributes: Specific attention pattern for long text fields where tokens attend locally and [COL] attends globally via dilation.
Conditional Encoding: Strategy to explicitly concatenate relevant context (extracted via string matching) to feature attributes to capture cross-cell dependencies.
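
The string-matching step behind conditional encoding might look like the sketch below. The sentence splitter, the keyword-overlap rule, the `[SEP]` separator, and both function names are assumptions for illustration; the summary does not specify how matching is performed.

```python
import re


def matching_context(feature_value, context_text, max_sentences=2):
    """Return context sentences that share at least one word with the feature value."""
    sentences = re.split(r"(?<=[.!?])\s+", context_text.strip())
    keywords = {w.lower() for w in re.findall(r"\w+", feature_value)}
    hits = [s for s in sentences
            if keywords & {w.lower() for w in re.findall(r"\w+", s)}]
    return hits[:max_sentences]


def conditional_input(column, value, context_text):
    """Concatenate a feature attribute with the context sentences that mention it."""
    context = " ".join(matching_context(value, context_text))
    return f"[COL] {column} {value} [SEP] {context}".strip()


description = ("Crunchy tortilla chips with a bold taste. "
               "Made with real cheddar cheese. Gluten free.")
encoded = conditional_input("flavor", "Cheddar", description)
print(encoded)
```

Exact word overlap finds "cheddar" here but would miss a paraphrase such as "sharp cheese", which is the weakness noted under Limitations.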

Modeling

Base Model: DistilBERT

Training Method: Pre-training followed by Fine-tuning

Objective Functions:

Purpose: Learn token representations.
Formally: Standard Masked Language Modeling (MLM) loss.

Purpose: Learn cell-level validity (structure-aware).
Formally: Binary classification on [COL] tokens to detect swapped cells (swapped within row or within column).

Purpose: Learn row-level semantics.
Formally: Multi-class classification on [CLS] token to predict product category.
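
Training data for the swapped-cell objective can be generated from clean rows without any labels. The sketch below is one way to do it under assumptions: the 50/50 split between within-row and within-column swaps, the `swap_prob` parameter, and the rule that each cell takes part in at most one swap are illustrative choices, not details from the paper.

```python
import random


def corrupt_cells(rows, columns, swap_prob, rng):
    """Swap some cells and return (corrupted_rows, labels).

    labels[r][c] is 1 when the value at row r, column c differs from the original.
    """
    corrupted = [dict(row) for row in rows]
    touched = set()  # (row_index, column) pairs that already took part in a swap
    for r in range(len(corrupted)):
        for c in columns:
            if (r, c) in touched or rng.random() >= swap_prob:
                continue
            if rng.random() < 0.5:
                # Within-row swap: partner is another column of the same product.
                partners = [(r, k) for k in columns if k != c]
            else:
                # Within-column swap: partner is the same column of another product.
                partners = [(k, c) for k in range(len(corrupted)) if k != r]
            partners = [p for p in partners if p not in touched]
            if not partners:
                continue
            pr, pc = rng.choice(partners)
            corrupted[r][c], corrupted[pr][pc] = corrupted[pr][pc], corrupted[r][c]
            touched.update({(r, c), (pr, pc)})
    labels = [{c: int(corrupted[r][c] != rows[r][c]) for c in columns}
              for r in range(len(rows))]
    return corrupted, labels


rows = [
    {"flavor": "Spicy Queso", "color": "Yellow"},
    {"flavor": "Vanilla", "color": "White"},
]
corrupted, labels = corrupt_cells(rows, ["flavor", "color"], swap_prob=1.0,
                                  rng=random.Random(0))
print(corrupted)
print(labels)
```

Labels are derived by comparing against the original, so a swap between two identical values is correctly left unlabeled.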

Adaptation: Fine-tuning on labeled error detection data

Training Data:

Pre-training: Unlabeled Amazon Product Catalog (Standard Table: 3.1M products, Wide Table: 677k products).
Fine-tuning: Manually annotated datasets for Applicability (11k samples) and Value Validation (1.7k samples).

Key Hyperparameters:

learning_rate: 5e-5
batch_size: 32
epochs_pretraining: 3
epochs_finetuning: 10
hidden_dim: 768
layers: 6
heads: 12
warmup_steps: 0

Compute: Pre-training time: 21.63 hours/epoch (Wide Table). Memory: ~26.5 GB.

Comparison to Prior Work

vs. DistilBERT: Tab-Cleaner handles sequences >512 tokens via hierarchical attention and adds table-specific pre-training objectives.
vs. Longformer: Tab-Cleaner uses structure-aware attention (cells vs. rows) rather than generic sliding windows, preserving tabular semantics better.
vs. TaBERT/TURL [not cited as direct baseline]: Tab-Cleaner focuses on 'text-rich' single broad tables (catalogs) with millions of rows, rather than reasoning over many small Wikipedia-style tables.

Limitations

Relies on extracting relevant context via string matching for the conditional encoding, which may miss semantic relevance not captured by keywords.
Evaluation is limited to Amazon proprietary datasets; no standard public benchmarks (like standard table cleaning datasets) were used.
Strict hierarchical attention might prevent direct token-to-token interaction across different columns, relying solely on [COL] tokens for cross-attribute information.

Reproducibility

No code is released, and the paper lists no public URL for code or data. The datasets are proprietary Amazon Product Catalog data, though constructed from public web pages.

📊 Experiments & Results

Evaluation Setup

Binary classification of attribute errors (Applicability and Value Validation) on Amazon Product Catalog data.

Benchmarks:

Amazon Product Catalog (Standard Table) (Attribute Error Detection) [New]
Amazon Product Catalog (Wide Table) (Attribute Error Detection (Long Sequence)) [New]

Metrics:

PR AUC (Precision-Recall AUC)
ROC AUC
Recall at Precision=0.9 (R@P=0.9)
Statistical methodology: Not explicitly reported in the paper
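
Recall at Precision=0.9 is the least familiar of these metrics. A minimal sketch of how it can be computed from scores and binary labels follows; it treats every rank as a candidate threshold and does not handle tied scores specially, and the data are made up for illustration.

```python
def recall_at_precision(labels, scores, min_precision=0.9):
    """Highest recall over all score thresholds whose precision is >= min_precision."""
    total_pos = sum(labels)
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    best, true_pos = 0.0, 0
    for k, (_, label) in enumerate(ranked, start=1):
        true_pos += label  # flag the top-k scored items as errors
        if true_pos / k >= min_precision:
            best = max(best, true_pos / total_pos)
    return best


# 1 = erroneous cell, 0 = correct cell; scores are model confidences (toy data).
labels = [1, 1, 0, 1, 0, 0, 1, 0]
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30]
print(recall_at_precision(labels, scores))                      # 0.5
print(recall_at_precision(labels, scores, min_precision=0.75))  # 0.75
```

The metric answers a practical question for catalog cleaning: if flagged cells must be right at least 90% of the time, what share of all errors can be caught?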

Key Results

Attribute Applicability Classification results (Standard Table). Tab-Cleaner significantly outperforms all baselines in high-precision regimes.

Benchmark	Metric	Baseline	This Paper	Δ
Standard Table	PR AUC	0.593	0.613	+0.020
Standard Table	R@P=0.9	0.296	0.379	+0.083

Attribute Applicability on Wide Tables (Long Sequences). Tab-Cleaner handles length better than truncated DistilBERT and generic Longformer.

Benchmark	Metric	Baseline	This Paper	Δ
Wide Table	PR AUC	0.533	0.541	+0.008
Wide Table	R@P=0.9	0.185	0.300	+0.115

Attribute Value Validation results. Detecting incorrect values (e.g., wrong flavor) rather than just inapplicable attributes.

Benchmark	Metric	Baseline	This Paper	Δ
Standard Table	PR AUC	0.622	0.623	+0.001
Standard Table	R@P=0.9	0.011	0.059	+0.048

Main Takeaways

Hierarchical attention allows scaling to wide tables with long text without the severe performance degradation seen in DistilBERT (due to truncation) or the high compute cost of Longformer.
Pre-training with table-specific objectives (Cell Swap, Category Prediction) is crucial; ablation shows pre-training adds +0.273 to PR AUC compared to non-pretrained baselines.
NLI (Natural Language Inference) approaches perform poorly (PR AUC ~0.33 vs ~0.61 for Tab-Cleaner), suggesting that treating validation as sentence entailment is insufficient compared to holistic table modeling.

📚 Prerequisite Knowledge

Prerequisites

Transformer architecture (Self-Attention)
Masked Language Modeling (MLM)
Pre-training / Fine-tuning paradigm

Key Terms

Context attributes: Long text fields describing general product info (e.g., Description, Title).

Feature attributes: Short text fields describing specific properties (e.g., Color, Size, Flavor).

Hierarchical attention: A mechanism that first attends locally within cells and then aggregates information across cells, avoiding full N^2 attention over the whole row.

Dilated attention: An attention pattern with gaps (strides) allowing a token to attend to distant tokens without processing every intermediate one, effectively increasing the receptive field.

[COL] token: A special token prepended to each attribute value to serve as the aggregate representation for that specific cell/attribute.

PR AUC: Area Under the Precision-Recall Curve, a metric suitable for imbalanced classification tasks like error detection where errors are rare.
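
The dilated attention pattern defined under Key Terms can be illustrated by listing which positions a single token attends to. The function name and the `window` and `dilation` parameters are illustrative, not taken from the paper.

```python
def dilated_positions(center, length, window, dilation):
    """Positions attended by `center`: `window` steps on each side, `dilation` apart."""
    offsets = range(-window * dilation, window * dilation + 1, dilation)
    return [center + o for o in offsets if 0 <= center + o < length]


# The same budget of five attended positions covers a much wider span when dilated.
print(dilated_positions(center=10, length=32, window=2, dilation=1))  # [8, 9, 10, 11, 12]
print(dilated_positions(center=10, length=32, window=2, dilation=4))  # [2, 6, 10, 14, 18]
```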