Mask-Then-Predict: A self-supervised learning objective where parts of the input (table cells) are hidden, and the model must predict them using context
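To make the objective concrete, here is a minimal, illustrative sketch (not from the source) of the masking step: a `mask_cells` helper (a hypothetical name) hides a random subset of table cells and records the hidden values as prediction targets.

```python
import random

def mask_cells(table, mask_rate=0.3, mask_token="[MASK]", seed=0):
    """Hide a random subset of cells in a table (list of rows).

    Returns the masked table plus a dict mapping (row, col) positions
    to the original cell values the model must predict from context.
    """
    rng = random.Random(seed)  # seeded for reproducibility
    masked, targets = [], {}
    for i, row in enumerate(table):
        new_row = []
        for j, cell in enumerate(row):
            if rng.random() < mask_rate:
                targets[(i, j)] = cell      # record ground truth
                new_row.append(mask_token)  # hide the cell
            else:
                new_row.append(cell)
        masked.append(new_row)
    return masked, targets
```

A real pre-training pipeline would feed the masked table to the model and compute a loss against `targets`; this sketch only shows the corruption step.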
Serialization: The process of converting structured table data into a linear text sequence (e.g., Markdown or CSV string) for LLM input
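As an illustration (not the source's implementation), one common serialization scheme renders each row of a table as a line of a Markdown pipe table:

```python
def serialize_markdown(rows):
    """Serialize a table (list of dicts sharing the same keys)
    into a Markdown pipe-table string for LLM input."""
    headers = list(rows[0].keys())
    lines = [
        "| " + " | ".join(headers) + " |",            # header row
        "| " + " | ".join("---" for _ in headers) + " |",  # separator
    ]
    for row in rows:
        lines.append("| " + " | ".join(str(row[h]) for h in headers) + " |")
    return "\n".join(lines)

table = [{"city": "Paris", "pop_m": 2.1}, {"city": "Tokyo", "pop_m": 13.9}]
print(serialize_markdown(table))
# | city | pop_m |
# | --- | --- |
# | Paris | 2.1 |
# | Tokyo | 13.9 |
```

CSV serialization works the same way with commas instead of pipes; the choice of format can measurably affect downstream LLM performance.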
Instruction Tuning: Training an LLM on pairs of natural language instructions and desired outputs to improve its ability to follow task instructions
ROC-AUC: Area Under the Receiver Operating Characteristic Curve; a classification metric that summarizes performance across all decision thresholds rather than at a single cutoff
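ROC-AUC has an equivalent rank-based interpretation: the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one. A minimal sketch of that computation (illustrative; libraries such as scikit-learn provide optimized versions):

```python
def roc_auc(y_true, y_score):
    """AUC via its rank interpretation: the fraction of (positive, negative)
    pairs where the positive gets the higher score (ties count as 0.5)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

A score of 0.5 corresponds to random ranking and 1.0 to a perfect separation of the two classes.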
R2: Coefficient of determination (R²); a regression metric giving the proportion of variance in the dependent variable that is explained by the independent variable(s)
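Concretely, R² = 1 − SS_res / SS_tot, where SS_res is the sum of squared residuals and SS_tot the total sum of squares around the mean. A small illustrative implementation:

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: R^2 = 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))  # residual variance
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)             # total variance
    return 1.0 - ss_res / ss_tot

print(r_squared([3.0, 2.0, 7.0], [2.5, 2.0, 7.5]))
```

R² = 1 means the predictions explain all the variance; 0 means they do no better than predicting the mean, and values below 0 are possible for models worse than that baseline.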
RoPE: Rotary Positional Embedding—a method for encoding position information in Transformers that generalizes better to sequence lengths longer than those seen during training
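The core idea of RoPE is to rotate consecutive pairs of embedding dimensions by position-dependent angles, so that dot products between query and key vectors depend only on their relative positions. A simplified single-vector sketch (illustrative, not an excerpt from any particular implementation):

```python
import math

def rope(vec, pos, base=10000.0):
    """Apply rotary position encoding to one embedding vector.

    Each (even, odd) dimension pair is rotated by angle
    pos * base**(-i/d), with frequency decreasing along the vector.
    """
    d = len(vec)
    out = list(vec)
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        out[i] = vec[i] * c - vec[i + 1] * s      # 2D rotation of the pair
        out[i + 1] = vec[i] * s + vec[i + 1] * c
    return out
```

Because each pair undergoes a pure rotation, the transform preserves vector norms, and position 0 leaves the vector unchanged.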
Llama-2: A family of openly released Large Language Models developed by Meta
XGBoost: Extreme Gradient Boosting—a scalable tree boosting system widely used for tabular data problems
Zero-shot prediction: Attempting a task without any specific training examples for that task, relying only on the model's pre-existing knowledge