DATA4LLM: The domain of using data management techniques (processing, storage, serving) to support the lifecycle of Large Language Models
LLM4DATA: The domain of using Large Language Models to enhance data management tasks (cleaning, integration, system optimization)
IaaS: Inclusiveness, Abundance, Articulation, Sanitization—the four essential dimensions proposed for assessing LLM dataset quality
RAG: Retrieval-Augmented Generation—systems that retrieve external documents to ground LLM responses
KV-cache: Key-Value cache—storing the key and value tensors computed for previous tokens during autoregressive LLM inference, so that attention over those tokens need not be recomputed at each decoding step
SFT: Supervised Fine-Tuning—training an LLM on labeled examples to follow instructions
CoT: Chain-of-Thought—prompting technique where the model generates intermediate reasoning steps
BO: Bayesian Optimization—a strategy for global optimization of black-box functions, often used in system tuning
RL: Reinforcement Learning—training agents to take actions in an environment to maximize cumulative reward
PII: Personally Identifiable Information—sensitive data that must be filtered from training sets
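Of the terms above, the KV-cache is the most mechanical, so a minimal sketch may help fix the idea. The class and function names below are illustrative, not from the source: at each decoding step, the new token's key and value are appended to the cache once, and attention is computed over all cached entries instead of recomputing keys and values for the whole prefix.

```python
import math

def attention(q, keys, values):
    # Scaled dot-product attention of one query over cached keys/values.
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(len(q))
              for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]  # softmax, numerically stable
    z = sum(exps)
    weights = [e / z for e in exps]
    dim = len(values[0])
    # Weighted sum of cached value vectors.
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]

class KVCache:
    """Append each token's key/value once; reuse them at every later step."""
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, q, k, v):
        # Cache grows by one entry per decoded token; no re-computation
        # of earlier keys/values is needed.
        self.keys.append(k)
        self.values.append(v)
        return attention(q, self.keys, self.values)
```

With a single cached entry the softmax weight is 1.0, so the first call returns that entry's value vector unchanged; later calls blend all cached values by attention weight. Real inference engines store these tensors per layer and per head, which is why KV-cache memory dominates serving cost for long contexts.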