Large Language Model (LLM): A large-scale artificial intelligence system designed to understand and generate text
anchor data: A limited set of real data used as a reference or seed to generate synthetic data
Markov chain: A sequence of events where the probability of each event depends only on the state attained in the previous event
reverse-bottleneck: The paper's proposed theoretical framework analyzing how information from a generative model flows into and benefits a post-trained model, conceptually inverting traditional bottleneck compression
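Information Bottleneck (IB) theory: A theoretical framework in which a model learns a compressed representation of its input, maximizing the mutual information between that representation and the target while minimizing the mutual information between the representation and the original input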
Generalization Gain via Mutual Information (GGMI): The paper's proposed concept elucidating the relationship between generalization bounds and the information gain from the synthetic generation process
task divergence: The Total Variation distance between the real target task distribution and the generative model's output distribution
generation divergence: The Total Variation distance between the generative model's raw output distribution and the final synthetic dataset distribution after curation
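Total Variation (TV) distance: A measure of the difference between two probability distributions, equal to the largest difference in probability that the two distributions assign to any single event (for discrete distributions, half the sum of absolute differences); it ranges from 0 to 1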
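Kullback-Leibler (KL) divergence: An asymmetric measure of how one probability distribution differs from a reference distribution; it is non-negative and zero only when the two distributions coincide, but it is not a true distance metric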
Gaussian Mixture Model (GMM): A probabilistic model assuming all data points are generated from a mixture of a finite number of Gaussian distributions
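PAC-Bayes: The Probably Approximately Correct (PAC) Bayesian framework, which bounds generalization error in terms of how strongly the learned model parameters depend on the training data, typically measured by the KL divergence between a posterior and a prior over parameters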
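
As a small illustration of the two divergence measures defined above, the following sketch computes the TV distance and the KL divergence for a pair of discrete distributions. The function names and the example distributions `p` and `q` are hypothetical, chosen only for illustration, and are not taken from the paper; the KL computation assumes `q` is nonzero wherever `p` is nonzero.

```python
import math

def tv_distance(p, q):
    """Total Variation distance between two discrete distributions:
    half the sum of absolute differences of their probabilities."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

def kl_divergence(p, q):
    """KL(p || q) in nats. Terms with p_i == 0 contribute nothing.
    Assumes q_i > 0 wherever p_i > 0 (otherwise the divergence is infinite)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical example distributions over three outcomes.
p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]

tv = tv_distance(p, q)
kl_pq = kl_divergence(p, q)  # divergence of p from reference q
kl_qp = kl_divergence(q, p)  # swapping the arguments gives a different value

print(f"TV(p, q)   = {tv:.4f}")
print(f"KL(p || q) = {kl_pq:.4f}")
print(f"KL(q || p) = {kl_qp:.4f}")
```

The two KL values differ, which shows the asymmetry of KL divergence, whereas the TV distance is the same in either direction. The values also satisfy Pinsker's inequality, TV <= sqrt(KL / 2), which is the standard link between the two measures.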