
Understanding In-Context Learning via Supportive Pretraining Data

Xiaochuang Han, Daniel Simig, Todor Mihaylov, Yulia Tsvetkov, Asli Celikyilmaz, Tianlu Wang
Meta AI, University of Washington
arXiv (2023)

📝 Paper Summary

In-Context Learning (ICL) Pretraining Data Analysis
The paper identifies a small subset of pretraining data that supports in-context learning and finds it is characterized by rare tokens and challenging long-range context dependencies, not domain relevance.
Core Problem
It is not well understood why the ability to learn from in-context demonstrations emerges in language models, given that they are never explicitly trained on such examples during pretraining.
Why it matters:
  • In-context learning is a crucial emergent ability of Large Language Models (LLMs), allowing them to adapt to tasks without parameter updates.
  • Prior work focuses on inference-time mechanisms or synthetic data, leaving the connection between real-world pretraining data and ICL ability largely unexplored.
  • Understanding the data origin of ICL could guide better pretraining data construction to enhance model capabilities.
Concrete Example: A typical pretraining instance (e.g., a Wikipedia article) looks structurally very different from an ICL prompt (e.g., 'Review: Good -> Positive, Review: Bad -> Negative'). The paper investigates which specific pretraining documents actually help the model learn to process these ICL prompts.
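The structural mismatch above can be made concrete with a small sketch (our own illustration, not code from the paper): a pretraining document is free-flowing text, while an ICL prompt is a rigid sequence of input-label demonstrations followed by a query.

```python
# Illustrative only: contrast a pretraining-style document with an
# ICL-style prompt built from labeled demonstrations plus a query.
def build_icl_prompt(demonstrations, query):
    """Concatenate (input -> label) demonstrations, then the unlabeled query."""
    lines = [f"Review: {text} -> {label}" for text, label in demonstrations]
    lines.append(f"Review: {query} ->")
    return "\n".join(lines)

pretraining_doc = "The orca, or killer whale, is a toothed whale that ..."
demos = [("Good", "Positive"), ("Bad", "Negative")]
prompt = build_icl_prompt(demos, "Great acting, weak plot")
print(prompt)
```

The model never sees the repetitive demonstration structure of `prompt` during pretraining, which is exactly the gap the paper probes.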
Key Novelty
ORCA-ICL: Gradient-based search for ICL-supportive pretraining data
  • Adapts an iterative gradient-based method (ORCA) to find pretraining examples whose gradients are similar to the gradients of in-context learning task data.
  • Performs 'perturbative continued pretraining' (very few steps) on this subset to verify it improves ICL performance without harming zero-shot capabilities.
  • Analyzes the identified data to discover they are not domain-relevant but contain rare tokens and challenging long-range dependencies.
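The selection step can be sketched as follows. This is a deliberately simplified stand-in for ORCA-ICL: in the actual method the gradients come from a language model's loss on pretraining instances and ICL task data, and the search is iterative; here they are plain NumPy vectors scored once by cosine similarity.

```python
# Hypothetical simplification of gradient-based supportive-data selection:
# rank pretraining examples by how well their (stand-in) gradients align
# with the gradient of the ICL task loss, and keep the top-k.
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def select_supportive(pretrain_grads, icl_task_grad, k):
    """Return indices of the k pretraining examples whose gradients are
    most similar (cosine) to the ICL task gradient."""
    scores = [cosine(g, icl_task_grad) for g in pretrain_grads]
    return sorted(range(len(scores)), key=lambda i: -scores[i])[:k]

rng = np.random.default_rng(0)
task_grad = rng.normal(size=16)
grads = [rng.normal(size=16) for _ in range(100)]
grads[7] = task_grad + 0.1 * rng.normal(size=16)  # one example aligned with the task
print(select_supportive(grads, task_grad, k=3))  # index 7 ranks first
```

The selected subset is then used for the brief "perturbative continued pretraining" step described above.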
Evaluation Highlights
  • Perturbative continued pretraining on the identified supportive subset improves ICL performance by up to 18% on specific downstream tasks.
  • The identified supportive data does not improve zero-shot performance (without demonstrations), confirming the selection is specific to the ICL mechanism.
  • Supportive data contains a higher mass of rarely occurring, long-tail tokens and yields lower incremental information gain from long-range context than random pretraining data; the context helps less, making long-range prediction more challenging.
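The long-tail finding can be illustrated with a toy metric (our own sketch, not the paper's exact measure): given corpus-wide token counts, compute the fraction of a document's tokens whose corpus frequency falls at or below a hypothetical rarity cutoff.

```python
# Illustrative metric: the share of a document's tokens that are "rare"
# in the corpus, where `rare_threshold` is a hypothetical frequency cutoff.
from collections import Counter

def long_tail_mass(doc_tokens, corpus_counts, rare_threshold=2):
    """Fraction of doc tokens whose corpus count is <= rare_threshold."""
    rare = sum(1 for t in doc_tokens if corpus_counts.get(t, 0) <= rare_threshold)
    return rare / max(len(doc_tokens), 1)

corpus_counts = Counter(["the"] * 50 + ["model"] * 10 + ["orca"] * 1 + ["perturbative"] * 2)
doc = ["the", "orca", "perturbative", "model", "the"]
print(long_tail_mass(doc, corpus_counts))  # 2 of 5 tokens are rare -> 0.4
```

Under this kind of measure, the paper's supportive subset would score higher than randomly sampled pretraining documents.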
Breakthrough Assessment
7/10
Significant for offering a data-centric explanation of ICL emergence (rare tokens and challenging long-range context) rather than mere domain relevance. However, the gradient-similarity search is computationally expensive, and the evaluation is limited to classification tasks.