
Understanding In-Context Learning via Supportive Pretraining Data

Xiaochuang Han, Daniel Simig, Todor Mihaylov, Yulia Tsvetkov, Asli Celikyilmaz, Tianlu Wang
Meta AI, University of Washington
arXiv (2023)

📝 Paper Summary

In-Context Learning (ICL) Pretraining Data Analysis
The paper identifies a small subset of pretraining data that supports in-context learning and finds it is characterized by rare tokens and challenging long-range context dependencies, not domain relevance.
Core Problem
It is not well understood why the ability to learn from in-context demonstrations emerges in language models, given that they are never explicitly trained on such examples during pretraining.
Why it matters:
  • In-context learning is a crucial emergent ability of Large Language Models (LLMs), allowing them to adapt to tasks without parameter updates.
  • Prior work focuses on inference-time mechanisms or synthetic data, leaving the connection between real-world pretraining data and ICL ability largely unexplored.
  • Understanding the data origin of ICL could guide better pretraining data construction to enhance model capabilities.
Concrete Example: A typical pretraining instance (e.g., a Wikipedia article) looks structurally very different from an ICL prompt (e.g., 'Review: Good -> Positive, Review: Bad -> Negative'). The paper investigates which specific pretraining documents actually help the model learn to process these ICL prompts.
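The structural mismatch above can be made concrete with a small sketch (our own illustration, not code from the paper): a pretraining document is free-flowing text, while an ICL prompt is a rigid sequence of input-label demonstrations followed by a query.

```python
# Illustrative only: contrast a pretraining-style document with an
# ICL-style prompt built from labeled demonstrations plus a query.
def build_icl_prompt(demonstrations, query):
    """Concatenate (input -> label) demonstrations, then the unlabeled query."""
    lines = [f"Review: {text} -> {label}" for text, label in demonstrations]
    lines.append(f"Review: {query} ->")
    return "\n".join(lines)

pretraining_doc = "The orca, or killer whale, is a toothed whale that ..."
demos = [("Good", "Positive"), ("Bad", "Negative")]
prompt = build_icl_prompt(demos, "Great acting, weak plot")
print(prompt)
```

The model never sees the repetitive demonstration structure of `prompt` during pretraining, which is exactly the gap the paper probes.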
Key Novelty
ORCA-ICL: Gradient-based search for ICL-supportive pretraining data
  • Adapts an iterative gradient-based method (ORCA) to find pretraining examples whose gradients are similar to the gradients of in-context learning task data.
  • Performs 'perturbative continued pretraining' (very few steps) on this subset to verify it improves ICL performance without harming zero-shot capabilities.
  • Analyzes the identified data to discover they are not domain-relevant but contain rare tokens and challenging long-range dependencies.
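The selection step can be sketched as follows. This is a deliberately simplified stand-in for ORCA-ICL: in the actual method the gradients come from a language model's loss on pretraining instances and ICL task data, and the search is iterative; here they are plain NumPy vectors scored once by cosine similarity.

```python
# Hypothetical simplification of gradient-based supportive-data selection:
# rank pretraining examples by how well their (stand-in) gradients align
# with the gradient of the ICL task loss, and keep the top-k.
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def select_supportive(pretrain_grads, icl_task_grad, k):
    """Return indices of the k pretraining examples whose gradients are
    most similar (cosine) to the ICL task gradient."""
    scores = [cosine(g, icl_task_grad) for g in pretrain_grads]
    return sorted(range(len(scores)), key=lambda i: -scores[i])[:k]

rng = np.random.default_rng(0)
task_grad = rng.normal(size=16)
grads = [rng.normal(size=16) for _ in range(100)]
grads[7] = task_grad + 0.1 * rng.normal(size=16)  # one example aligned with the task
print(select_supportive(grads, task_grad, k=3))  # index 7 ranks first
```

The selected subset is then used for the brief "perturbative continued pretraining" step described above.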
Evaluation Highlights
  • Perturbative continued pretraining on the identified supportive subset improves ICL performance by up to 18% on specific downstream tasks.
  • The identified supportive data does not improve zero-shot performance (without demonstrations), confirming the selection is specific to the ICL mechanism.
  • Supportive data contains a higher mass of rarely occurring, long-tail tokens and yields lower incremental information gain from long-range context than random pretraining data; the context helps less, making long-range prediction more challenging.
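The long-tail finding can be illustrated with a toy metric (our own sketch, not the paper's exact measure): given corpus-wide token counts, compute the fraction of a document's tokens whose corpus frequency falls at or below a hypothetical rarity cutoff.

```python
# Illustrative metric: the share of a document's tokens that are "rare"
# in the corpus, where `rare_threshold` is a hypothetical frequency cutoff.
from collections import Counter

def long_tail_mass(doc_tokens, corpus_counts, rare_threshold=2):
    """Fraction of doc tokens whose corpus count is <= rare_threshold."""
    rare = sum(1 for t in doc_tokens if corpus_counts.get(t, 0) <= rare_threshold)
    return rare / max(len(doc_tokens), 1)

corpus_counts = Counter(["the"] * 50 + ["model"] * 10 + ["orca"] * 1 + ["perturbative"] * 2)
doc = ["the", "orca", "perturbative", "model", "the"]
print(long_tail_mass(doc, corpus_counts))  # 2 of 5 tokens are rare -> 0.4
```

Under this kind of measure, the paper's supportive subset would score higher than randomly sampled pretraining documents.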
Breakthrough Assessment
7/10
Significant for offering a data-centric explanation of ICL emergence (rare tokens and challenging long-range context) rather than mere domain relevance. However, the gradient-similarity search is computationally expensive, and the evaluation is limited to classification tasks.