Can Language Models Compose Skills In-Context?

📝 Paper Summary

In-Context Learning (ICL)
Compositional Generalization

Providing examples of simple tasks often hurts performance on composite tasks because models fail to recognize the composition, a failure mitigated by explicitly aligning examples to steps via Expanded Chain-of-Thought.

Core Problem

Models struggle to perform composite tasks (tasks that combine basic skills) in-context when the prompt contains examples of the simple skills, and accuracy often gets worse as more simple-task examples are added.

Why it matters:

The exponential number of possible task compositions makes learning each individually impossible; systems must generalize by composing known skills.
Current assumptions that providing examples of basic skills helps models solve complex queries are empirically contradicted.
Models treat relevant skill examples as interfering noise rather than useful signals for composition.

Concrete Example: For a composite task 'opposition+swap' (e.g., input '* Grow Respect #'), providing more examples of just the 'opposition' task causes the model to ignore the 'swap' operation and output only the antonyms (e.g., 'Shrink Disrespect') instead of the swapped antonyms.

Key Novelty

Expanded Chain-of-Thought (ExpCoT)

Treats simple task examples as 'composite examples with missing steps' rather than distinct tasks.
Expands all examples into a uniform Chain-of-Thought format where missing operations are marked with special placeholders (e.g., 'Step 1: ???').
Explicitly aligns each example to its corresponding step in the composition process, preventing the model from matching the query to the wrong task.
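
The expansion described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's Algorithm 1: the function names, the "Input:/Step k:" template, and the tuple layout of the examples are assumptions made for this sketch. Only the idea of marking undemonstrated steps with a '???' placeholder comes from the summary.

```python
PLACEHOLDER = "???"  # marker for a step the example does not demonstrate


def expand_example(x, step_outputs, num_steps):
    """Render one example as a chain with exactly num_steps steps.

    step_outputs maps a 1-based step index to the output shown for that
    step; every step that is absent is filled with PLACEHOLDER.
    """
    lines = [f"Input: {x}"]
    for step in range(1, num_steps + 1):
        lines.append(f"Step {step}: {step_outputs.get(step, PLACEHOLDER)}")
    return "\n".join(lines)


def build_expcot_prompt(simple_examples, composite_examples, query, num_steps):
    """Assemble a prompt in which every example shares the same chain layout.

    simple_examples:    list of (x, step_index, y) triples (assumed layout)
    composite_examples: list of (x, [y_1, ..., y_T]) pairs (assumed layout)
    """
    blocks = []
    for x, step, y in simple_examples:
        blocks.append(expand_example(x, {step: y}, num_steps))
    for x, ys in composite_examples:
        blocks.append(expand_example(x, dict(enumerate(ys, start=1)), num_steps))
    # The query is left open at the first step for the model to continue.
    blocks.append(f"Input: {query}\nStep 1:")
    return "\n\n".join(blocks)
```

With this layout, an 'opposition' example such as ("Grow Respect", 1, "Shrink Disrespect") is rendered with "Step 2: ???", so it reads as a composite example with a missing step rather than as a separate task.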

Architecture

The ExpCoT (Expanded Chain-of-Thought) procedure.

Evaluation Highlights

Increasing the number of simple task examples (k=2 to 30) lowers average composite-task accuracy on Llama-13B by roughly 7.5 percentage points.
ExpCoT significantly outperforms standard prompting and Naïve CoT; e.g., on Llama-2-13B, ExpCoT achieves ~60% accuracy vs <20% for Naïve CoT on specific tasks.
Inner attention analysis reveals high cosine similarity between simple and composite task queries, confirming models fail to distinguish them structurally.

Breakthrough Assessment

7/10

Reveals a counter-intuitive failure mode in ICL (simple examples hurt composition) and provides a theoretically grounded, effective fix (ExpCoT). High value for understanding ICL limitations.

⚙️ Technical Details

Problem Definition

Setting: Sequence-to-sequence in-context learning where a composite task f_0 is a composition of simple tasks f_1, ..., f_T.

Inputs: A prompt containing k_1 examples of task 1, k_2 examples of task 2, k_c examples of the composite task, and a test query x.

Outputs: The predicted output y for the composite query x.
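
To make the setting concrete, the sketch below composes two toy simple tasks into a composite task. The antonym table and the exact behavior of `swap` (exchanging the first and last word) are assumptions for illustration; the order of application follows the "first find the antonym, then swap the words" reading of opposition+swap.

```python
from functools import reduce

# Toy antonym table; hypothetical data for illustration only.
ANTONYMS = {"Grow": "Shrink", "Respect": "Disrespect"}


def opposition(words):
    """Simple task: replace each word with its antonym (unknown words pass through)."""
    return [ANTONYMS.get(w, w) for w in words]


def swap(words):
    """Simple task (assumed definition): exchange the first and last word."""
    if len(words) < 2:
        return list(words)
    return [words[-1]] + words[1:-1] + [words[0]]


def compose(*tasks):
    """Build f_0 by applying the tasks left to right: f_0(x) = f_T(...f_1(x))."""
    return lambda x: reduce(lambda acc, f: f(acc), tasks, x)


opposition_then_swap = compose(opposition, swap)
```

Under these assumptions, `opposition_then_swap(["Grow", "Respect"])` yields the swapped antonyms, whereas `opposition` alone yields the simple-task answer that the models tend to fall back on.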

Pipeline Flow

Input formatting (Prompt Construction)
Inference (LLM Generation)
Output Parsing
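
The three stages can be wired together as in the sketch below. It is a simplified stand-in under stated assumptions: the "Input:/Output:" template, the rule of keeping the first non-empty generated line, and the `fixed_stub` model are all hypothetical and only serve to make the flow runnable without a real LLM.

```python
def construct_prompt(examples, query):
    """Stage 1: join (input, output) demonstrations and append the open query."""
    demos = [f"Input: {x}\nOutput: {y}" for x, y in examples]
    return "\n\n".join(demos + [f"Input: {query}\nOutput:"])


def parse_output(generation):
    """Stage 3: keep the first non-empty line of the generation as the answer."""
    for line in generation.splitlines():
        if line.strip():
            return line.strip()
    return ""


def run_pipeline(generate, examples, query):
    """Run all stages; generate is any callable str -> str standing in for the LLM."""
    prompt = construct_prompt(examples, query)
    return parse_output(generate(prompt))


def fixed_stub(prompt):
    """Stand-in for stage 2 that returns a fixed string; not a real model."""
    return " Disrespect Shrink\nInput: ..."
```

Passing the model as a callable keeps prompt construction and parsing testable on their own, and lets the same code run against any of the evaluated model families.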

System Modules

Prompt Constructor

Selects and formats k_1, k_2 simple examples and k_c composite examples

Model or implementation: Algorithm 1 (ExpCoT formatting)

LLM Inference

Generates the response to the composite query based on the context

Model or implementation: Various (Llama, Mistral, DeepSeek)

Novel Architectural Elements

Prompt engineering architecture (ExpCoT) that structurally aligns heterogeneous task examples (simple vs. composite) by padding missing steps with placeholders.

Modeling

Base Model: Llama (7B, 13B, 30B, 65B), Llama-2 (7B, 13B, 70B), Mistral (7B, 8x7B), Llama-3.3 (70B), DeepSeek-distill (-qwen8B, -Llama2-70b)

Compute: Not reported in the paper

📊 Experiments & Results

Evaluation Setup

In-context learning evaluation on 9 composite linguistic/logical tasks (e.g., Opposition+Swap).

Benchmarks:

XSL (2024) Tasks (Synthetic Linguistic/Logical Composition)

Metrics:

Exact Match Accuracy
Task Correspondence (how often the output matches the logic of a simple task versus the composite task)
Statistical methodology: Reported average accuracy across 4 random shufflings of examples.
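
A minimal sketch of how exact-match accuracy averaged over shufflings could be computed. The `predict` callable, the whitespace trimming, and the fixed seed are assumptions of this sketch; the summary states only that accuracy is averaged across 4 random shufflings of the examples.

```python
import random


def exact_match_accuracy(predictions, targets):
    """Fraction of predictions equal to their target after trimming whitespace."""
    if not targets:
        return 0.0
    hits = sum(p.strip() == t.strip() for p, t in zip(predictions, targets))
    return hits / len(targets)


def accuracy_over_shuffles(predict, examples, queries, targets,
                           num_shuffles=4, seed=0):
    """Average exact-match accuracy over several random example orderings.

    predict(examples, query) -> str is a hypothetical callable that wraps
    prompt construction and the model call.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(num_shuffles):
        shuffled = list(examples)  # copy so the caller's list is untouched
        rng.shuffle(shuffled)
        preds = [predict(shuffled, q) for q in queries]
        scores.append(exact_match_accuracy(preds, targets))
    return sum(scores) / len(scores)
```

Task correspondence could be measured with the same `exact_match_accuracy` function by scoring the predictions once against the outputs a simple task would produce and once against the composite-task outputs.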

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Impact of increasing simple task examples (k) on composite task performance. Shows a negative trend; per the evaluation highlights, the two columns appear to compare k=2 with k=30.
Average across tasks	Accuracy	Approx. 25.0	Approx. 17.5	-7.5
Ablation study on component correspondence. Shows models default to simple tasks.
Opposition+Swap	Correspondence to Task 1	Approx. 20.0	Approx. 80.0	+60.0
Performance of ExpCoT method vs Baselines.
Opposition+Swap (Llama-2-13B)	Accuracy	18.8	60.0	+41.2
Opposition+Swap (Llama-65B)	Accuracy	30.0	86.2	+56.2

Experiment Figures

Accuracy trends as the number of simple task examples (k) or composite examples (k_c) increases.

Cosine similarity matrix of inner attention outputs for simple vs. composite queries.
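
The similarity matrix in that figure can be illustrated with plain Python. The vectors here are hypothetical stand-ins for attention outputs; how the paper extracts them (which layer, head, or token position) is not specified in this summary.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_matrix(row_vectors, col_vectors):
    """Entry [i][j] is the cosine similarity of row_vectors[i] and col_vectors[j]."""
    return [[cosine_similarity(r, c) for c in col_vectors] for r in row_vectors]
```

In this reading, the rows would hold vectors for simple-task queries and the columns vectors for composite-task queries; values close to 1 throughout would indicate that the two kinds of query are represented almost identically.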

Main Takeaways

Simple task examples act as 'interfering noise' rather than helpful signals for composite tasks in standard ICL.
Models rely heavily on surface-level operators (syntax) rather than semantic content to match queries to examples.
The failure is structural: models do not align the simple examples to the correct step in the composite process, often executing only the simple task.
Expanded Chain-of-Thought (ExpCoT) effectively forces this alignment, turning the negative impact of simple examples into a positive one for capable models.

📚 Prerequisite Knowledge

Prerequisites

In-Context Learning (ICL)
Chain-of-Thought (CoT) Prompting
Transformer Attention Mechanisms

Key Terms

In-Context Learning (ICL): The ability of language models to learn tasks from a few examples in the prompt without parameter updates.

Chain-of-Thought (CoT): Prompting strategy where the model generates intermediate reasoning steps before the final answer.

Composite Task: A task that requires applying a sequence of simple functions (skills) to an input (e.g., first find the antonym, then swap the words).

Simple Task: A basic functional mapping (e.g., finding the antonym of a word).

Naïve CoT: Standard Chain-of-Thought where composite examples are broken into steps, but simple task examples remain in their original input-output format.

ExpCoT: Expanded Chain-of-Thought, a method that formats simple task examples as composite chains whose missing steps are marked by placeholders, so that every example aligns structurally with the composite task.

Inner Attention: The internal attention outputs of the Transformer model, compared here via cosine similarity to see whether the model distinguishes simple-task queries from composite-task queries.