Evaluation Setup
The work spans three distinct evaluation settings across chapters: semi-supervised classification, efficient fine-tuning, and instruction following.
Benchmarks:
- GLUE / SuperGLUE (NLU Classification)
- AlpacaEval (Open-ended Instruction Following)
- StepGame (Multi-hop Spatial Reasoning) [New]
Metrics:
- Accuracy
- Win Rate (AlpacaEval)
- Inference Latency / Memory Usage
- Statistical methodology: Not explicitly reported in the paper
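As a concrete illustration of the win-rate metric listed above, here is a minimal sketch of how a pairwise win rate (as in AlpacaEval-style evaluation) can be computed. The verdict labels and the tie-counts-as-half convention are assumptions for illustration, not details taken from the paper:

```python
def win_rate(verdicts):
    """Fraction of pairwise comparisons in which the candidate model's
    output was preferred over the baseline's. One common convention,
    assumed here, counts a tie as half a win."""
    wins = sum(1.0 for v in verdicts if v == "candidate")
    ties = sum(0.5 for v in verdicts if v == "tie")
    return (wins + ties) / len(verdicts)

# Hypothetical judge verdicts over 5 prompts.
verdicts = ["candidate", "baseline", "candidate", "tie", "candidate"]
print(win_rate(verdicts))  # 0.7
```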
Key Results
- Instruction Modelling (IM) significantly improves win rates on open-ended generation benchmarks compared to standard training, especially in low-data regimes.
- Decomposed Prompt Tuning (DePT) achieves efficiency gains over vanilla Prompt Tuning.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| AlpacaEval 1.0 | Win Rate | Not reported in the paper | Not reported in the paper | Not reported in the paper |
| Efficiency Metrics (DePT vs. Prompt Tuning) | Relative Memory Cost | 1.0 | 0.8 | -0.2 |
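A rough sketch of where DePT's efficiency gain comes from, assuming the decomposition the paper describes (a shorter soft prompt plus a low-rank update to the frozen input embeddings). All dimension values below are illustrative, not the paper's configuration:

```python
def param_counts(embed_dim, seq_len, prompt_len, short_prompt_len, rank):
    """Compare trainable parameter counts: vanilla Prompt Tuning vs. DePT.
    DePT replaces one long soft prompt with a shorter soft prompt plus a
    low-rank pair (A: seq_len x rank, B: rank x embed_dim) added to the
    frozen input embeddings."""
    vanilla = prompt_len * embed_dim                    # full-length soft prompt
    dept = (short_prompt_len * embed_dim                # shorter soft prompt
            + seq_len * rank + rank * embed_dim)        # low-rank pair A, B
    return vanilla, dept

vanilla, dept = param_counts(embed_dim=768, seq_len=256, prompt_len=100,
                             short_prompt_len=40, rank=30)
print(vanilla, dept)  # 76800 61440
```

Note that the memory/latency saving reported in the table comes mainly from the shorter input sequence (40 prompt tokens instead of 100 are prepended at every step), not just from the parameter count.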
Main Takeaways
- Task-Adaptive Pre-training (TAPT) is often a more robust semi-supervised baseline than complex Self-Training methods.
- Prompt-based Continued Pre-training (PCP) is essential for prompt-based fine-tuning; standard continued pre-training can actually hurt performance in these setups.
- Instruction Modelling (IM) is highly effective for reducing overfitting when training data has long instructions and short outputs (e.g., logic puzzles, classification tasks posed as chat).
- DePT successfully breaks the trade-off between prompt expressivity (length) and computational efficiency.
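The IM takeaway above can be sketched as a change in the loss mask (a minimal illustration, not the authors' implementation): standard instruction tuning computes the language-modelling loss only on response tokens, whereas IM also includes the instruction tokens, which matters most when instructions are long and outputs are short.

```python
def build_loss_mask(instruction_len, response_len, instruction_modelling):
    """Per-token loss mask: 1 = token contributes to the LM loss, 0 = ignored.
    Standard SFT masks out the instruction; IM keeps it in the loss."""
    instr = [1 if instruction_modelling else 0] * instruction_len
    resp = [1] * response_len
    return instr + resp

# Long instruction, short output: the regime where IM helps most.
standard = build_loss_mask(instruction_len=50, response_len=5,
                           instruction_modelling=False)
im = build_loss_mask(instruction_len=50, response_len=5,
                     instruction_modelling=True)
print(sum(standard), sum(im))  # 5 55
```

With standard training, only 5 of 55 tokens supervise the model, which encourages memorising the short outputs; IM spreads the loss over all 55 tokens.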