From Images to Words: Efficient Cross-Modal Knowledge Distillation to Language Models from Black-box Teachers

📝 Paper Summary

Knowledge Distillation, Multimodal Learning, Model Compression

ARMADA improves purely text-based language models by distilling abstract knowledge from fixed, potentially black-box vision-language teachers using manifold alignment without requiring the teacher to be trained.

Core Problem

Existing cross-modal knowledge distillation methods require computationally expensive pre-training of multimodal teachers and cannot utilize powerful black-box models (like Midjourney) to enhance text-only students.

Why it matters:

Language-only models miss out on generalized concepts grounded in visual modalities
Current cross-modal KD is inefficient because it demands training the teacher on massive video/image-text datasets before distillation
High-performing commercial multimodal models are often black-boxes (API-only), making traditional white-box distillation impossible

Concrete Example: Picture a blind person (the student model) learning about the world through narration by a prompter (the teacher). Existing methods require the prompter to undergo expensive training first. ARMADA lets the student learn abstract concepts from a pre-existing, fixed prompter (even a black-box one) without prior coordination.

Key Novelty

Alignment-induced Cross-Modal Knowledge Distillation (ARMADA)

Uses a 'TS Aligner' module to map the student's text representations and the teacher's multimodal representations into a shared manifold space
Aligns the topological structures of the teacher and student spaces via manifold projection losses (Euclidean, cosine, and element-wise distances) rather than just mimicking outputs
Enables distillation from black-box teachers by focusing on representation alignment through the aligner rather than requiring access to teacher weights or gradients

Architecture

The ARMADA framework pipeline showing the interaction between Teacher, Student, and TS Aligner.

Evaluation Highlights

+3.4% average improvement on GLUE/SuperGLUE for BERT-6L when distilled from a Stable Diffusion teacher
+2.6% improvement on generative reasoning tasks for LLaMA-7B without any multimodal pre-training
Achieves these gains with only 0.8% additional learnable parameters compared to existing unimodal and multimodal KD methods

Breakthrough Assessment

7/10

Novel approach to distilling black-box vision models into text models without teacher training. Strong empirical results on NLU and reasoning, though the theoretical link between visual generation and text reasoning is abstract.

⚙️ Technical Details

Problem Definition

Setting: Cross-modal knowledge distillation where Teacher T operates in modality M_t (Vision+Text) and Student S operates in modality M_s (Text only), with M_s a subset of M_t.

Inputs: Text sequence X_s (student input) and corresponding cross-modal representation/prompt X_t (teacher input)

Outputs: Enhanced text representation h'_s and task-specific predictions o_s

Pipeline Flow

Student Encoder (Text Input)
TS Aligner (Maps Student & Teacher to shared space)
Manifold Projection (Projects to common subspace)
Task Output Heads (Main + Auxiliary)

System Modules

Student Model (Encoding)

Processes text input to generate hidden representations

Model or implementation: BERT-6L, DeBERTa-v2, OPT-1.3B, or LLaMA-7B

Teacher Model (Frozen) (Encoding)

Provides cross-modal supervision signal (can be black-box)

Model or implementation: Stable Diffusion or Midjourney

TS Aligner (Alignment)

Non-linear mapping to align student/teacher features before projection

Model or implementation: MLP / Linear layers

Manifold Projector (Alignment)

Projects aligned representations to a common subspace for distance minimization

Model or implementation: Orthogonal projection layers P_ts, P_s

Novel Architectural Elements

TS Aligner module explicitly designed to bridge text-only and multimodal feature spaces
Dual auxiliary output heads on projection vectors to enforce task-relevant manifold structure

Modeling

Base Model: Evaluated on BERT-6L, DeBERTa-v2-xxlarge, OPT-1.3B, LLaMA-7B/3B/8B

Training Method: Knowledge Distillation via Manifold Alignment

Objective Functions:

Purpose: Task-specific classification/regression loss.
Formally: Cross-entropy or MSE against ground truth Y.

Purpose: Distillation loss matching student logits to aligner logits.
Formally: KL-divergence or MSE between Student and Aligner outputs.

Purpose: Manifold alignment loss to minimize distance in projected space.
Formally: Combination of Cosine similarity, Euclidean distance, and Element-wise distance between projected representations p_ts and p_s.

Purpose: Auxiliary loss to ensure projections retain task information.
Formally: Task loss computed on auxiliary heads attached to projection vectors.
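
The manifold alignment term above can be illustrated with a minimal sketch in plain Python. This is not the paper's implementation: the summary does not give the exact form of the element-wise distance or how the three terms are weighted, so the mean absolute difference and the unweighted sum used here are assumptions, and all function names are hypothetical. A real training loop would compute the same quantities on tensors in a deep learning framework.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity; close to 0 when the vectors point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Guard against division by zero for all-zero vectors.
    return 1.0 - dot / max(norm_u * norm_v, 1e-12)

def euclidean_distance(u, v):
    """Straight-line (L2) distance between the two projections."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def elementwise_distance(u, v):
    """Mean absolute per-coordinate difference (ASSUMED definition)."""
    return sum(abs(a - b) for a, b in zip(u, v)) / len(u)

def manifold_alignment_loss(p_ts, p_s):
    """Unweighted sum of the three distances (the weighting is an ASSUMPTION)."""
    return (cosine_distance(p_ts, p_s)
            + euclidean_distance(p_ts, p_s)
            + elementwise_distance(p_ts, p_s))

# Toy projected representations for the aligner side (p_ts) and student side (p_s).
p_ts = [1.0, 0.0, 2.0]
p_s = [0.8, 0.1, 1.9]
print(manifold_alignment_loss(p_ts, p_s))
```

The loss shrinks toward zero as p_s approaches p_ts, which is the behavior the alignment objective relies on.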

Trainable Parameters: Student model + TS Aligner (0.8% extra params during training)

Key Hyperparameters:

alpha: 0.5 (weight for output loss)
beta: 1 (weight for manifold alignment)
gamma: 1 (weight for auxiliary alignment)
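
As a rough illustration of how these weights might enter the training objective, the sketch below combines four scalar loss terms. The summary does not state the exact combination, so this is an assumption: the task loss is left unweighted, alpha scales the distillation ("output") loss, beta scales the manifold alignment loss, and gamma scales the auxiliary loss. The function name `total_loss` is hypothetical.

```python
def total_loss(task, distill, manifold, aux, alpha=0.5, beta=1.0, gamma=1.0):
    """Weighted sum of the four objectives.

    ASSUMPTION: alpha applies to the distillation (output) loss and the task
    loss is unweighted; the paper summary does not spell out the formula.
    """
    return task + alpha * distill + beta * manifold + gamma * aux

# Toy scalar losses for a single batch.
print(total_loss(task=1.0, distill=2.0, manifold=0.5, aux=0.25))
```

With the defaults listed above, the distillation term contributes half its raw value while the alignment and auxiliary terms contribute at full weight.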

Compute: Not reported in the paper

Comparison to Prior Work

vs. Vokenization/VidLanKD: ARMADA requires NO pre-training of the teacher on multimodal data; works with off-the-shelf teachers.
vs. Vanilla KD: ARMADA operates across modalities (Vision->Text) and uses manifold alignment rather than just logit matching.
vs. Tang et al. (2021): ARMADA supports black-box teachers, whereas prior cross-modal methods require white-box access.

Limitations

Depends on the quality of the teacher's alignment between text and visual concepts
Theoretical justification relies on topological assumptions (homeomorphism) that may not perfectly hold in practice
Computational overhead of the TS Aligner during training (though removed at inference)

Reproducibility

Source code stated to be released upon acceptance. Teacher models used are standard (Stable Diffusion, Midjourney). Student models are standard HuggingFace models. Exact training compute/time not reported.

📊 Experiments & Results

Evaluation Setup

Natural Language Understanding and Generative Reasoning

Benchmarks:

GLUE (Natural Language Understanding)
SuperGLUE (Natural Language Understanding)
Commonsense Reasoning Tasks (Generative Reasoning; e.g., PIQA, OpenBookQA)
Mathematical Reasoning Tasks (Generative Reasoning; e.g., GSM8K)

Metrics:

Accuracy
F1 Score
Statistical methodology: Not explicitly reported in the paper

Key Results

Performance on NLU tasks using BERT-6L student and Stable Diffusion teacher.
Benchmark	Metric	Baseline	This Paper	Δ
GLUE/SuperGLUE (Avg)	Average Score	Not reported in the paper	Not reported in the paper	+3.4%
Generative Reasoning	Task-specific Accuracy	Not reported in the paper	Not reported in the paper	+2.6%
NLU Tasks	Average Score	Not reported in the paper	Not reported in the paper	+1.4%
NLU Tasks	Average Score	Not reported in the paper	Not reported in the paper	+1.5%

Experiment Figures

Conceptual mapping of homeomorphic spaces.

Main Takeaways

Consistent improvements across diverse architectures (BERT, DeBERTa, OPT, LLaMA) and sizes.
Effective distillation from both white-box (Stable Diffusion) and black-box (Midjourney) teachers.
Among the manifold alignment losses, the element-wise loss provides a stronger regularization signal than the Euclidean or cosine losses.
Achieves gains without the massive pre-training compute cost required by previous cross-modal KD methods like VidLanKD.

📚 Prerequisite Knowledge

Prerequisites

Knowledge Distillation (KD) fundamentals
Manifold Learning / Representation Learning
Vision-Language Models (e.g., Stable Diffusion, CLIP)
Transformer architectures (BERT, LLaMA)

Key Terms

TS Aligner: A module introduced in this paper that aligns the student model's hidden states with the teacher's multimodal abstraction space

Manifold Alignment: Technique to map data from different modalities (e.g., text and image) into a shared lower-dimensional space where their structures are similar

Black-box Teacher: A teacher model whose internal weights and gradients are inaccessible; only inputs and outputs (or embeddings) are available

Logit Matching: Classic KD loss where the student tries to match the teacher's output logits, or the probability distribution obtained by applying softmax to them

Homeomorphism: A topological concept where a continuous bijection with a continuous inverse exists between two spaces, so that the two spaces share the same topological properties

Auxiliary Head: An extra output layer used during training to enforce specific constraints or learn features, often discarded at inference

GLUE/SuperGLUE: Standard benchmarks for Natural Language Understanding (NLU) tasks
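
To make the Logit Matching term concrete, here is a minimal sketch of the classic formulation in plain Python: both sets of logits are softened with a temperature, and the student is penalized by the KL divergence from the target distribution. This is the generic textbook loss, not ARMADA's code. In ARMADA, per the summary, the target logits come from the TS Aligner rather than directly from the teacher. The temperature value and function names are assumptions, and the usual temperature-squared scaling factor is omitted for brevity.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def logit_matching_loss(target_logits, student_logits, temperature=2.0):
    """KL divergence between softened target and student distributions.

    The temperature of 2.0 is an illustrative ASSUMPTION, not a reported value.
    """
    p_target = softmax(target_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return kl_divergence(p_target, p_student)

# Toy logits over three classes.
print(logit_matching_loss([3.0, 1.0, 0.0], [2.5, 1.2, 0.1]))
```

The loss is zero when the two sets of logits agree and grows as the student's distribution drifts away from the target.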