AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model

📝 Paper Summary

Multimodal Large Language Models (MLLMs), Instruction Tuning

AnyMAL aligns diverse modalities (image, video, audio, IMU) to a frozen 70B LLM via lightweight adapters and fine-tunes with a custom multimodal instruction set for broad reasoning capabilities.

Core Problem

Prior multimodal LLMs typically focus on limited modalities (mostly image-text), often lack open-source scalability, or fail to generalize to diverse instruction-following tasks beyond simple Q&A.

Why it matters:

Restricting a model to a single non-text modality (for example, images only) limits its ability to reason over complex, real-world sensory environments.
Existing datasets for instruction tuning often lack the diversity required for creative or complex reasoning (e.g., 'write a poem based on this image').
Scaling multimodal pre-training to 70B parameter models is computationally expensive and memory-intensive.

Concrete Example: When asked to 'Write a short story about the scene' for an image of seagulls, baselines like BLIP-2 or InstructBLIP often output short, caption-like fragments (e.g., 'a bird'), whereas AnyMAL generates a coherent narrative with dialogue.

Key Novelty

Unified Multimodal Alignment with Quantized Pre-training

Projects signals from diverse pre-trained encoders (image, audio, video, IMU) into a shared text embedding space using lightweight adapters while keeping the massive LLM (70B) frozen.
Uses quantization (4-bit/8-bit) during pre-training to fit the frozen 70B model on a single 80GB GPU, enabling scalable alignment without full fine-tuning.
Introduces a manually collected 'Multimodal Instruction Tuning' (MM-IT) dataset specifically designed for open-ended, complex reasoning tasks beyond standard VQA.

Architecture

The AnyMAL methodology: aligning diverse modalities (Image, Audio, Video, IMU) to a frozen LLM via lightweight adapters.

Evaluation Highlights

+7.0% relative accuracy improvement on zero-shot VQAv2 compared to literature baselines (65.0 to 69.5 in the Key Results table, i.e. +4.5 points absolute).
+8.4 CIDEr score improvement on zero-shot COCO image captioning compared to previous state-of-the-art.
+14.5 CIDEr score improvement on AudioCaps audio captioning compared to literature baselines.

Breakthrough Assessment

8/10

Significantly expands multimodal capabilities beyond vision to audio/IMU with strong zero-shot performance and a new high-quality instruction dataset. Demonstrates effective scaling to 70B models via quantization.

⚙️ Technical Details

Problem Definition

Setting: Multimodal conditional text generation

Inputs: Multimodal input X_modality (image, video, audio, or IMU) and optional text prompt X_text

Outputs: Textual response sequence Y

Pipeline Flow

Modality Encoder (extracts features from raw input)
Projection Layer/Adapter (aligns features to LLM space)
LLM Backbone (generates text response)
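
The three stages above can be sketched with toy shapes. This is only an illustration under stated assumptions, not the paper's implementation: the dimensions, the random `W_enc` stand-in for a frozen encoder, and the helper names `encode`, `project`, and `build_llm_input` are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only to keep the example small.
RAW_DIM, ENC_DIM, LLM_DIM, NUM_TOKENS = 8, 16, 32, 4

W_enc = rng.standard_normal((RAW_DIM, ENC_DIM))  # stand-in for a frozen encoder
W_proj = rng.standard_normal((ENC_DIM, NUM_TOKENS * LLM_DIM)) * 0.02  # trainable adapter

def encode(raw):
    """Stage 1: map a raw signal to a feature vector of size ENC_DIM."""
    return np.tanh(raw @ W_enc)

def project(features):
    """Stage 2: map features to a fixed number of LLM-sized token embeddings."""
    return (features @ W_proj).reshape(NUM_TOKENS, LLM_DIM)

def build_llm_input(raw, text_embeddings):
    """Stage 3 input: modality tokens are prepended to the text embeddings."""
    return np.concatenate([project(encode(raw)), text_embeddings], axis=0)

raw = rng.standard_normal(RAW_DIM)
text = rng.standard_normal((5, LLM_DIM))  # 5 hypothetical text-token embeddings
llm_input = build_llm_input(raw, text)
print(llm_input.shape)  # (9, 32)
```

The LLM would then consume `llm_input` like any other embedding sequence, which is why only the adapter needs training in the alignment stage.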

System Modules

Modality Encoder

Encode raw modality data into feature representations

Model or implementation: Modality-specific encoders (CLIP ViT-L/G for images, CLAP for audio, IMU2CLIP for IMU, InternVideo/CLIP for video)

Projection Layer

Project modality features into the LLM's token embedding space with fixed token count

Model or implementation: Perceiver Resampler (vision) or Linear Layers (audio/IMU)

LLM Backbone

Generate text response conditioned on multimodal tokens and text instructions

Model or implementation: LLaMA-2-70B-chat
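
To make the "fixed token count" property of the projection layer concrete, here is a simplified, single-layer sketch of resampler-style cross-attention: a fixed set of learned latent queries attends over input features of any length. It is an assumption-laden toy (random weights, one attention head, no feed-forward or residual layers), and the names `resample`, `latents`, `w_q`, `w_k`, and `w_v` are hypothetical rather than taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def resample(features, latents, w_q, w_k, w_v):
    """One cross-attention step: latents (queries) attend over input features."""
    q = latents @ w_q                               # (num_latents, d)
    k = features @ w_k                              # (seq_len, d)
    v = features @ w_v                              # (seq_len, d)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (num_latents, seq_len)
    return attn @ v                                 # (num_latents, d)

rng = np.random.default_rng(0)
FEAT_DIM, D, NUM_LATENTS = 16, 32, 8  # hypothetical sizes
latents = rng.standard_normal((NUM_LATENTS, D))
w_q = rng.standard_normal((D, D)) * 0.1
w_k = rng.standard_normal((FEAT_DIM, D)) * 0.1
w_v = rng.standard_normal((FEAT_DIM, D)) * 0.1

short = resample(rng.standard_normal((10, FEAT_DIM)), latents, w_q, w_k, w_v)
long = resample(rng.standard_normal((257, FEAT_DIM)), latents, w_q, w_k, w_v)
print(short.shape, long.shape)  # (8, 32) (8, 32)
```

Both inputs, 10 and 257 feature vectors long, come out as the same 8 tokens, so the LLM's context budget per modality input stays constant.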

Novel Architectural Elements

Integration of IMU motion sensor data into the LLM embedding space via IMU2CLIP alignment
Unified adapter framework allowing interleaved modality prompting (e.g., Image + IMU tokens in one context)

Modeling

Base Model: LLaMA-2-70B-chat

Training Method: Two-stage training: (1) Pre-training alignment (Adapter only), (2) Multimodal Instruction Tuning (Adapter + LoRA)

Objective Functions:

Purpose: Align modality features to text.

Formally: Next-token prediction loss (causal language modeling objective) on paired data.
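
A minimal sketch of this objective follows. It assumes the loss is averaged only over text target positions, with modality-prefix positions masked out; that masking choice is a common setup but is an assumption here, not something this summary states. The function name `masked_next_token_loss` and the toy probabilities are hypothetical.

```python
import math

def masked_next_token_loss(token_log_probs, is_text_target):
    """Mean negative log-likelihood over text target positions only.

    token_log_probs[i] is the model's log-probability of the correct token
    at position i; is_text_target[i] is False for modality-prefix positions.
    """
    picked = [lp for lp, keep in zip(token_log_probs, is_text_target) if keep]
    if not picked:
        raise ValueError("no text targets to score")
    return -sum(picked) / len(picked)

# Two modality-prefix positions (ignored) followed by two text targets.
log_probs = [math.log(0.9), math.log(0.1), math.log(0.5), math.log(0.25)]
mask = [False, False, True, True]
loss = masked_next_token_loss(log_probs, mask)
print(round(loss, 4))  # 1.0397
```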

Adaptation: LoRA (Low-Rank Adaptation) used during instruction tuning stage

Trainable Parameters: Projection layers (Adapters) and LoRA weights; Base LLM is frozen
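
The split between frozen and trainable weights can be illustrated with a toy LoRA layer. This is a sketch of the general LoRA technique, not AnyMAL's configuration: the sizes, rank, and scaling factor `ALPHA` are hypothetical, and zero-initialising `B` follows standard LoRA practice.

```python
import numpy as np

rng = np.random.default_rng(0)
D_IN, D_OUT, RANK, ALPHA = 64, 64, 4, 8  # hypothetical sizes

W = rng.standard_normal((D_IN, D_OUT))        # frozen base weight
A = rng.standard_normal((D_IN, RANK)) * 0.01  # trainable low-rank factor
B = np.zeros((RANK, D_OUT))                   # trainable, zero-initialised

def lora_forward(x):
    """Frozen projection plus a scaled low-rank update."""
    return x @ W + (ALPHA / RANK) * (x @ A @ B)

x = rng.standard_normal((2, D_IN))
frozen_params = W.size
trainable_params = A.size + B.size
print(frozen_params, trainable_params)  # 4096 512
```

Because `B` starts at zero, the layer initially reproduces the frozen model's output exactly, and training only moves the small `A` and `B` matrices.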

Training Data:

Pre-training: 200M Images (LAION-2B subset), 2.2M Audio (AudioSet, AudioCaps, Clotho), 500K IMU (Ego4D), 28M Videos
Fine-tuning: 60K manually collected MM-IT pairs + 150K synthetic LLaMA-2 pairs

Key Hyperparameters:

batch_size: 4 (per GPU with quantization)
modality_tokens: 64 - 256
resampler_layers: 6 (optimal based on ablation)

Compute: Trainable on single 80GB VRAM GPU (via quantization)
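
A back-of-the-envelope calculation shows why quantization matters for the single-GPU claim. The sketch below counts weight storage only; it deliberately ignores activations, adapter gradients, optimizer state, and quantization metadata, so real memory use is higher. The helper name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Storage for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

NUM_PARAMS = 70e9  # 70B-parameter LLM

for bits in (16, 8, 4):
    print(f"{bits}-bit: {weight_memory_gb(NUM_PARAMS, bits):.0f} GB")
# 16-bit: 140 GB
# 8-bit: 70 GB
# 4-bit: 35 GB
```

At 16-bit precision the weights alone exceed 80GB, while 4-bit storage leaves headroom for the adapter and activations.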

Comparison to Prior Work

vs. LLaVA: AnyMAL scales to 70B via quantization and supports Audio/IMU/Video, not just images
vs. Flamingo: AnyMAL uses open-source LLaMA-2 backbone and supports non-visual modalities
vs. PandaGPT [not cited in paper]: PandaGPT also supports multi-modality (image/audio) via ImageBind, but AnyMAL scales to 70B and adds IMU support

Limitations

Slight under-performance on COCO captioning for 70B model due to verbosity penalties in standard metrics
Decline in detailed object recognition accuracy after instruction tuning compared to pre-trained checkpoint
High computational cost for 70B model inference despite quantization during training

Reproducibility

Code, model weights, and the specific prompt templates are not provided. The public datasets used for training (LAION, AudioSet, Ego4D, COCO, etc.) are cited, and the method builds on open-source components (LLaMA-2, CLIP), but the AnyMAL implementation itself is not released.

📊 Experiments & Results

Evaluation Setup

Zero-shot evaluation on captioning and VQA benchmarks across modalities

Benchmarks:

COCO (Image Captioning)
VQAv2 (Visual Question Answering)
AudioCaps (Audio Captioning)
MM-IT (Multimodal Instruction Following) [New]
Hateful Meme (Visual Reasoning/Classification)

Metrics:

CIDEr
SPICE
VQA Accuracy
Human Evaluation (Win Rate)
ROUGE-L

Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Zero-shot image captioning results on COCO and MM-IT datasets show AnyMAL variants outperforming baselines.
COCO	CIDEr	121.5	129.9	+8.4
MM-IT-Cap	CIDEr	11.6	39.6	+28.0
Zero-shot VQA performance shows AnyMAL achieving state-of-the-art results on standard benchmarks.
VQAv2	Accuracy	65.0	69.5	+4.5
Hateful Meme	Accuracy	57.0	68.0	+11.0
Audio captioning results demonstrate strong generalization to non-visual modalities.
AudioCaps	CIDEr	63.8	78.3	+14.5
Human evaluation on the MM-IT dataset for reasoning tasks.
MM-IT (Human Eval)	Win Rate vs Ground Truth (%)	34.4	41.1	+6.7
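
The Δ column mixes naturally with the relative figure quoted in Evaluation Highlights, so it can help to recompute both from the rows above. The snippet below uses only the baseline and reported scores from this table; the helper name `deltas` is hypothetical.

```python
# (baseline, this paper) pairs copied from the Key Results rows.
results = {
    "COCO CIDEr": (121.5, 129.9),
    "MM-IT-Cap CIDEr": (11.6, 39.6),
    "VQAv2 Accuracy": (65.0, 69.5),
    "Hateful Meme Accuracy": (57.0, 68.0),
    "AudioCaps CIDEr": (63.8, 78.3),
    "MM-IT Human Eval Win Rate": (34.4, 41.1),
}

def deltas(baseline, ours):
    """Return (absolute gain, relative gain in percent), rounded to 1 decimal."""
    absolute = round(ours - baseline, 1)
    relative_pct = round(100 * (ours - baseline) / baseline, 1)
    return absolute, relative_pct

for name, (baseline, ours) in results.items():
    absolute, relative_pct = deltas(baseline, ours)
    print(f"{name}: +{absolute} absolute, +{relative_pct}% relative")
```

For VQAv2 this gives +4.5 points absolute and about +6.9% relative, close to the +7.0% relative figure quoted in Evaluation Highlights.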

Experiment Figures

Human evaluation results (win rates) comparing AnyMAL against baselines (LLaVA, MiniGPT-4) on the MM-IT dataset.

Ablation study on training loss for AnyMAL-13B varying resampler layers, visual tokens, and batch size.

Main Takeaways

Scaling the LLM to 70B parameters significantly improves multimodal reasoning capabilities, even when the LLM is frozen during pre-training.
Quantization is an effective strategy for training massive multimodal models (70B) on limited resources without sacrificing generation quality.
The proposed Multimodal Instruction Tuning (MM-IT) dataset improves performance on open-ended creative tasks where standard VQA models often fail.
The architecture successfully aligns diverse modalities (IMU, Audio) to text, enabling novel applications like motion-aware captioning.

📚 Prerequisite Knowledge

Prerequisites

Transformer architectures
Multimodal alignment strategies (e.g., CLIP, projection layers)
Instruction tuning for LLMs
Quantization methods (QLoRA)

Key Terms

IMU: Inertial Measurement Unit—sensors that measure force, angular rate, and orientation (motion data)

Perceiver Resampler: A neural network module that converts variable-length input features into a fixed number of token embeddings

FSDP: Fully Sharded Data Parallel—a technique to distribute model parameters across multiple GPUs to save memory

QLoRA: Quantized Low-Rank Adaptation—a fine-tuning method that uses quantized weights (e.g., 4-bit) and trains only small adapter layers

CIDEr: Consensus-based Image Description Evaluation—an automated metric for evaluating the quality of image captions against human references

VQA: Visual Question Answering—a task where a computer answers text questions based on an image

LLM: Large Language Model—a massive neural network trained on text to generate human-like language

SPICE: Semantic Propositional Image Caption Evaluation—a metric that evaluates caption quality based on scene graphs

ROUGE-L: Recall-Oriented Understudy for Gisting Evaluation (Longest Common Subsequence)—a metric measuring text overlap between generated and reference summaries