MLLM: Multi-modal Large Language Model—an AI system capable of processing and generating multiple types of media (text, image, audio) simultaneously
LoRA: Low-Rank Adaptation—a technique to fine-tune large models by updating only a small set of added parameters, keeping the main model frozen
DTW: Dynamic Time Warping—an algorithm that measures similarity between two temporal sequences (such as speech and text) that may vary in speed or length
SFT: Supervised Fine-Tuning—training a model on labeled datasets to follow instructions
Qwen2-VL: An open-source Vision-Language Model used as the backbone for Lyra
Whisper: A speech recognition model developed by OpenAI, used here as the audio encoder
LCMR: Latent Cross-Modality Regularizer—Lyra's method for aligning speech tokens with text tokens in the hidden space
ViT: Vision Transformer—a model architecture for processing images as sequences of patches
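
To make the DTW entry above concrete, here is a minimal sketch of the classic dynamic-programming formulation on two lists of numbers. The function name `dtw_distance` and the absolute-difference local cost are illustrative assumptions. This glossary does not say how Lyra applies DTW, and a real speech-text setup would likely compare embedding vectors rather than scalars.

```python
def dtw_distance(a, b):
    """Return the DTW alignment cost between two numeric sequences.

    Illustrative sketch: uses |x - y| as the local cost and the standard
    three-way recurrence (insertion, deletion, match).
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a[i-1] is matched again (b is stretched)
                cost[i][j - 1],      # b[j-1] is matched again (a is stretched)
                cost[i - 1][j - 1],  # both sequences advance together
            )
    return cost[n][m]


# The second sequence is the first one played at half speed,
# so the warped alignment cost is zero.
slow = dtw_distance([1, 2, 3], [1, 1, 2, 2, 3, 3])
```

Because each element may be matched to several elements of the other sequence, the two inputs do not need the same length, which is what "may vary in speed" refers to.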