ASR: Automatic Speech Recognition—converting spoken audio into text
TTS: Text-to-Speech—converting text into spoken audio
MLLM: Multimodal Large Language Model—a large language model that handles several data types (text, images, audio) within a single model
InternViT: A specific vision transformer model used as the visual encoder
TiCodec: A codec model used to compress continuous audio into discrete tokens and decode them back to waveforms
NAR Decoder: Non-Autoregressive Decoder—generates all outputs in parallel, over the whole sequence at once, rather than sequentially
AR Decoder: Autoregressive Decoder—generates outputs sequentially, one token at a time, each conditioned on the tokens already generated
CTC loss: Connectionist Temporal Classification loss—a loss function for training on input and output sequences of different lengths without frame-level alignments, common in speech recognition
SFT: Supervised Fine-Tuning—training on labeled instruction-response pairs
Dynamic Patching: A technique for handling high-resolution images by splitting them into a grid of smaller tiles (patches)
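
As a rough illustration of the dynamic patching entry, the sketch below picks a tile grid whose aspect ratio is closest to the image's and lists the resulting tile boxes. It is a minimal sketch, not the method of any particular model: the function name `tile_grid`, the 448-pixel tile size, and the 12-tile cap are assumptions chosen for the example.

```python
def tile_grid(width, height, tile=448, max_tiles=12):
    """Choose a (cols, rows) grid and return the tile boxes for it.

    tile and max_tiles are illustrative defaults, not values from the source.
    """
    aspect = width / height
    best = None
    for cols in range(1, max_tiles + 1):
        for rows in range(1, max_tiles // cols + 1):
            # Rank by aspect-ratio mismatch first, then by fewer tiles.
            key = (abs(aspect - cols / rows), cols * rows)
            if best is None or key < best[0]:
                best = (key, (cols, rows))
    cols, rows = best[1]

    # The image would be resized to (cols * tile, rows * tile) before cutting;
    # boxes are (left, top, right, bottom) in that resized image, row by row.
    boxes = [
        (c * tile, r * tile, (c + 1) * tile, (r + 1) * tile)
        for r in range(rows)
        for c in range(cols)
    ]
    return (cols, rows), boxes


grid, boxes = tile_grid(800, 600)
print(grid, len(boxes))
```

Under these assumptions, an 800x600 image maps to a 4x3 grid of twelve tiles, while a square image that already fits one tile stays a single tile.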