Base model: A large language model pre-trained on massive text corpora to predict the next token, serving as the foundation for further tuning.
Instruct model: A version of the Base model that has undergone post-training (e.g., SFT and RLHF) to follow user instructions and align with human preferences.
LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique that freezes original weights and trains small low-rank matrices to approximate weight updates.
DPO: Direct Preference Optimization—an alignment method that optimizes models to prefer specific responses over others without a separate reward model.
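A sketch of the DPO objective for a single preference pair, assuming we already have summed log-probabilities of the chosen and rejected responses under the policy and the frozen reference model (the numeric values below are illustrative only):

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    # DPO loss: -log sigmoid(beta * implicit-reward margin), where each
    # implicit reward is the policy/reference log-probability ratio.
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# If the policy equals the reference, the margin is 0 and the loss is log 2.
assert abs(dpo_loss(-10.0, -12.0, -10.0, -12.0) - math.log(2.0)) < 1e-9
# Favoring the chosen response more than the reference does lowers the loss.
assert dpo_loss(-9.0, -13.0, -10.0, -12.0) < math.log(2.0)
```

No separate reward model is needed because the reward is expressed implicitly through the policy-to-reference log-probability ratio.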
BAAI-2k: A subset of 2000 high-quality instruction samples extracted from the BAAI-Infinity-Instruct Dataset used for tuning experiments in this paper.
MLLM: Multimodal Large Language Model—an LLM capable of processing and generating content across multiple modalities like text and images.
Task Vector: A vector representing the difference in weights between a fine-tuned model and its pre-trained base, encoding specific task capabilities.
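A minimal sketch of computing and applying a task vector, with toy parameter dictionaries standing in for real model checkpoints:

```python
import numpy as np

def task_vector(theta_ft, theta_pre):
    # tau = theta_ft - theta_pre, computed per parameter tensor.
    return {k: theta_ft[k] - theta_pre[k] for k in theta_pre}

def apply_task_vector(theta_pre, tau, alpha=1.0):
    # Adding the (optionally scaled) task vector back onto the
    # pre-trained weights recovers or interpolates the fine-tuned model.
    return {k: theta_pre[k] + alpha * tau[k] for k in theta_pre}

pre = {"w": np.array([1.0, 2.0]), "b": np.array([0.5])}
ft  = {"w": np.array([1.5, 1.0]), "b": np.array([0.0])}
tau = task_vector(ft, pre)
rebuilt = apply_task_vector(pre, tau)
assert all(np.allclose(rebuilt[k], ft[k]) for k in ft)
```

Because task vectors live in weight space, they can be scaled, added, or negated to compose or remove task capabilities.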
Gradient descent: An optimization algorithm that minimizes the loss function by iteratively updating parameters in the direction of steepest descent, i.e., opposite to the gradient.
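The update rule θ ← θ − η∇L(θ) can be sketched on a toy one-dimensional loss (the function and learning rate here are illustrative):

```python
def gradient_descent(grad, theta, lr=0.1, steps=100):
    # Repeatedly step against the gradient of the loss.
    for _ in range(steps):
        theta = theta - lr * grad(theta)
    return theta

# Minimize L(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3).
theta_star = gradient_descent(lambda t: 2.0 * (t - 3.0), theta=0.0)
assert abs(theta_star - 3.0) < 1e-6  # converges to the minimizer theta = 3
```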
Pass@k: A metric measuring the probability that at least one of k sampled code solutions is correct (i.e., passes the unit tests).
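Pass@k is usually computed with the unbiased estimator of Chen et al. (2021): generate n samples per problem, count c correct ones, and estimate 1 − C(n−c, k)/C(n, k). A sketch:

```python
from math import comb

def pass_at_k(n, c, k):
    # Unbiased pass@k estimator: probability that a random size-k subset
    # of the n samples contains at least one of the c correct solutions.
    if n - c < k:
        return 1.0  # every size-k subset must include a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

assert pass_at_k(n=10, c=10, k=1) == 1.0       # all samples correct
assert pass_at_k(n=10, c=0, k=5) == 0.0        # no sample correct
assert abs(pass_at_k(n=2, c=1, k=1) - 0.5) < 1e-12
```

Averaging this estimate over all problems in the benchmark gives the reported pass@k score.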