MoE: Mixture-of-Experts—a model architecture that activates only a subset of parameters (experts) for each token, increasing capacity without proportional inference cost
RoPE: Rotary Position Embedding—a method for encoding position information in transformers by rotating the query and key vectors through position-dependent angles, so that attention scores depend on relative position
Seed-ViT: The custom vision encoder used in this paper, designed for native dynamic resolution processing
MIM: Masked Image Modeling—a pre-training task where the model learns to reconstruct masked parts of an image
OCR: Optical Character Recognition—converting images of text into machine-readable text formats
SFT: Supervised Fine-Tuning—training the model on high-quality instruction-response pairs
RLHF: Reinforcement Learning from Human Feedback—fine-tuning a model to maximize a reward signal derived from human preference judgments
GUI: Graphical User Interface—visual interfaces that the model learns to interact with (clicking, typing)
STEM: Science, Technology, Engineering, and Mathematics—refers here to datasets and tasks involving academic reasoning
ViT: Vision Transformer—an architecture that splits an image into patches and processes them as a token sequence with a transformer
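
As an illustration of the RoPE entry above, the following is a minimal sketch of the rotation applied to a single query or key vector. It is not the implementation used in the paper. The function names `rope_rotate` and `dot` are hypothetical, and the frequency base of 10000 is an assumed conventional default. The sketch covers only 1D token positions and pairs adjacent vector components.

```python
import math

def rope_rotate(vec, position, base=10000.0):
    """Rotate each pair (vec[i], vec[i+1]) by the angle position * base**(-i/d).

    Hypothetical helper for illustration; `base` is an assumed default.
    """
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("vector length must be even")
    out = []
    for i in range(0, d, 2):
        # Each pair gets its own frequency; later pairs rotate more slowly.
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        # Standard 2D rotation of the pair (x, y) by theta.
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    """Plain dot product, standing in for an unnormalized attention score."""
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 0.5, -0.3, 2.0]
k = [0.2, -1.0, 0.7, 0.4]

# Both position pairs have the same offset (2), so the scores should agree
# up to floating-point error.
score_a = dot(rope_rotate(q, 5), rope_rotate(k, 3))
score_b = dot(rope_rotate(q, 12), rope_rotate(k, 10))
```

Because each pair is only rotated, vector norms are unchanged, and the dot product of a rotated query and a rotated key depends on the difference between their positions rather than on the absolute positions.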