RoPE-2D: Two-dimensional Rotary Positional Embeddings, a position encoding method that captures the relative row (height) and column (width) offsets between image patches, letting the encoder process images of varying resolution and aspect ratio
Pixtral-ViT: The custom 400M-parameter vision transformer trained from scratch for Pixtral, capable of ingesting images at native aspect ratios
MM-MT-Bench: Multimodal Multi-Turn Benchmark—a new dataset created by the authors to evaluate multimodal assistants in practical, multi-turn conversation scenarios
Explicit prompts: Evaluation prompts that rigorously define the required output format (e.g., 'Final answer: X') to prevent scoring errors due to formatting mismatches
GeLU: Gaussian Error Linear Unit—a smooth activation function used in the projection layer between the vision encoder and language decoder
ImageNet: A large labeled image dataset commonly used to train and benchmark vision encoders, typically at a fixed input resolution; Pixtral's encoder departs from this fixed-resolution design
ELO: The Elo rating system, which ranks models from the outcomes of pairwise comparisons (wins/losses); used here for the LMSys Vision Leaderboard
Pearson Correlation Coefficient: A statistic measuring linear correlation between two variables (here, benchmark scores and human preference ratings)
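
The two statistics defined above (Elo ratings and the Pearson correlation coefficient) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure used by the authors or by the leaderboard: the function names, the K-factor of 32, and the 400-point scale are conventional textbook choices, and the classic one-game-at-a-time Elo update shown here may differ from how a given leaderboard actually fits its ratings.

```python
import math


def elo_expected(r_a, r_b):
    """Expected score of A against B under the standard logistic Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def elo_update(r_a, r_b, score_a, k=32.0):
    """Return updated (r_a, r_b) after one comparison.

    score_a is 1.0 if A wins, 0.0 if A loses, 0.5 for a tie.
    k is an assumed K-factor; real systems tune it or fit ratings differently.
    """
    e_a = elo_expected(r_a, r_b)
    delta = k * (score_a - e_a)
    # Ratings are zero-sum: whatever A gains, B loses.
    return r_a + delta, r_b - delta


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)
```

For two models that both start at a rating of 1000, `elo_update(1000, 1000, 1.0)` moves the winner up by 16 points and the loser down by 16. Calling `pearson(benchmark_scores, human_ratings)` on two hypothetical lists (one entry per model) returns a value between -1 and 1, where values near 1 mean the benchmark ranks models in close linear agreement with human preference.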