Visual Prefix: A sequence of continuous embeddings derived from an image and placed ahead of the text as a prompt for the language model, playing the same role as the embeddings of text tokens
Frozen: The proposed method, in which the language model's parameters are held fixed (frozen) and only the vision encoder is trained
Fast Binding: The ability to associate a new word with a visual category from just a few examples and immediately use it correctly
NF-ResNet-50: Normalizer-Free ResNet-50, a specific convolutional neural network architecture used as the vision encoder
C4: Colossal Clean Crawled Corpus, the large web-crawled text dataset used to pre-train the language model
In-context learning: The ability of a model to improve performance on a task by seeing examples of that task within the input prompt, without weight updates
Autoregressive: Describes a model that predicts each element of a sequence conditioned on the elements that precede it
Conceptual Captions: A dataset of roughly 3 million image-caption pairs used to train the vision encoder
VQAv2: Visual Question Answering version 2, a benchmark dataset for testing the model's ability to answer questions about images
OKVQA: Outside Knowledge VQA, a benchmark requiring external knowledge not present in the image to answer correctly
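
To make the "Visual Prefix" and "Frozen" entries concrete, here is a minimal sketch in plain Python of how an image could be turned into a short sequence of embeddings and placed ahead of text-token embeddings. It is an illustration under stated assumptions, not the paper's implementation: the names `visual_prefix`, `build_prompt`, `projection`, and `embedding_table` are hypothetical, the sizes are arbitrary toy values, and a single random linear projection stands in for the trainable vision encoder.

```python
import random

def linear(x, weight):
    """Multiply vector x (length d_in) by weight (d_in rows, d_out columns)."""
    return [sum(xi * row[j] for xi, row in zip(x, weight))
            for j in range(len(weight[0]))]

def visual_prefix(image_features, projection, n_prefix, d_model):
    """Map one image feature vector to n_prefix embeddings of width d_model."""
    flat = linear(image_features, projection)  # length n_prefix * d_model
    return [flat[i * d_model:(i + 1) * d_model] for i in range(n_prefix)]

def build_prompt(prefix, token_ids, embedding_table):
    """Place the visual prefix ahead of the text-token embeddings."""
    return prefix + [embedding_table[t] for t in token_ids]

rng = random.Random(0)
d_feat, d_model, n_prefix, vocab = 8, 4, 2, 10  # toy sizes (assumptions)

# Trainable side (assumption: one projection stands in for the vision encoder).
projection = [[rng.gauss(0, 0.1) for _ in range(n_prefix * d_model)]
              for _ in range(d_feat)]
# Frozen side: the language model's token embeddings are never updated.
embedding_table = [[rng.gauss(0, 0.1) for _ in range(d_model)]
                   for _ in range(vocab)]

image_features = [rng.random() for _ in range(d_feat)]
prefix = visual_prefix(image_features, projection, n_prefix, d_model)
prompt = build_prompt(prefix, [3, 7, 1], embedding_table)  # hypothetical token ids
print(len(prompt), len(prompt[0]))
```

With these toy sizes the prompt holds 2 prefix embeddings followed by 3 token embeddings, each of width 4. An autoregressive language model would consume this sequence exactly as it would a purely textual prompt, which is why only the vision side needs gradient updates.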