MLM: Masked Language Modeling—a pre-training objective where the model predicts randomly hidden tokens in a sequence based on bidirectional context.
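A minimal sketch of the masking step, using toy whitespace tokens (BERT's full recipe also sometimes keeps the original token or substitutes a random one; that detail is omitted here):

```python
import random

def mask_tokens(tokens, mask_token="[MASK]", mask_prob=0.15, seed=42):
    """Randomly hide tokens; the model's job is to predict the originals.

    Returns the masked sequence and per-position targets
    (None where the position is ignored by the loss).
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)  # hidden from the model
            labels.append(tok)         # prediction target at this position
        else:
            masked.append(tok)
            labels.append(None)        # not scored by the MLM loss
    return masked, labels

tokens = "the cat sat on the mat".split()
masked, labels = mask_tokens(tokens)
```

Because the encoder attends bidirectionally, each masked position is predicted from the context on both sides of it.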
Encoder-only: Models like BERT that process the entire input sequence simultaneously (bidirectional attention), typically used for understanding rather than generation.
Verbalizer: A specific token (e.g., 'Positive') mapped to a class label (e.g., Sentiment=1) that the model is expected to generate.
Cloze question: A fill-in-the-blank test in which a participant supplies a missing word in a text; used here as the format for generative classification.
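The verbalizer and cloze formats combine into a simple classification rule: fill the blank with whichever verbalizer token the model scores highest. A toy sketch, with hypothetical scores standing in for the logits an MLM head would assign at the masked position:

```python
def classify_cloze(mask_logits, verbalizer):
    """Return the class whose verbalizer token scores highest at the blank.

    mask_logits: token -> score at the masked position (assumed to come
    from an MLM head); verbalizer: class label -> token.
    """
    return max(
        verbalizer,
        key=lambda label: mask_logits.get(verbalizer[label], float("-inf")),
    )

# Hypothetical scores at the blank in:
#   "The movie was great. Overall it was [MASK]."
logits = {"Positive": 4.2, "Negative": -1.3, "the": 2.0}
verbalizer = {1: "Positive", 0: "Negative"}
print(classify_cloze(logits, verbalizer))  # → 1
```

Only the verbalizer tokens are compared, so high-scoring but irrelevant tokens (like "the" above) never affect the predicted class.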
MMLU: Massive Multitask Language Understanding—a benchmark covering 57 subjects across STEM, the humanities, and the social sciences, used to test general knowledge and problem-solving.
FLAN: Finetuned Language Net—a collection of datasets transformed into instruction-following formats, used here for training the encoder.
MMLU-Pro: A harder version of MMLU with more distractors (10 options vs 4) and reasoning-focused questions.