MCSU: Minimal Complete Semantic Unit—a word, number, or punctuation mark representing the smallest unit of complete meaning, used to align different tokenizers (e.g., 'apple' is an MCSU, 'ap' is not).
DDS: Distribution Distance-based Dynamic Selection—a strategy to filter outlier probability distributions from an ensemble based on their KL divergence from the consensus.
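The filtering idea behind DDS can be sketched in a few lines. This is a toy illustration only: the elementwise-average consensus, the fixed threshold value, and the function names are assumptions for demonstration, not necessarily the exact formulation used by DDS.

```python
import math

def kl(p, q):
    """KL(P || Q) over two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def dds_filter(dists, threshold):
    """Toy sketch: drop any distribution whose KL divergence from the
    consensus exceeds `threshold`. Consensus here is the elementwise
    average of the ensemble (an illustrative assumption)."""
    n = len(dists)
    consensus = [sum(d[i] for d in dists) / n for i in range(len(dists[0]))]
    return [d for d in dists if kl(d, consensus) <= threshold]

ensemble = [
    [0.50, 0.30, 0.20],   # two models roughly agree...
    [0.45, 0.35, 0.20],
    [0.05, 0.05, 0.90],   # ...one outlier disagrees sharply
]
kept = dds_filter(ensemble, threshold=0.3)
print(len(kept))  # the outlier is filtered; 2 distributions remain
```

With this toy threshold, the two mutually consistent distributions stay in the ensemble while the outlier (KL ≈ 0.49 from the average) is discarded.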
KL divergence: A statistical measure quantifying how one probability distribution differs from a second, reference probability distribution.
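For discrete distributions such as next-token probability vectors, KL divergence has a direct closed form; a minimal sketch (the function name and example values are illustrative):

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) for discrete distributions given as equal-length lists.
    Asymmetric and >= 0; zero iff P and Q are identical.
    Terms with p_i = 0 contribute nothing by convention."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
print(round(kl_divergence(p, q), 4))  # 0.0253
```

Note the asymmetry: KL(P || Q) generally differs from KL(Q || P), which is why the reference distribution must be chosen deliberately.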
Top-k sampling: A decoding strategy that restricts sampling to the k highest-probability next tokens, truncating the long tail of unlikely candidates.
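Top-k sampling can be sketched as: sort candidates by probability, keep the top k, renormalize, and draw. The function name and toy vocabulary below are illustrative assumptions.

```python
import random

def top_k_sample(probs, k, rng=None):
    """Sample a token after restricting to the k highest-probability
    candidates. `probs` maps token -> probability."""
    rng = rng or random.Random()
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in top)   # renormalization constant
    r = rng.random() * total
    for tok, p in top:
        r -= p
        if r <= 0:
            return tok
    return top[-1][0]  # guard against floating-point rounding

probs = {"the": 0.4, "a": 0.3, "cat": 0.2, "zzz": 0.1}
print(top_k_sample(probs, k=2))  # can only ever print "the" or "a"
```

With k=2, "cat" and "zzz" are excluded outright; the remaining mass is split 4:3 between "the" and "a".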
CoT: Chain-of-Thought—a prompting technique where the model generates intermediate reasoning steps before the final answer.
Vocabulary Misalignment: The issue where different models use different subword tokenizers, making their output probability vectors incompatible for direct element-wise operations.
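A toy illustration of the misalignment problem, with invented miniature vocabularies: the two models' probability vectors differ in length, and even matching indices refer to different tokens, so elementwise operations like averaging are meaningless.

```python
# Invented toy vocabularies for illustration only.
vocab_a = ["apple", "pie", "the"]       # model A keeps "apple" whole
vocab_b = ["ap", "ple", "pie", "the"]   # model B splits it into subwords

probs_a = [0.6, 0.3, 0.1]        # P(next token) under model A
probs_b = [0.4, 0.3, 0.2, 0.1]   # P(next token) under model B

# The vectors cannot be combined elementwise:
assert len(probs_a) != len(probs_b)        # different dimensionality
assert vocab_a[0] != vocab_b[0]            # index 0: "apple" vs "ap"
print("misaligned")
```

Aligning such models at MCSU boundaries (e.g., the whole word "apple") rather than at subword tokens is one way around this incompatibility.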