RMI: Reverse Mutual Information—A metric quantifying the information gain about the query provided by the answer, calculated as log PPL(Q) - log PPL(Q|A).
IFD: Instruction-Following Difficulty—A baseline metric measuring the difficulty of generating an answer given a query (A|Q).
PPL: Perplexity—A measurement of how well a probability model predicts a sample; lower values indicate better prediction.
Cognitive Gap: The difference in a sample's RMI rank between a strong model and a weak model, used to identify samples that are valid (the strong model recognizes them) but challenging (the weak model finds them hard).
Stratified RMI: Partitioning data into bins by query perplexity before ranking by RMI within each bin, so that simple and complex queries are each compared only against peers of similar difficulty.
DeepSeek-Coder: The specific strong language model architecture used for evaluation and selection in this paper.
Qwen3: The specific weak language model used to calculate the disagreement signal.
Hallucination: In this context, synthetic data where the query makes no sense or the answer is unrelated to the query.
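
Illustrative sketch: the definitions of PPL, RMI, Stratified RMI, and Cognitive Gap above can be tied together in a few lines of Python. This is a minimal sketch under stated assumptions, not the paper's implementation. It assumes per-token log-probabilities for the query have already been obtained from a model, both with and without the answer as context. All function names are hypothetical, as are the equal-size binning rule and the sign convention for the gap (weak-model rank minus strong-model rank); the glossary does not specify them.

```python
import math

def log_ppl(token_logprobs):
    """Log-perplexity: mean negative log-likelihood per token."""
    return -sum(token_logprobs) / len(token_logprobs)

def ppl(token_logprobs):
    """Perplexity: exponential of the log-perplexity (lower is better)."""
    return math.exp(log_ppl(token_logprobs))

def rmi(q_logprobs, q_given_a_logprobs):
    """RMI = log PPL(Q) - log PPL(Q|A), as defined in the glossary."""
    return log_ppl(q_logprobs) - log_ppl(q_given_a_logprobs)

def rank_desc(scores):
    """Rank of each item when sorted by score, 0 = highest score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranks = [0] * len(scores)
    for pos, i in enumerate(order):
        ranks[i] = pos
    return ranks

def stratified_ranks(query_log_ppls, rmi_scores, n_bins):
    """Bin samples by query perplexity, then rank by RMI inside each bin.

    Assumption: bins hold (roughly) equal numbers of samples.
    """
    n = len(rmi_scores)
    order = sorted(range(n), key=lambda i: query_log_ppls[i])
    bin_size = math.ceil(n / n_bins)
    ranks = [0] * n
    for start in range(0, n, bin_size):
        members = order[start:start + bin_size]
        local = rank_desc([rmi_scores[i] for i in members])
        for i, r in zip(members, local):
            ranks[i] = r
    return ranks

def cognitive_gap(strong_rmi, weak_rmi):
    """Per-sample rank difference (assumed sign: weak rank - strong rank).

    A large positive value means the strong model ranks the sample
    highly while the weak model ranks it low.
    """
    strong = rank_desc(strong_rmi)
    weak = rank_desc(weak_rmi)
    return [w - s for s, w in zip(strong, weak)]
```

Under this sign convention, samples with the largest positive gap are the ones the glossary describes as valid but challenging. Within-bin ranks from stratified_ranks could be substituted for the global ranks inside cognitive_gap if the two ideas are meant to be combined; the glossary does not say whether they are.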