
QAQ: Bidirectional Semantic Coherence for Selecting High-Quality Synthetic Code Instructions

Jiayin Lei, Ming Ma, Yunxi Duan, Chenxi Li, Tianming Yang
Beijing University of Technology; Center for Excellence in Brain Science and Intelligence Technology, Chinese Academy of Sciences; University of Chicago
arXiv (2026)
Pretraining · Factuality · Reasoning

📝 Paper Summary

Synthetic Data Selection · Instruction Tuning · Data Curation
QAQ selects high-quality synthetic data by measuring how well the answer predicts the query (Reverse Mutual Information) and prioritizing samples where strong and weak models disagree on difficulty.
Core Problem
Standard metrics like IFD assess how easily a model generates an answer given a query ($A|Q$), which fails to detect synthetic hallucinations in which a nonsensical query elicits a confident but meaningless answer.
Why it matters:
  • Synthetic data generation creates massive noise and 'hallucinations' that surface-level metrics cannot detect
  • Current selection methods focus on answer quality or generation difficulty, ignoring whether the query itself is semantically coherent or meaningful
  • Training on trivial or nonsensical data (high volume, low signal) wastes compute and degrades model performance
Concrete Example: A synthetic query asks to 'convert a gritty sulla matrix' (fabricated terminology). A model generates syntactically valid code echoing these terms. Forward metrics ($A|Q$) score this high because the answer follows the instruction, but the data provides zero learning signal.
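The failure mode above can be sketched in a few lines. This is a toy illustration, not the paper's implementation: the log-likelihoods are hard-coded assumptions standing in for scores a real language model would produce via teacher forcing, and all names (`TOY_LOGPROBS`, `forward_score`, `reverse_score`) are hypothetical.

```python
# Toy stand-in for a language model's mean per-token log-likelihood.
# In practice these numbers come from scoring with a real LM; the values
# below are illustrative assumptions chosen to mimic the failure mode.
TOY_LOGPROBS = {
    # (context, target): mean log p(target tokens | context)
    ("good_query", "good_answer"): -0.8,      # answer follows a sensible query
    ("good_answer", "good_query"): -0.9,      # and the answer predicts the query back
    ("nonsense_query", "echo_answer"): -0.7,  # model confidently echoes fabricated terms
    ("echo_answer", "nonsense_query"): -3.5,  # but the answer cannot explain the query
}

def forward_score(query: str, answer: str) -> float:
    """Forward coherence (A|Q): how predictable is the answer given the query."""
    return TOY_LOGPROBS[(query, answer)]

def reverse_score(query: str, answer: str) -> float:
    """Reverse coherence (Q|A): how well does the answer explain the query."""
    return TOY_LOGPROBS[(answer, query)]

# Forward scoring rates both samples as similarly "easy"...
print(forward_score("good_query", "good_answer"))      # -0.8
print(forward_score("nonsense_query", "echo_answer"))  # -0.7
# ...but the reverse direction exposes the hallucinated pair.
print(reverse_score("good_query", "good_answer"))      # -0.9
print(reverse_score("nonsense_query", "echo_answer"))  # -3.5
```

The hallucinated pair only stands out when scored in the $Q|A$ direction, which is the intuition behind the reverse metric.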
Key Novelty
Reverse Mutual Information (RMI) & Cognitive Gap Selection
  • Evaluates data in reverse ($Q|A$): checks whether the answer explains the question. If seeing the answer doesn't help predict the question, the query and answer are semantically misaligned.
  • Identifies a 'Cognitive Gap': Selects samples where a strong model sees high coherence (validity) but a weak model sees low coherence (difficulty), ensuring data is both correct and challenging.
  • Stratifies selection by query perplexity to prevent biasing against complex questions.
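The selection recipe described in these bullets can be sketched as follows. This is a minimal reconstruction under stated assumptions, not the authors' code: `Sample`, `cognitive_gap`, and `stratified_select` are hypothetical names, and the coherence/perplexity values are assumed to be precomputed by a strong and a weak scoring model.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    query_ppl: float   # query perplexity, used only for stratification
    strong_coh: float  # reverse (Q|A) coherence under the strong model
    weak_coh: float    # reverse (Q|A) coherence under the weak model

def cognitive_gap(s: Sample) -> float:
    # Large when the strong model finds the pair coherent (likely valid)
    # but the weak model does not (likely challenging, hence learnable).
    return s.strong_coh - s.weak_coh

def stratified_select(samples, n_buckets=4, frac=0.25):
    """Bucket samples by query perplexity, then keep the top-`frac` gap
    within each bucket, so complex (high-perplexity) queries are not
    filtered out wholesale by a single global threshold."""
    ranked = sorted(samples, key=lambda s: s.query_ppl)
    size = max(1, len(ranked) // n_buckets)
    buckets = [ranked[i:i + size] for i in range(0, len(ranked), size)]
    selected = []
    for bucket in buckets:
        k = max(1, round(frac * len(bucket)))
        selected += sorted(bucket, key=cognitive_gap, reverse=True)[:k]
    return selected
```

Ranking by the gap rather than by either model's score alone is what distinguishes this disagreement-based (Diff-High) selection from consensus-based (Sum-High) selection.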
Evaluation Highlights
  • Selecting just 25% of data using Stratified RMI matches full-dataset training performance on HumanEval+ (72.56 vs 72.56)
  • Disagreement-based selection (Diff-High) outperforms consensus-based selection (Sum-High) by 3.05 points on HumanEval+ (71.95 vs 68.90)
  • Outperforms existing selection methods (IFD, SCAR, Random) across HumanEval(+) and MBPP(+) benchmarks at equivalent data sizes
Breakthrough Assessment
8/10
Introduces a theoretically grounded reverse-direction metric that effectively filters hallucinations, a pervasive problem in synthetic data. Using model disagreement to target 'learnable' samples is a strong, intuitive contribution.