
QAQ: Bidirectional Semantic Coherence for Selecting High-Quality Synthetic Code Instructions

Jiayin Lei, Ming Ma, Yunxi Duan, Chenxi Li, Tianming Yang
Beijing University of Technology; Center for Excellence in Brain Science and Intelligence Technology, Chinese Academy of Sciences; University of Chicago
arXiv (2026)
Pretraining · Factuality · Reasoning

📝 Paper Summary

Synthetic Data Selection · Instruction Tuning · Data Curation
QAQ selects high-quality synthetic data by measuring how well the answer predicts the query (Reverse Mutual Information) and prioritizing samples where strong and weak models disagree on difficulty.
Core Problem
Standard metrics like IFD assess how easily a model generates an answer given a query ($A|Q$), which fails to detect synthetic hallucinations in which a nonsensical query elicits a confident but meaningless answer.
Why it matters:
  • Synthetic data generation creates massive noise and 'hallucinations' that surface-level metrics cannot detect
  • Current selection methods focus on answer quality or generation difficulty, ignoring whether the query itself is semantically coherent or meaningful
  • Training on trivial or nonsensical data (high volume, low signal) wastes compute and degrades model performance
Concrete Example: A synthetic query asks to 'convert a gritty sulla matrix' (fabricated terminology). A model generates syntactically valid code echoing these terms. Forward metrics ($A|Q$) score this high because the answer follows the instruction, but the data provides zero learning signal.
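The failure mode above can be sketched in a few lines. This is a toy illustration, not the paper's implementation: the log-likelihoods are hard-coded assumptions standing in for scores a real language model would produce via teacher forcing, and all names (`TOY_LOGPROBS`, `forward_score`, `reverse_score`) are hypothetical.

```python
# Toy stand-in for a language model's mean per-token log-likelihood.
# In practice these numbers come from scoring with a real LM; the values
# below are illustrative assumptions chosen to mimic the failure mode.
TOY_LOGPROBS = {
    # (context, target): mean log p(target tokens | context)
    ("good_query", "good_answer"): -0.8,      # answer follows a sensible query
    ("good_answer", "good_query"): -0.9,      # and the answer predicts the query back
    ("nonsense_query", "echo_answer"): -0.7,  # model confidently echoes fabricated terms
    ("echo_answer", "nonsense_query"): -3.5,  # but the answer cannot explain the query
}

def forward_score(query: str, answer: str) -> float:
    """Forward coherence (A|Q): how predictable is the answer given the query."""
    return TOY_LOGPROBS[(query, answer)]

def reverse_score(query: str, answer: str) -> float:
    """Reverse coherence (Q|A): how well does the answer explain the query."""
    return TOY_LOGPROBS[(answer, query)]

# Forward scoring rates both samples as similarly "easy"...
print(forward_score("good_query", "good_answer"))      # -0.8
print(forward_score("nonsense_query", "echo_answer"))  # -0.7
# ...but the reverse direction exposes the hallucinated pair.
print(reverse_score("good_query", "good_answer"))      # -0.9
print(reverse_score("nonsense_query", "echo_answer"))  # -3.5
```

The hallucinated pair only stands out when scored in the $Q|A$ direction, which is the intuition behind the reverse metric.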
Key Novelty
Reverse Mutual Information (RMI) & Cognitive Gap Selection
  • Evaluates data in reverse ($Q|A$): checks whether the answer explains the question. If seeing the answer doesn't help predict the question, the query and answer are semantically misaligned.
  • Identifies a 'Cognitive Gap': Selects samples where a strong model sees high coherence (validity) but a weak model sees low coherence (difficulty), ensuring data is both correct and challenging.
  • Stratifies selection by query perplexity to prevent biasing against complex questions.
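The selection recipe described in these bullets can be sketched as follows. This is a minimal reconstruction under stated assumptions, not the authors' code: `Sample`, `cognitive_gap`, and `stratified_select` are hypothetical names, and the coherence/perplexity values are assumed to be precomputed by a strong and a weak scoring model.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    query_ppl: float   # query perplexity, used only for stratification
    strong_coh: float  # reverse (Q|A) coherence under the strong model
    weak_coh: float    # reverse (Q|A) coherence under the weak model

def cognitive_gap(s: Sample) -> float:
    # Large when the strong model finds the pair coherent (likely valid)
    # but the weak model does not (likely challenging, hence learnable).
    return s.strong_coh - s.weak_coh

def stratified_select(samples, n_buckets=4, frac=0.25):
    """Bucket samples by query perplexity, then keep the top-`frac` gap
    within each bucket, so complex (high-perplexity) queries are not
    filtered out wholesale by a single global threshold."""
    ranked = sorted(samples, key=lambda s: s.query_ppl)
    size = max(1, len(ranked) // n_buckets)
    buckets = [ranked[i:i + size] for i in range(0, len(ranked), size)]
    selected = []
    for bucket in buckets:
        k = max(1, round(frac * len(bucket)))
        selected += sorted(bucket, key=cognitive_gap, reverse=True)[:k]
    return selected
```

Ranking by the gap rather than by either model's score alone is what distinguishes this disagreement-based (Diff-High) selection from consensus-based (Sum-High) selection.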
Evaluation Highlights
  • Selecting just 25% of data using Stratified RMI matches full-dataset training performance on HumanEval+ (72.56 vs 72.56)
  • Disagreement-based selection (Diff-High) outperforms consensus-based selection (Sum-High) by 3.05 points on HumanEval+ (71.95 vs 68.90)
  • Outperforms existing selection methods (IFD, SCAR, Random) across HumanEval(+) and MBPP(+) benchmarks at equivalent data sizes
Breakthrough Assessment
8/10
Introduces a theoretically grounded reverse-direction metric that effectively filters hallucinations, a pervasive problem in synthetic data. Using model disagreement to target 'learnable' samples is a strong, intuitive contribution.