RLVR: Reinforcement Learning with Verifiable Rewards—training LLMs using RL where the reward is determined by a deterministic check of the final answer (e.g., math problems).
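A minimal sketch of such a verifiable reward, assuming the model is prompted to end its output with an "Answer:" line (the extraction convention and function names are illustrative, not the paper's):

```python
def extract_final_answer(completion: str) -> str:
    # Assumed convention: the model ends with a line "Answer: <value>".
    for line in reversed(completion.strip().splitlines()):
        if line.startswith("Answer:"):
            return line[len("Answer:"):].strip()
    return ""

def verifiable_reward(completion: str, ground_truth: str) -> float:
    # Binary reward: 1.0 if the extracted final answer matches exactly, else 0.0.
    return 1.0 if extract_final_answer(completion) == ground_truth else 0.0

print(verifiable_reward("Step 1: 2+2=4\nAnswer: 4", "4"))  # 1.0
```

In practice the string comparison is replaced by a more robust equivalence check (see Math-Verify below), but the reward remains a deterministic 0/1 signal.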
GRPO: Group Relative Policy Optimization—an RL algorithm that estimates advantages by comparing a group of outputs for the same prompt, eliminating the need for a separate value function critic.
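The group-relative advantage at the heart of GRPO can be sketched as follows (a simplified standalone version; the full algorithm also includes clipped policy-gradient updates):

```python
import statistics

def group_relative_advantages(rewards: list[float]) -> list[float]:
    # GRPO-style advantage: normalize each sampled output's reward by the
    # mean and standard deviation of its group (all outputs for the same
    # prompt), replacing a learned value-function critic.
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        # All outputs tied (all correct or all wrong): no learning signal.
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]

print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))  # [1.0, -1.0, 1.0, -1.0]
```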
SwS: Self-aware Weakness-driven Problem Synthesis—the proposed framework.
Failure Rate: A metric used to identify model weaknesses; a problem is flagged as a failure if the model's accuracy on it never reaches 50% and shows a negative slope (a downward trend) across training epochs.
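A hedged sketch of such a failure filter, assuming per-epoch accuracies are recorded for each problem and the trend is measured by a least-squares slope (these implementation details are assumptions, not the paper's exact procedure):

```python
def is_failure(accuracies: list[float], threshold: float = 0.5) -> bool:
    # A problem counts as a failure if accuracy never reaches the threshold
    # AND the accuracy trend over epochs is negative.
    n = len(accuracies)
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(accuracies) / n
    # Least-squares slope of accuracy vs. epoch index.
    num = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, accuracies))
    den = sum((x - x_mean) ** 2 for x in xs)
    slope = num / den if den else 0.0
    return max(accuracies) < threshold and slope < 0

print(is_failure([0.4, 0.3, 0.2]))  # True: always below 50%, trending down
```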
Concept Recombination: The process of taking keywords/topics from failed problems (e.g., 'geometry', 'area') and combining them to prompt a generator for new questions.
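A sketch of the recombination step, assuming failed-problem keywords have already been extracted; the prompt template here is a hypothetical stand-in, not the paper's actual generator prompt:

```python
import random

def recombine_concepts(failed_keywords: list[str], k: int = 2,
                       n_prompts: int = 3, seed: int = 0) -> list[str]:
    # Sample k-keyword combinations from concepts the model failed on and
    # format them into prompts for a problem-generator model.
    rng = random.Random(seed)  # seeded for reproducibility of the sketch
    prompts = []
    for _ in range(n_prompts):
        combo = rng.sample(failed_keywords, k)
        prompts.append(
            "Write a new math problem combining the concepts: " + ", ".join(combo)
        )
    return prompts

for p in recombine_concepts(["geometry", "area", "ratio", "probability"]):
    print(p)
```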
Math-Verify: A library used to rigorously check whether a generated math answer matches the ground truth, handling equivalent representations (e.g., fractions vs. decimals).
Self-consistency: A method in which a model generates multiple answers to the same question, and the most frequent final answer is selected as the pseudo-label.
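The majority-vote step of self-consistency reduces to a small sketch (assuming final answers have already been extracted from the sampled completions):

```python
from collections import Counter

def self_consistency_label(answers: list[str]) -> tuple[str, float]:
    # Majority vote over sampled final answers: the most frequent answer
    # becomes the pseudo-label; also return its vote share as a rough
    # confidence signal. Ties resolve by insertion order (Counter behavior).
    counts = Counter(answers)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(answers)

print(self_consistency_label(["42", "41", "42", "42", "40"]))  # ('42', 0.6)
```

The vote share can be used to discard low-agreement pseudo-labels before training, though whether this work applies such a filter is not stated here.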
Pass@K: A metric measuring the probability that at least one of K generated samples is correct.
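Pass@K is commonly computed with the standard unbiased estimator: draw k samples without replacement from n generations of which c are correct, giving 1 - C(n-c, k)/C(n, k):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased pass@k estimator: probability that at least one of k samples
    # drawn without replacement from n generations (c of them correct)
    # is correct.
    if n - c < k:
        return 1.0  # too few incorrect samples to fill all k draws
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=10, c=3, k=2))  # 8/15 ≈ 0.533
```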
SFT: Supervised Fine-Tuning—training on labeled data before RL.
KL term: Kullback-Leibler divergence—a penalty term often added to the RL objective to keep the updated policy close to a reference policy; omitted in this paper's optimization.
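A common way to estimate this penalty from sampled sequences is the mean per-token log-probability gap between the two policies; a minimal sketch (the averaging scheme is an assumption, and this per-sample estimate can be negative even though the true KL is non-negative):

```python
def sequence_kl(logprobs_policy: list[float], logprobs_ref: list[float]) -> float:
    # Monte-Carlo KL estimate over one sampled sequence:
    # KL(pi || pi_ref) ~ mean_t [ log pi(a_t) - log pi_ref(a_t) ],
    # using the log-probs each policy assigns to the sampled tokens.
    gaps = [lp - lr for lp, lr in zip(logprobs_policy, logprobs_ref)]
    return sum(gaps) / len(gaps)

print(sequence_kl([-1.0, -2.0], [-1.5, -2.5]))  # 0.5
```

Dropping this term, as done here, removes the pull toward the reference policy and relies on the verifiable reward alone to shape behavior.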