MwpBench: The benchmark proposed in this paper, comprising 10 distinct math datasets (K-12 to college level) evaluated under a unified protocol
MathScaleQA: The synthetic dataset of 2 million math question-answer pairs generated by the MathScale pipeline
Concept Graph: A graph where nodes are math topics/knowledge points and edges represent co-occurrence in seed questions, used to sample new concept combinations
GSM8K: A popular dataset of grade school math word problems
MATH: A dataset of challenging competition-level mathematics problems
Knowledge Points (KPs): Fine-grained math concepts (e.g., 'Pythagorean theorem') extracted from questions
Topics: High-level mathematical subjects (e.g., 'Geometry', 'Algebra') extracted from questions
Fuzzy Match: An answer verification method that accepts a predicted answer as matching the ground truth even when the two are formatted slightly differently (e.g., minor text variations)
Greedy Decoding: A decoding strategy that always selects the highest-probability token at each step, eliminating sampling randomness
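
To make the Concept Graph entry above more concrete, here is a minimal sketch in Python. It assumes each seed question has already been reduced to a list of concept names (topics or knowledge points). The function names `build_concept_graph` and `sample_concepts` are hypothetical, and the weighted random walk is only one plausible way to sample concept combinations; the glossary does not say which sampling procedure the pipeline actually uses.

```python
import random
from collections import defaultdict
from itertools import combinations


def build_concept_graph(seed_concepts):
    """Build an undirected co-occurrence graph.

    seed_concepts: list of lists of concept names, one list per seed question.
    Returns graph[a][b] = number of seed questions containing both a and b.
    """
    graph = defaultdict(lambda: defaultdict(int))
    for concepts in seed_concepts:
        # Count each unordered pair once per question.
        for a, b in combinations(sorted(set(concepts)), 2):
            graph[a][b] += 1
            graph[b][a] += 1
    return graph


def sample_concepts(graph, start, steps, rng):
    """Assumed sampling scheme: a random walk weighted by co-occurrence counts.

    Returns the list of visited concepts, beginning with `start`.
    """
    path = [start]
    current = start
    for _ in range(steps):
        neighbors = graph.get(current)
        if not neighbors:
            break  # isolated or unknown concept: nothing to walk to
        names = list(neighbors)
        weights = [neighbors[name] for name in names]
        current = rng.choices(names, weights=weights, k=1)[0]
        path.append(current)
    return path


if __name__ == "__main__":
    seeds = [
        ["Geometry", "Pythagorean theorem"],
        ["Geometry", "Algebra"],
        ["Geometry", "Pythagorean theorem"],
    ]
    graph = build_concept_graph(seeds)
    print(sample_concepts(graph, "Algebra", 3, random.Random(0)))
```

Weighting the walk by co-occurrence count keeps sampled combinations close to pairings that appear in the seed questions, while still allowing chains of concepts that never appeared together in a single question.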
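
The Fuzzy Match entry above can be illustrated with a short Python sketch. The normalization rules here (lowercasing, dropping `$` and thousands separators, collapsing whitespace, comparing numerically within a tolerance) are assumptions chosen for illustration, not the benchmark's actual rules, and `normalize_answer` and `fuzzy_match` are hypothetical names.

```python
import re


def normalize_answer(text):
    """Strip formatting differences that should not affect correctness (assumed rules)."""
    s = text.strip().lower()
    s = s.replace("$", "").replace(",", "")  # currency sign, thousands separator
    s = re.sub(r"\s+", " ", s)               # collapse runs of whitespace
    s = s.rstrip(".")                        # trailing sentence period
    return s.strip()


def fuzzy_match(predicted, truth, tol=1e-6):
    """Return True if the two answers agree after normalization.

    Tries an exact string comparison first, then a numeric comparison
    within `tol` when both sides parse as floats.
    """
    p, t = normalize_answer(predicted), normalize_answer(truth)
    if p == t:
        return True
    try:
        return abs(float(p) - float(t)) <= tol
    except ValueError:
        return False  # at least one side is not a plain number


if __name__ == "__main__":
    print(fuzzy_match("$1,234.0", "1234"))
    print(fuzzy_match("x = 3", "3"))
```

Note that removing commas also alters answers such as tuples or intervals, so a production matcher would need answer-type-specific handling; this sketch only covers simple text and numeric answers.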
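
The Greedy Decoding entry above can also be shown as a short loop. This is a minimal sketch over a toy next-token function, not a real language model API: `greedy_decode` and `toy_logits` are hypothetical names, and tokens are plain integer ids.

```python
def greedy_decode(next_token_logits, prompt, eos_token, max_new_tokens):
    """Extend `prompt` by repeatedly appending the highest-scoring next token.

    next_token_logits: callable mapping a token list to a list of scores,
    one per vocabulary id. Picking the argmax of the scores is equivalent
    to picking the highest-probability token, since softmax is monotonic.
    """
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        logits = next_token_logits(tokens)
        # Argmax over vocabulary ids; ties resolve to the lowest id.
        next_token = max(range(len(logits)), key=logits.__getitem__)
        tokens.append(next_token)
        if next_token == eos_token:
            break
    return tokens


def toy_logits(tokens):
    """Toy 'model' over a 4-token vocabulary: always prefers (last id + 1) mod 4."""
    logits = [0.0] * 4
    logits[(tokens[-1] + 1) % 4] = 1.0
    return logits


if __name__ == "__main__":
    print(greedy_decode(toy_logits, [0], eos_token=3, max_new_tokens=10))
```

Because no sampling is involved, repeated calls with the same prompt return the same sequence, which is what makes greedy decoding convenient for reproducible benchmark evaluation.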