RLVR: Reinforcement Learning with Verifiable Rewards—training models using objective success signals (like passing a unit test) rather than imitating human text
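The core of a verifiable reward can be sketched in a few lines. This is a minimal illustration, not the paper's implementation; the function names (`rlvr_reward`, `sort_test`) are hypothetical:

```python
def rlvr_reward(candidate_solution, verifier) -> float:
    # Objective, binary reward: 1.0 iff the verifier (e.g., a unit test) passes.
    # Hypothetical sketch of the RLVR signal; no human preference labels involved.
    try:
        return 1.0 if verifier(candidate_solution) else 0.0
    except Exception:
        return 0.0  # crashing or malformed solutions earn no reward

# Example "unit test" verifier: checks a sorting function on a fixed case.
def sort_test(fn):
    return fn([3, 1, 2]) == [1, 2, 3]

print(rlvr_reward(sorted, sort_test))         # 1.0: passes the test
print(rlvr_reward(lambda xs: xs, sort_test))  # 0.0: identity does not sort
```

The key property is that the reward is computed by code, so it is objective and cheap to evaluate at scale, unlike imitation or preference signals.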
ReSyn: The proposed pipeline that autonomously generates diverse reasoning environments (generators + verifiers) using LLMs
BBH: Big-Bench Hard—a benchmark suite of challenging reasoning tasks where language models previously struggled
BBEH: Big-Bench Extra Hard—a harder version of BBH designed to test reasoning capabilities at a higher difficulty level
DAPO: Decoupled Clip and Dynamic Sampling Policy Optimization—an RL algorithm used here to train the policy model on verifier rewards
Observation Function: A function O(s) that converts internal problem parameters (e.g., a grid array) into a natural language question
Verifier: A code-based function V(a) that checks if a candidate answer 'a' satisfies the constraints of the problem instance
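The observation function and verifier work as a pair over the same internal problem state. The following toy environment (a summation task; all names are hypothetical) sketches that pairing under the definitions above:

```python
import random

def make_instance(rng):
    # Internal problem parameters s: hidden state the model never sees directly.
    nums = [rng.randint(1, 9) for _ in range(5)]
    return {"nums": nums, "answer": sum(nums)}

def observe(s):
    # Observation function O(s): renders internal state as a natural-language question.
    return f"What is the sum of the numbers {s['nums']}?"

def verify(s, a):
    # Verifier V(a): code-based check that candidate answer a satisfies the instance.
    return a == s["answer"]

rng = random.Random(0)
s = make_instance(rng)
print(observe(s))                  # the question posed to the model
print(verify(s, sum(s["nums"])))   # True: correct candidate answer
print(verify(s, -1))               # False: incorrect candidate answer
```

Because both O(s) and V(a) are derived from the same state s, the verifier can score any candidate answer without a reference solution being written by hand.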
Generator-Verifier Gap: The concept that verifying a solution is often computationally easier than finding it (e.g., checking a sorted list vs. sorting it), enabling supervision for hard problems
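The sorting example in the entry above makes the gap concrete: checking order takes a single linear pass, while producing the sorted list costs O(n log n). A minimal sketch:

```python
def is_sorted(xs):
    # Verification: one O(n) pass over adjacent pairs.
    return all(xs[i] <= xs[i + 1] for i in range(len(xs) - 1))

xs = [5, 2, 9, 1]
solution = sorted(xs)        # finding the solution: O(n log n)
print(is_sorted(xs))         # False: the raw input is unordered
print(is_sorted(solution))   # True: cheap check of an expensive-to-find answer
```

Because the check is easier than the search, a verifier can supervise a model on problems whose solutions the verifier's author never computed.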