RLVR: Reinforcement Learning with Verifiable Rewards—training models on tasks where correctness can be automatically checked (e.g., math, code)
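A verifiable reward is just an automatic checker. Below is a minimal sketch for a math-style task; the answer-extraction rule (compare the last non-empty line to a gold answer) is a hypothetical simplification, not the method described here:

```python
def math_reward(model_output: str, gold_answer: str) -> float:
    """Verifiable reward: 1.0 if the model's final answer matches the gold
    answer exactly, else 0.0. Extraction rule (last non-empty line) is
    an illustrative assumption."""
    lines = [ln.strip() for ln in model_output.strip().splitlines() if ln.strip()]
    answer = lines[-1] if lines else ""
    return 1.0 if answer == gold_answer else 0.0
```

Real systems typically use more robust extraction (e.g., parsing a boxed answer) or execute generated code against unit tests.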
Influence Function: A measure of how much a specific training data point contributes to the model's performance on a validation set (usually via gradient inner products)
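The gradient-inner-product form mentioned above can be illustrated as follows (a first-order sketch: the arrays stand in for flattened per-example and validation-set gradients; a positive score means the training example pushes parameters in a direction that also reduces validation loss):

```python
import numpy as np

def influence_score(train_grad: np.ndarray, val_grad: np.ndarray) -> float:
    """First-order influence estimate: inner product between a training
    example's loss gradient and the validation-set loss gradient."""
    return float(np.dot(train_grad, val_grad))

# Example: gradients pointing in similar directions -> positive influence.
g_train = np.array([1.0, 0.5, -0.2])
g_val = np.array([0.8, 0.4, 0.1])
```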
Off-Policy: Learning or estimating values for a target policy using data generated by a different (behavior) policy
GRPO: Group Relative Policy Optimization—an RL algorithm that estimates advantages by normalizing scores within a group of outputs for the same prompt, avoiding a separate critic model
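The group-relative advantage can be sketched in a few lines (a minimal illustration, not a full GRPO implementation; the `1e-8` stabilizer is an assumption):

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantages: normalize each sampled output's reward by
    the mean and std of its group (outputs for the same prompt), so no
    separate critic/value model is needed."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# Four completions sampled for one prompt, scored by a verifiable reward:
adv = grpo_advantages([1.0, 0.0, 0.0, 1.0])  # correct ones get positive advantage
```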
Rollout: The process of generating a complete sequence of tokens (a trajectory) from the policy, which is then scored by the reward function
Sparse Random Projection: A dimensionality reduction technique in which high-dimensional vectors are projected into a lower-dimensional space by multiplication with a sparse random matrix, preserving distances/angles with high probability
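One common construction is an Achlioptas-style sparse sign matrix, sketched below (the density parameter `s=3` is an illustrative choice, not necessarily the one used here):

```python
import numpy as np

def sparse_projection_matrix(d: int, k: int, s: int = 3, seed: int = 0):
    """Sparse random projection R of shape (d, k): entries are
    +sqrt(s/k), 0, -sqrt(s/k) with probabilities 1/(2s), 1 - 1/s, 1/(2s).
    For x in R^d, the projection x @ R approximately preserves norms
    and inner products in expectation."""
    rng = np.random.default_rng(seed)
    signs = rng.choice([1.0, 0.0, -1.0], size=(d, k),
                       p=[1 / (2 * s), 1 - 1 / s, 1 / (2 * s)])
    return np.sqrt(s / k) * signs

# e.g., compress 10k-dim gradient vectors to 128 dims before taking inner products
```

Sparsity matters here: roughly a 1/s fraction of entries are nonzero, so the projection is much cheaper to store and apply than a dense Gaussian one.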
CROPI: Curriculum RL with Off-Policy Influence Guidance—the proposed framework
POPI: Practical Off-Policy Influence estimation—the specific metric used to score data utility