GRPO: Group Relative Policy Optimization, a value-free RL method that estimates advantages by comparing each trajectory's reward against the mean reward of a group of trajectories sampled for the same task, avoiding a learned value network.
HCA: Hindsight Credit Assignment, a technique that estimates the value of an action by conditioning on the future outcome observed in the trajectory.
POMDP: Partially Observable Markov Decision Process, a framework in which an agent makes decisions based on incomplete knowledge of the environment state.
Importance Ratio: The ratio between the probability of an action under a target distribution (hindsight) and the behavior distribution (policy), used to re-weight updates.
Process Reward Models: Models trained to provide feedback at every step of a reasoning chain, typically requiring expensive human-annotated data.
Value-free methods: RL approaches that optimize policies without training a separate critic network to estimate state values, saving the memory that network would otherwise occupy.
Qwen2.5-7B-Instruct: The specific open-source Large Language Model used as the backbone for the agents in the experiments.
Generative Verification: The paper's method of using the LLM to re-evaluate its own past actions given the known successful outcome.
Do-no-harm mask: A mechanism that zeroes out negative hindsight signals in successful trials to prevent suppressing useful actions.
Temporal smoothing: A technique to distribute credit across adjacent reasoning and action steps to stabilize learning in rigid causal chains.
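
To make a few of these definitions concrete, here is a minimal Python sketch of three of them: the group-relative advantage (GRPO), the importance ratio, and the do-no-harm mask. It is an illustration under stated assumptions, not the paper's implementation. The function names are hypothetical, dividing by the group standard deviation follows common GRPO practice rather than anything stated in this glossary, and the mask is written as a simple clamp at zero because the glossary only says negative signals are zeroed out in successful trials.

```python
import math


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: each reward minus the group mean.

    Assumption: the result is also divided by the group's standard
    deviation, as is common in GRPO; the glossary mentions only the mean.
    """
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(variance)
    return [(r - mean) / (std + eps) for r in rewards]


def importance_ratio(logp_hindsight, logp_policy):
    """Ratio of the hindsight (target) probability of an action to its
    probability under the behavior policy, computed from log-probabilities."""
    return math.exp(logp_hindsight - logp_policy)


def do_no_harm(step_signals, trial_succeeded):
    """Zero out negative per-step signals when the trial succeeded.

    Assumption: failed trials are passed through unchanged.
    """
    if not trial_succeeded:
        return list(step_signals)
    return [max(s, 0.0) for s in step_signals]


if __name__ == "__main__":
    # Four trajectories sampled for the same task; two of them succeeded.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
    # The action is twice as likely in hindsight as under the policy.
    print(importance_ratio(math.log(0.6), math.log(0.3)))
    # The negative signal on the second step is suppressed.
    print(do_no_harm([0.4, -0.2, 0.1], trial_succeeded=True))  # [0.4, 0.0, 0.1]
```

Working in log-probabilities for the ratio avoids underflow when action probabilities are very small, which is the usual case for multi-token LLM actions.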