HARR: History-Aware Reinforced Retriever, the proposed framework
GRPO: Group Relative Policy Optimization, an RL algorithm that estimates advantages by comparing multiple sampled outputs for the same input, removing the need for a learned value function
Plackett-Luce model: A probability distribution over rankings of items, used here to sample ordered lists of documents stochastically
state aliasing: A situation in RL where different environment states appear identical to the agent (e.g., the same query arising from different histories), preventing optimal decision-making
sparse terminal reward: A reward signal received only at the end of the episode (final answer accuracy), with no intermediate feedback
sub-query: An intermediate search query generated by the LLM during multi-hop reasoning
dense retriever: A retrieval model that embeds queries and documents as dense vectors and ranks documents by vector similarity
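
Two of the terms above, Plackett-Luce sampling and GRPO's group-relative advantage, are easier to follow with a small example. The sketch below is illustrative only and is not taken from the HARR implementation. It assumes that document scores act as log-weights (softmax over the items not yet picked) and that advantages are rewards standardized within the group, which is the commonly described GRPO form. The function names, the `eps` parameter, and the sample scores and rewards are hypothetical.

```python
import math
import random
from statistics import mean, pstdev


def sample_plackett_luce(scores, k, rng):
    """Sample an ordered list of k distinct item indices.

    At each position, one of the remaining items is drawn with probability
    proportional to exp(score), then removed from the pool.
    """
    remaining = list(range(len(scores)))
    ranking = []
    for _ in range(min(k, len(remaining))):
        # Shift by the current max so exp() cannot overflow.
        top = max(scores[i] for i in remaining)
        weights = [math.exp(scores[i] - top) for i in remaining]
        pick = rng.choices(remaining, weights=weights, k=1)[0]
        ranking.append(pick)
        remaining.remove(pick)
    return ranking


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of samples for the same input.

    Assumed form: (reward - group mean) / (group std + eps). No value
    function is involved; the group itself serves as the baseline.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical retriever scores for four candidate documents.
    doc_scores = [2.0, 0.5, -1.0, 0.0]
    print(sample_plackett_luce(doc_scores, k=3, rng=rng))
    # Hypothetical sparse terminal rewards (1 = correct final answer).
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

With a sparse terminal reward, every sample in the group receives a single scalar at the end of its episode, so the within-group comparison is the only signal that separates good rankings from bad ones. If all rewards in a group are equal, the advantages in this sketch are all zero.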