action set: A strict subset of all possible actions that is presented to the user; it filters out poor choices while retaining a range of good options.
Lipschitz continuity: A smoothness property of a function whose rate of change is bounded by a constant L, so that |f(x) - f(y)| <= L * |x - y| for all inputs x and y; here, it means that a small change in the agency parameter epsilon changes the expected reward by at most a proportional amount.
simple regret: The difference between the expected payoff of the optimal parameter choice and that of the parameter actually selected by the algorithm after n rounds.
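zooming dimension: A measure of the difficulty of a Lipschitz bandit problem; it captures how quickly the number of near-optimal arms that must be explored grows as the tolerance for suboptimality shrinks. A smaller zooming dimension means an easier problem.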
Deep Q-Network (DQN): A reinforcement learning algorithm that uses a neural network to estimate the value (Q-value) of taking a specific action in a specific state.
min-max normalization: Rescaling data (here, action scores) to a fixed range, typically [0, 1].
SFT: Supervised Fine-Tuning, i.e., training a model on labeled examples before applying reinforcement learning (mentioned in the context of related work and baselines).
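
As an illustration of the min-max normalization entry above, the following is a minimal sketch of rescaling a list of action scores to [0, 1]. The function name `min_max_normalize`, the sample scores, and the handling of the all-equal case are assumptions made for this example; the source does not specify them.

```python
from typing import Sequence


def min_max_normalize(scores: Sequence[float]) -> list[float]:
    """Rescale scores linearly so the smallest maps to 0.0 and the largest to 1.0."""
    if not scores:
        return []
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # Assumption: when every score is identical the range is zero,
        # so we return 0.0 for each entry instead of dividing by zero.
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


# Hypothetical action scores, for illustration only.
action_scores = [2.0, 4.0, 6.0, 10.0]
print(min_max_normalize(action_scores))  # [0.0, 0.25, 0.5, 1.0]
```

The normalization preserves the ordering of the scores and their relative spacing; only the scale changes, which makes scores from different sources comparable on a common range.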