MONA: Myopic Optimization with Non-myopic Approval; a training method combining short-sighted optimization with rewards representing long-term approval
reward hacking: When an agent achieves high reward in a way the system designer did not intend and would not approve of
myopic optimization: Optimizing an agent to maximize only the immediate next reward (effectively discount factor gamma = 0), making it 'short-sighted'
non-myopic approval: A component of the reward function where an overseer estimates the future utility of the agent's current action
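The three terms above fit together in one update rule: the agent optimizes only the immediate reward (gamma = 0), but that reward includes an overseer's approval of the action's long-term consequences. A minimal sketch, assuming a toy overseer and helper names (`approval`, `mona_td_target`) that are illustrative, not the paper's implementation:

```python
def approval(state, action):
    """Hypothetical overseer: estimates the long-run value of taking
    `action` in `state` (here a toy heuristic, not a learned model)."""
    return 1.0 if action == "safe" else -1.0

def mona_td_target(env_reward, state, action, gamma=0.0, next_value=0.0):
    """Myopic TD target: with gamma = 0 the bootstrapped next-state value
    drops out, leaving only the immediate environment reward plus the
    overseer's non-myopic approval."""
    shaped = env_reward + approval(state, action)
    return shaped + gamma * next_value  # gamma = 0 => purely myopic

# A risky action with higher environment reward can still receive a lower
# training target than a safe action once approval is folded in.
safe_target = mona_td_target(env_reward=0.5, state="s0", action="safe")
risky_target = mona_td_target(env_reward=1.0, state="s0", action="risky")
```

Because the agent never bootstraps from future values, it has no direct incentive to set up multi-step reward-hacking schemes; any credit for long-term effects flows only through the overseer's approval term.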
spotlight: The subset of possible policies or strategies that human experts can understand and safely evaluate
steganography: Covertly encoding information within other data (e.g., hiding a message within innocuous-looking scratchpad text) so it can be passed secretly
Causal Influence Diagram: A directed acyclic graph (DAG) used to model the causal relationships and incentives among an agent's decisions, state variables, and rewards
instrumental control incentive: The incentive an agent has to control a specific variable if doing so allows it to achieve higher utility