PPO: Proximal Policy Optimization—a popular reinforcement learning algorithm that improves stability by limiting how much the policy can change in one step
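The clipping idea can be sketched in a few lines. This is a minimal illustration, assuming NumPy; the function name `ppo_clip_loss` and the sample values are illustrative, not from the source:

```python
import numpy as np

def ppo_clip_loss(ratio, advantage, eps=0.2):
    """PPO clipped surrogate objective (to be maximized; negate for a loss).

    ratio: pi_new(a|s) / pi_old(a|s) per sample
    advantage: estimated advantage per sample
    eps: clip range, limiting how far the new policy may move
    """
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantage
    # Taking the minimum removes any incentive to push the ratio
    # beyond the [1-eps, 1+eps] trust region.
    return np.minimum(unclipped, clipped).mean()

# A ratio of 1.5 on a positive-advantage sample is capped at 1.2:
print(ppo_clip_loss(np.array([1.5]), np.array([1.0])))  # 1.2
```

Because the objective is flat outside the clip range, gradient ascent cannot profit from moving the policy further than `eps` away from the old policy in a single update.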
Actor-Critic: An RL architecture with two networks: an Actor that decides which action to take, and a Critic that estimates the value of the current state (the value function)
Behavioral Cloning (BC): A form of imitation learning where a model is trained via supervised learning to mimic expert actions given states
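Behavioral cloning reduces to ordinary supervised learning on expert (state, action) pairs. A toy sketch, assuming NumPy and a hypothetical linear policy fit by gradient descent on mean-squared error; the synthetic "expert" here is invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
states = rng.normal(size=(256, 4))          # states visited by the expert
expert_w = np.array([0.5, -1.0, 2.0, 0.3])  # hypothetical expert policy weights
actions = states @ expert_w                 # expert actions to mimic

w = np.zeros(4)
for _ in range(200):
    pred = states @ w
    grad = states.T @ (pred - actions) / len(states)  # d(MSE)/dw
    w -= 0.1 * grad                                   # plain gradient step

# No rewards or environment interaction were used, only expert demonstrations.
print(np.allclose(w, expert_w, atol=1e-2))
```

Note that nothing in this loop touches a reward signal; that is precisely what distinguishes imitation learning from RL.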
Catastrophic Forgetting: A phenomenon where a neural network abruptly loses previously learned knowledge (here, the expert behavior) when training on new data
Rollout: A sequence of interactions (state, action, reward) generated by running a policy in the environment
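A rollout is just the trajectory recorded while a policy interacts with an environment. A minimal sketch in plain Python; `rollout`, `env_step`, and the toy corridor environment are illustrative names and assumptions, not from the source:

```python
def rollout(env_step, policy, initial_state, horizon):
    """Run `policy` for `horizon` steps, recording (state, action, reward).

    env_step(state, action) -> (next_state, reward) is a hypothetical
    environment transition function; policy(state) -> action.
    """
    trajectory = []
    state = initial_state
    for _ in range(horizon):
        action = policy(state)
        next_state, reward = env_step(state, action)
        trajectory.append((state, action, reward))
        state = next_state
    return trajectory

# Toy corridor: state is a position, the action shifts it, reward is -|position|.
traj = rollout(
    env_step=lambda s, a: (s + a, -abs(s + a)),
    policy=lambda s: -1 if s > 0 else 1,
    initial_state=3,
    horizon=4,
)
print(traj)  # [(3, -1, -2), (2, -1, -1), (1, -1, 0), (0, 1, -1)]
```

Batches of such trajectories are what on-policy algorithms like PPO consume for each update.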
GAE: Generalized Advantage Estimation—a method that trades a small amount of bias for a large reduction in variance in advantage estimates for policy gradients, interpolating between one-step TD errors and full Monte Carlo returns via a parameter λ
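The standard GAE recursion is short enough to show directly. A sketch in plain Python, assuming the usual formulation (δ_t = r_t + γV(s_{t+1}) − V(s_t), A_t = Σ_l (γλ)^l δ_{t+l}); the function name is illustrative:

```python
def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation.

    rewards: r_0 .. r_{T-1}
    values:  V(s_0) .. V(s_T)  (one extra entry to bootstrap the final step)
    """
    T = len(rewards)
    advantages = [0.0] * T
    running = 0.0
    # Sweep backwards so each advantage accumulates discounted future deltas.
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

# lam=0 reduces to the one-step TD error (low variance, more bias);
# lam=1 reduces to the Monte Carlo return minus V (high variance, no bias).
print(gae([1.0, 1.0], [0.0, 0.0, 0.0], gamma=1.0, lam=1.0))  # [2.0, 1.0]
```

Setting λ between 0 and 1 is what lets practitioners tune the bias/variance tradeoff per task.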
Residual Connection: A skip connection in a neural network that allows gradients/information to bypass intermediate layers, often used here to preserve expert features
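The skip path is simply an addition around a learned transformation. A minimal sketch, assuming NumPy; `residual_block` and the zero-initialized layer are illustrative:

```python
import numpy as np

def residual_block(x, layer):
    """y = x + layer(x): the skip path carries x (e.g. expert features)
    around the learned transformation unchanged."""
    return x + layer(x)

# With a zero-initialized layer the block is an exact identity, so the
# input's information passes through untouched at the start of training.
zero_layer = lambda x: np.zeros_like(x)
x = np.array([1.0, 2.0, 3.0])
print(residual_block(x, zero_layer))  # [1. 2. 3.]
```

This identity-at-initialization property is one reason residual connections help preserve pretrained (here, expert) behavior while new components train.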
PIRL: Pretraining with Imitation and RL fine-tuning—a baseline method where the actor is frozen while the critic is trained, before joint optimization