PPO: Proximal Policy Optimization—a standard reinforcement learning algorithm that improves training stability by limiting how much the policy can change in a single update
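For concreteness, the "limiting" in PPO is the standard clipped surrogate objective, which bounds the probability ratio between the new and old policies:

```latex
L^{\mathrm{CLIP}}(\theta) =
\mathbb{E}_t\!\left[\min\!\left(r_t(\theta)\,\hat{A}_t,\;
\operatorname{clip}\!\left(r_t(\theta),\,1-\epsilon,\,1+\epsilon\right)\hat{A}_t\right)\right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}
```

Here \(\hat{A}_t\) is an advantage estimate and \(\epsilon\) (e.g. 0.2) sets how far the updated policy may move from the old one.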
Hard constraints: System designs where the RL agent is forced to execute LLM suggestions or where LLM outputs directly modify the reward function
Soft constraints: The proposed approach where LLM suggestions are provided as information (observations) that the agent can choose to utilize or ignore
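A minimal sketch of the soft-constraint idea, assuming the LLM suggestion has already been encoded as a feature vector (the function name and shapes here are illustrative, not from the original):

```python
import numpy as np

def augment_observation(env_obs: np.ndarray, llm_suggestion_vec: np.ndarray) -> np.ndarray:
    """Append an encoded LLM suggestion to the environment observation.

    The suggestion enters only as extra input features: the policy is free
    to use or ignore it (a soft constraint), rather than being forced to
    execute it or having its reward modified (hard constraints).
    """
    return np.concatenate([env_obs, llm_suggestion_vec])

# Hypothetical example: a 4-dim environment observation plus a 2-dim hint.
obs = np.zeros(4)
hint = np.ones(2)
aug = augment_observation(obs, hint)
print(aug.shape)  # (6,)
```

Because the suggestion is just another observation channel, a standard policy-gradient learner can be trained on the augmented input unchanged.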
Sparse-reward environment: A setting where the agent receives feedback (reward) very rarely, usually only upon completing a long, complex task
BabyAI: A gridworld benchmark suite for grounded language learning tasks, used here to test navigation and object interaction
POMDP: Partially Observable Markov Decision Process—a mathematical framework for decision-making where the agent cannot see the entire state of the world
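Formally (standard definition, not specific to this document), a POMDP is the tuple

```latex
\langle S, A, T, R, \Omega, O, \gamma \rangle
```

where \(S\) is the state space, \(A\) the action space, \(T(s' \mid s, a)\) the transition function, \(R(s, a)\) the reward, \(\Omega\) the set of observations, \(O(o \mid s', a)\) the observation function, and \(\gamma\) the discount factor. The agent receives an observation \(o \in \Omega\) rather than the true state \(s\).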
Chain-of-thought: A prompting technique where the LLM is asked to articulate its reasoning steps before producing a final answer