TPE: Trajectory Preference Evaluation—a mechanism that checks whether a generated reward function assigns higher returns to successful trajectories than to failed ones, used to filter generated reward code before training.
SAC: Soft Actor-Critic—an off-policy reinforcement learning algorithm used as the underlying solver for the tasks.
Meta-World: A benchmark of robotic manipulation tasks performed by a simulated Sawyer robot arm.
ManiSkill2: A simulation benchmark for generalizable robot manipulation skills.
Process Feedback: Feedback generated from the training curve statistics (return, success rate, sub-reward values) to guide the LLM.
Trajectory Feedback: Feedback generated by analyzing specific step-by-step details of successful and failed rollout trajectories.
Preference Feedback: Feedback provided when a reward function fails the TPE check, explaining that successful trajectories were not ranked higher than failed ones.
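As a concrete illustration of how TPE and Preference Feedback fit together, the check can be sketched as below. The function names, the per-step reward interface, and the strict pairwise comparison (minimum successful return must exceed maximum failed return) are assumptions for illustration, not the exact implementation.

```python
from typing import Callable, Sequence

Step = dict          # one environment step, e.g. {"r": ...} plus observations
Trajectory = Sequence[Step]

def tpe_check(
    reward_fn: Callable[[Step], float],
    success_trajs: Sequence[Trajectory],
    failed_trajs: Sequence[Trajectory],
) -> bool:
    """Pass iff every successful trajectory's return under reward_fn
    strictly exceeds every failed trajectory's return."""
    returns = lambda traj: sum(reward_fn(s) for s in traj)
    return (min(returns(t) for t in success_trajs)
            > max(returns(t) for t in failed_trajs))

def preference_feedback(passed: bool) -> str:
    """Hypothetical feedback message for the LLM when the TPE check fails."""
    if passed:
        return "TPE passed: successful trajectories outrank failed ones."
    return ("TPE failed: the reward function does not assign higher returns "
            "to successful trajectories than to failed ones; revise it so "
            "that task success is reflected in the total return.")
```

A reward function that ignores task success would fail this check on the rollout set, and the resulting message would be returned to the LLM as Preference Feedback before any training is spent on the flawed reward.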