CodeLLMs: Large Language Models specifically fine-tuned on code datasets to perform programming tasks
RLHF: Reinforcement Learning from Human Feedback—a method to align LLM outputs with specific goals using reward signals
PPO: Proximal Policy Optimization—a reinforcement learning algorithm that updates model policies in stable, bounded steps
ACECode: Aligning Code Correctness and Efficiency—the proposed framework using RL and execution feedback
EffiBench: A benchmark dataset designed to evaluate the execution efficiency of code generated by LLMs
pass@1: A metric measuring the percentage of problems where the first generated code solution is functionally correct
Actor-Critic: An RL architecture where the 'Actor' generates actions (code) and the 'Critic' estimates the value of those actions to guide training
PIE: A baseline method that improves code efficiency via instruction tuning on a dataset of efficient code snippets
SOAP: A baseline method that uses a two-stage inference process with execution feedback to optimize code
Instruction Tuning: Fine-tuning LLMs on datasets of (instruction, response) pairs to improve their ability to follow tasks
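The PPO entry above can be illustrated with a minimal sketch of its clipped surrogate objective, the mechanism that keeps policy updates "stable and bounded." This is a generic illustration, not code from the ACECode framework; the function name, NumPy usage, and epsilon value are illustrative choices.

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    """Clipped surrogate objective from PPO (to be maximized).

    The probability ratio between the new and old policies is clipped
    to [1 - eps, 1 + eps], so no single update can move the policy
    too far from the one that collected the data.
    """
    ratio = np.exp(logp_new - logp_old)          # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # Taking the minimum makes the objective pessimistic: large,
    # advantage-inflating ratio changes yield no extra reward.
    return np.minimum(unclipped, clipped).mean()

# When the new policy equals the old one, the ratio is 1 and the
# objective reduces to the mean advantage.
same = ppo_clip_objective(np.zeros(2), np.zeros(2), np.array([1.0, -1.0]))

# A ratio of 2 (log-ratio ln 2) with a positive advantage is clipped to 1.2.
capped = ppo_clip_objective(np.log(np.array([2.0])), np.zeros(1), np.ones(1))
```

In RLHF-style training, `advantages` would come from the Critic's value estimates, tying this objective to the Actor-Critic entry above.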
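The pass@1 entry above is the k=1 case of the pass@k family of metrics; a short sketch of the standard unbiased estimator (Chen et al., 2021) follows. The function name is hypothetical, but the formula is the one commonly used when n samples are drawn per problem and c of them pass the tests.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n-c, k) / C(n, k).

    n: total code samples generated for a problem
    c: number of samples that pass all tests
    k: budget of attempts being evaluated
    """
    if n - c < k:
        # Fewer failing samples than the budget: some draw of k
        # samples must contain a passing one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# For k = 1 the estimator reduces to the simple pass rate c / n:
# 3 correct out of 10 samples gives pass@1 = 0.3.
score = pass_at_k(n=10, c=3, k=1)
```

Benchmark-level pass@1 (as reported on EffiBench-style evaluations) is then the average of this per-problem score over all problems.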