Intra-ARM: Intra-Agent Rigor Module—a component that validates individual agent actions (e.g., checking if code compiles) before they are finalized
Inter-ARM: Inter-Agent Rigor Module—a component that coordinates workflow between agents, managing task partitioning and state transitions
Architect Agent: High-level planner agent responsible for designing the experiment, defining variables, and analyzing final results
Technician Agent: Low-level executor agent responsible for writing code, setting up environments, and running trials
Experiment Knowledge Module: A structured database (DAG-like history) that tracks the state of the experiment, guarding against LLM memory loss and hallucination
SWE-Bench: Software Engineering Benchmark, a standard benchmark that evaluates LLMs on resolving real-world GitHub issues
process supervision: A technique where feedback is provided at intermediate steps of reasoning/execution rather than just on the final output
DAG: Directed Acyclic Graph, a graph whose directed edges never form a cycle; used here to track the history of experimental changes
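
To make the DAG and Experiment Knowledge Module entries concrete, here is a minimal sketch of an append-only experiment history. It is an illustration under stated assumptions, not the system's actual implementation: the names `ExperimentNode`, `ExperimentHistory`, `add`, and `lineage`, and the example node ids, are all hypothetical. The sketch relies on one property: a node may only reference parents that are already recorded, so a cycle cannot form.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentNode:
    """One recorded change in the experiment (hypothetical schema)."""
    node_id: str
    description: str
    parents: tuple[str, ...] = ()


class ExperimentHistory:
    """Append-only DAG: a node may only point at nodes that already exist."""

    def __init__(self) -> None:
        self._nodes: dict[str, ExperimentNode] = {}

    def add(self, node_id: str, description: str,
            parents: tuple[str, ...] = ()) -> ExperimentNode:
        if node_id in self._nodes:
            raise ValueError(f"duplicate node id: {node_id}")
        missing = [p for p in parents if p not in self._nodes]
        if missing:
            # Rejecting unknown parents is what keeps the graph acyclic.
            raise ValueError(f"unknown parent(s): {missing}")
        node = ExperimentNode(node_id, description, tuple(parents))
        self._nodes[node_id] = node
        return node

    def lineage(self, node_id: str) -> list[str]:
        """Return all ancestor ids of node_id in breadth-first order."""
        seen: list[str] = []
        queue = list(self._nodes[node_id].parents)
        while queue:
            current = queue.pop(0)
            if current in seen:
                continue  # reached through another path already
            seen.append(current)
            queue.extend(self._nodes[current].parents)
        return seen


# Hypothetical history: planning, setup, one trial, then analysis.
history = ExperimentHistory()
history.add("plan", "Architect defines variables")
history.add("setup", "Technician sets up environment", parents=("plan",))
history.add("trial-1", "Technician runs first trial", parents=("setup",))
history.add("analysis", "Architect analyzes results", parents=("plan", "trial-1"))
print(history.lineage("analysis"))  # ['plan', 'trial-1', 'setup']
```

Because the history is stored outside the model's context, an agent can query `lineage` to recover how a result was produced instead of relying on what the LLM remembers.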