PRTS: A Primitive Reasoning and Tasking System via Contrastive Representations

📝 Paper Summary

Vision-Language-Action (VLA) models, Goal-Conditioned Reinforcement Learning, Robotic Foundation Models

PRTS reformulates VLA pretraining using goal-conditioned contrastive reinforcement learning, enabling models to learn goal-reachability awareness directly from offline trajectories without requiring explicit reward annotations.

Core Problem

Existing Vision-Language-Action (VLA) models pretrain almost exclusively via supervised behavior cloning, which teaches static semantic understanding but fails to capture the temporal, goal-reaching nature of robotic trajectories.

Why it matters:

Robotic execution is inherently a goal-reaching process over time, requiring temporal awareness of how close the current state is to the objective
Without quantitative goal-reachability awareness, models cannot evaluate execution difficulty or distinguish difficult-yet-reachable states from easy-yet-erroneous ones
Prior value-augmented approaches require costly manual reward annotations, curated pairwise progress labels, or auxiliary value networks that double training costs

Concrete Example: When an agent is trained purely on behavior cloning, it only learns 'what to do' (semantic reasoning). If it encounters a state slightly off the optimal path, it lacks a quantitative estimate of 'how likely it is to still reach the goal.' In contrast, PRTS learns a unified embedding space where the inner product of state-action and goal embeddings explicitly estimates this probability, guiding more robust task execution.
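
As a rough illustration of the scoring described above, the sketch below ranks two candidate state-action embeddings by their inner product with a goal embedding. The vectors, the two-dimensional embedding size, and the name `reachability_score` are made up for illustration. In PRTS the embeddings would come from the VLM backbone, and this summary does not specify how the raw score is mapped to a probability.

```python
import numpy as np

def reachability_score(sa_emb: np.ndarray, goal_emb: np.ndarray) -> float:
    """Inner product of a state-action embedding and a goal embedding."""
    return float(np.dot(sa_emb, goal_emb))

# Toy embeddings (hypothetical values, not taken from the paper).
goal = np.array([1.0, 0.1])
on_path = np.array([1.0, 0.0])    # state-action pair close to the demonstrated path
off_path = np.array([0.2, 0.9])   # state-action pair that has drifted away

scores = {
    "on_path": reachability_score(on_path, goal),
    "off_path": reachability_score(off_path, goal),
}
# The candidate with the larger inner product is treated as more likely to reach the goal.
best = max(scores, key=scores.get)
```

A pure behavior-cloning policy has no analogue of `scores`: it outputs an action for either state but cannot say which state is in better shape.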

Key Novelty

Goal-Conditioned Contrastive Representation Learning for VLAs

Treats language instructions as shared goals across a trajectory, using a temporal weighting scheme to emulate the geometric sampling used in standard contrastive reinforcement learning
Appends specialized token blocks evaluated with a custom role-aware causal mask to compute goal reachability and predict actions simultaneously in a single forward pass
Extracts dense goal-reachability supervision directly from trajectory data without any manual reward labels, folding value-based planning directly into the vision-language backbone

Architecture

A unified VLA architecture that combines auto-regressive (AR) token generation and contrastive reinforcement learning (CRL) in a single forward pass, using a structurally sparse role-aware attention mask

Breakthrough Assessment

8/10

Offers a highly efficient, single-forward-pass integration of goal-conditioned contrastive reinforcement learning into VLM pretraining, bypassing the need for explicit reward labels while demonstrably improving long-horizon reasoning.

⚙️ Technical Details

Problem Definition

Setting: Language-conditioned robotic manipulation task with visual observations modeled as a Markov Decision Process under an imitation learning paradigm

Inputs: Natural language instruction l, multi-view RGB images I, and robot proprioceptive state q_t

Outputs: Low-level continuous control commands (action chunk a_{t:t+H})

Pipeline Flow

Input Encoding: Vision & State Tokens + Instruction Tokens
Reasoning & Representation: VLM Backbone (Qwen3-VL) with Role-Aware Causal Mask
Action Generation: Flow-matching Action Expert (DiT)

System Modules

Vision & State Encoder

Encodes multi-view egocentric/wrist RGB images and discretizes robot proprioceptive states

Model or implementation: Qwen ViT Encoder

VLM Backbone

Jointly performs Auto-Regressive (AR) action prediction and extracts Contrastive Reinforcement Learning (CRL) representations

Model or implementation: Qwen3-VL

Action Expert

Translates high-level reasoning from the VLM into high-frequency continuous control commands

Model or implementation: Diffusion Transformer (DiT)

Novel Architectural Elements

Appends specialized auxiliary token blocks (<CRL_action> and <CRL_goal>) to the end of the standard VLM input sequence
Role-aware causal mask: AR actions retain standard causal attention; <CRL_action> tokens attend only to vision/state/themselves; <CRL_goal> tokens attend only to themselves, enforcing strict representation isolation within a single forward pass
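
The attention rules above can be sketched as a boolean mask. This is a minimal illustration, not the paper's kernel. The role ids, the sequence layout (regular tokens first, CRL blocks appended at the end), and the choice to keep attention causal inside each CRL block are assumptions made for the example. `mask[i, j]` being `True` means query token `i` may attend to key token `j`.

```python
import numpy as np

# Illustrative role ids, one per token position (names are assumptions).
VISION_STATE, INSTRUCTION, AR_ACTION, CRL_ACTION, CRL_GOAL = range(5)

def role_aware_mask(roles) -> np.ndarray:
    """Build an (n, n) boolean mask from per-token roles."""
    roles = np.asarray(roles)
    n = roles.shape[0]
    causal = np.tril(np.ones((n, n), dtype=bool))  # key index <= query index
    q = roles[:, None]  # role of each query (rows)
    k = roles[None, :]  # role of each key (columns)

    # <CRL_action> queries: only vision/state keys and their own block.
    crl_action_keys = (k == VISION_STATE) | (k == CRL_ACTION)
    # <CRL_goal> queries: only their own block.
    crl_goal_keys = k == CRL_GOAL
    # All other queries keep plain causal attention.
    allowed_keys = np.where(
        q == CRL_ACTION,
        crl_action_keys,
        np.where(q == CRL_GOAL, crl_goal_keys, True),
    )
    return causal & allowed_keys

# Two vision/state tokens, two instruction tokens, one AR action token,
# then one <CRL_action> and one <CRL_goal> token.
roles = [VISION_STATE, VISION_STATE, INSTRUCTION, INSTRUCTION,
         AR_ACTION, CRL_ACTION, CRL_GOAL]
mask = role_aware_mask(roles)
```

In this toy layout the `<CRL_action>` row cannot see the instruction tokens, so the state-action representation is not contaminated by the goal, and the `<CRL_goal>` row sees only itself.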

Modeling

Base Model: Qwen3-VL

Training Method: Two-stage training: unified VLM pre-training (CRL + BC) followed by post-training the DiT action expert

Objective Functions:

Purpose: Teach the model to identify the correct language instruction for a given state-action pair (task-level discrimination).
Formally: State-Action to Language loss L^{sa->l} using cross-entropy weighted by temporal distance gamma^{T-t}.

Purpose: Teach the model to encode the temporal distance to task completion for a specific language goal.
Formally: Language to State-Action loss L^{l->sa} using soft targets derived from temporal distances.

Purpose: Standard behavior cloning on discrete auto-regressive action tokens.
Formally: Cross-entropy loss L_BC.

Purpose: Train the action expert to generate continuous action chunks.
Formally: Conditional flow-matching loss L_flow over continuous actions.
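
The two contrastive objectives listed above can be sketched in NumPy as follows. Only the general shape comes from the summary: a cross-entropy over instructions weighted by gamma^{T-t}, and a language-to-state-action loss with soft targets derived from temporal distance. The exact soft-target construction (here, targets proportional to gamma^{T-t} over the samples that share an instruction), the normalizations, and all function and argument names are assumptions for illustration.

```python
import numpy as np

def log_softmax(x: np.ndarray, axis: int) -> np.ndarray:
    """Numerically stable log-softmax along one axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def crl_losses(sa_emb, goal_emb, goal_idx, steps_to_end, gamma=0.99):
    """
    sa_emb:       (N, d) state-action embeddings
    goal_emb:     (G, d) one embedding per distinct instruction
    goal_idx:     (N,)   instruction index of each sample
    steps_to_end: (N,)   T - t for each sample
    Returns (loss_sa2l, loss_l2sa).
    """
    sa_emb = np.asarray(sa_emb, dtype=float)
    goal_emb = np.asarray(goal_emb, dtype=float)
    goal_idx = np.asarray(goal_idx)
    rows = np.arange(sa_emb.shape[0])

    logits = sa_emb @ goal_emb.T                         # (N, G) inner products
    w = gamma ** np.asarray(steps_to_end, dtype=float)   # temporal weights

    # sa -> l: weighted cross-entropy of picking the right instruction.
    nll = -log_softmax(logits, axis=1)[rows, goal_idx]
    loss_sa2l = float((w * nll).sum() / w.sum())

    # l -> sa: per instruction, a soft target over samples (assumed form).
    target = np.zeros_like(logits)
    target[rows, goal_idx] = w
    col_sum = target.sum(axis=0, keepdims=True)
    has_samples = col_sum > 0
    target = np.divide(target, col_sum,
                       out=np.zeros_like(target), where=has_samples)
    log_p_sa = log_softmax(logits, axis=0)
    loss_l2sa = float(-(target * log_p_sa).sum() / has_samples.sum())
    return loss_sa2l, loss_l2sa
```

With untrained (all-zero) embeddings, four samples, and two instructions, both losses reduce to the entropy of a uniform guess, log 2 and log 4 respectively, which is a convenient sanity check before training.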

Adaptation: Full fine-tuning of VLM backbone during pre-training; continuous-action expert attached and fine-tuned during post-training

Training Data:

Pre-trained on over 167 billion tokens of diverse manipulation and embodied-reasoning data

Compute: Pre-training executed on 64 H100 GPUs for one week, leveraging sequence packing optimizations

Comparison to Prior Work

vs. OpenVLA / RT-2: Integrates contrastive reinforcement learning for goal-reachability awareness rather than relying entirely on static behavior cloning
vs. pi_0.6_star / VLAC: Extracts dense value supervision natively from trajectory sequences within a single forward pass, eliminating the need for manual reward annotations or separate auxiliary value networks
vs. RoboFlamingo [not cited in paper]: Employs a single shared language-goal token structure with temporal weighting rather than cross-attention over explicitly tracked temporal state histories for imitation

Limitations

Assumes the language instruction remains constant across all timesteps of a trajectory, so the method may struggle with complex, multi-stage tasks that require dynamically changing sub-goals
The implicit dense goal-reachability signal assumes that the offline demonstration trajectories roughly follow an optimal or near-optimal expert policy
No detailed statistical significance tests or exact experimental ablation tables were provided in the methodology snippet

Reproducibility

Code: https://github.com/TeleHuman/PRTS

Code and model weights (TeleEmbodied/PRTS-4B) are publicly available. Exact pre-training hyperparameters (learning rate, batch size) and granular benchmark score tables were not included in the provided methodology text.

📊 Experiments & Results

Evaluation Setup

Extensive validation in simulation environments and zero-shot novel-instruction physical deployment across single and dual-arm robotic systems

Benchmarks:

LIBERO (Robotic manipulation)
LIBERO-Pro (Robotic manipulation)
LIBERO-Plus (Robotic manipulation)
SimplerEnv (WidowX) (Robotic manipulation)
Real-world suite (Complex real-world dual/single-arm manipulation) [New]

Metrics:

Execution success rate
Long-horizon planning robustness
Statistical methodology: Not explicitly reported in the paper

Main Takeaways

PRTS achieves state-of-the-art performance across major simulation benchmarks including LIBERO, LIBERO-Pro, LIBERO-Plus, and SimplerEnv
Substantial improvements are observed in long-horizon, contact-rich environments and zero-shot generalization to novel instructions compared to pure behavior cloning models
Robust real-world performance validated across 14 complex manipulation tasks on dual-arm RealMan and single-arm Flexiv hardware platforms, showcasing strong recovery capabilities under human interventions
Empirically confirms that equipping the reasoning backbone with an intrinsic, quantitative sense of goal reachability significantly enhances general-purpose robotic policy execution

📚 Prerequisite Knowledge

Prerequisites

Vision-Language-Action (VLA) models
Behavior Cloning (BC) in robotics
Goal-Conditioned Reinforcement Learning (GCRL)

Key Terms

VLA: Vision-Language-Action models that integrate visual perception and text understanding to directly output robotic control actions

BC: Behavior Cloning—training a model to directly mimic expert actions using supervised learning

CRL: Contrastive Reinforcement Learning, which estimates how likely a goal is to be reached by pulling the embeddings of state-action pairs and the goals they actually reach together, and pushing the embeddings of randomly sampled goals away

GCRL: Goal-Conditioned Reinforcement Learning—a framework where an agent learns to achieve multiple specific goals dynamically rather than maximizing a single global reward

Discounted State Occupancy Measure: The probability distribution of states an agent will visit in the future, with states further in the future discounted exponentially
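
A small sketch of how standard contrastive RL draws a "future state" from this measure within a recorded trajectory: the offset into the future is sampled from a geometric distribution with parameter 1 - gamma. The function name and the choice to clamp at the last step of the trajectory are assumptions for the example. PRTS, per the summary, emulates this sampling with temporal weights rather than sampling explicitly.

```python
import numpy as np

def sample_future_index(t: int, T: int, gamma: float,
                        rng: np.random.Generator) -> int:
    """Sample a future timestep index for a trajectory of length T.

    The offset k >= 1 satisfies P(k) = (1 - gamma) * gamma**(k - 1),
    so nearby future states are sampled more often than distant ones.
    Indices past the end are clamped to the final step T - 1.
    """
    k = int(rng.geometric(1.0 - gamma))
    return min(t + k, T - 1)

rng = np.random.default_rng(0)
# Draw future indices for the state at t = 2 in a 10-step trajectory.
draws = [sample_future_index(2, 10, 0.9, rng) for _ in range(200)]
```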

MoT: Mixture-of-Transformers—a dual-system architecture where one transformer handles high-level reasoning and another handles low-level control

FAST tokenization: A specific tokenization method to convert continuous robotic action chunks into discrete auto-regressive tokens for processing by a language model

DiT: Diffusion Transformer—a neural architecture used here as an action expert for generating continuous robotic trajectories

Flow-matching: A generative modeling technique used to predict continuous robotic action chunks by learning a vector field that transports a simple distribution to the target data distribution
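
A minimal sketch of a conditional flow-matching loss, assuming the common linear interpolation path between Gaussian noise and the data. The summary does not state which path or time distribution PRTS uses, so treat those as assumptions. `velocity_fn` is a hypothetical stand-in for the action expert and takes the noisy actions and the time value.

```python
import numpy as np

def flow_matching_loss(velocity_fn, actions: np.ndarray,
                       rng: np.random.Generator) -> float:
    """Mean squared error between predicted and target velocities.

    actions: (B, H, D) batch of continuous action chunks.
    """
    noise = rng.standard_normal(actions.shape)             # sample from the simple distribution
    t_shape = (actions.shape[0],) + (1,) * (actions.ndim - 1)
    t = rng.uniform(size=t_shape)                          # one time value per batch element
    x_t = (1.0 - t) * noise + t * actions                  # point on the straight path
    target_velocity = actions - noise                      # d x_t / d t along that path
    predicted = velocity_fn(x_t, t)
    return float(np.mean((predicted - target_velocity) ** 2))

# Usage with a dummy predictor that always outputs zero velocity.
actions = np.ones((3, 4, 2))
loss = flow_matching_loss(lambda x, t: np.zeros_like(x), actions,
                          np.random.default_rng(0))
```

At inference time, a model trained this way would integrate the learned velocity field from noise at t = 0 to an action chunk at t = 1.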

CuTe-FlashAttention: A highly optimized custom GPU attention kernel developed to support structurally sparse role-aware masks efficiently without performance degradation

AR: Auto-Regressive—generating tokens one by one, where each new token depends on all previously generated tokens