Skip to content

Communication Protocol

As introduced in the Quick Start section through the use of client, the platform employs a http-based communication mechanism to exchange structured data between the simulation controller and the inference model. You can extend the example in client to integrate your own model controller for benchmark evaluation.

At each simulation timestep, the system sends the following structured data through http. Note that in practice, you might receive additional fields — those are deprecated and will be removed in future releases.

data = {
"camera_data": camera_data, // Dict, camera observations
"instruction": instruction, // String, task instruction
"joint_position_state": joint_position_state, // np.ndarray, current joint positions
"ee_pose_state": ee_pose_state, // list[np.ndarray] or list[list[np.ndarray]], end-effector pose(s)
"timestep": step, // Int, current simulation step
"reset": reset, // Bool, whether to reset the environment or model
}
Field NameTypeDescription
camera_datadictData captured from the robot-mounted cameras.
instructionstringThe natural language command or task instruction given to the agent.
joint_position_statenp.ndarray (shape (9,))Current joint angles of the robot arm (e.g., Franka).
ee_pose_statelist[np.ndarray] or list[list[np.ndarray]]End-effector pose(s) of the robot in the robot frame. For single-arm robots, this is a list of two arrays — translation (3D) and orientation (4D, scalar-first quaternion). For dual-arm robots, it is a nested list for left and right arms, each with its own translation and orientation.
timestepintThe current simulation step index in the rollout.
resetboolWhether the environment or model should reset.

The model or controller must then respond with a structured action message.

Each entry in the camera_data dictionary contains the following fields:

Field NameTypeDescription
pnp.ndarray (shape (3,))Camera position in world coordinates.
qnp.ndarray (shape (4,))Camera orientation in world coordinates (quaternion, scalar-first).
rgbnp.ndarrayRGB image array. The resolution depends on the configuration in configs/cameras/.
depthnp.ndarrayDepth image array.
intrinsics_matrixnp.ndarray (shape (3, 3))Camera intrinsic matrix.

The model must return an action dictionary, which can represent either joint positions or end-effector (ee) poses.

Currently, delta joint positions are supported. If your model outputs absolute joint positions, you should convert them to deltas.

You need to include the control_type field in your returned content:

{
"control_type": "joint_position",
"action": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
}
  • Franka with Panda Hand

    • Array of length 9
    • First 7 elements: delta arm joint positions
    • Last 2 elements: gripper control
    • [0.04, 0.04] = fully open, [0.0, 0.0] = fully closed
  • Franka with RoboTiq Hand

    • Array of length 13
    • First 7 elements: delta arm joint positions
    • Last 6 elements: gripper control
    • [0.0, 0.0, 0.0, 0.0, 0.0, 0.0] = fully open
    • [0.7853, 0.7853, -0.7853, -0.7853, -0.7853, -0.7853] = fully closed
  • Aloha

    • Array of length 16
    • 0:6 and 8:14: delta joint positions of left and right arms
    • 6:8 and 14:16: gripper control
    • [0.05, 0.05] = fully open, [0.0, 0.0] = fully closed

In this mode, delta ee poses are supported. If your model outputs absolute poses, convert them into deltas accordingly.

You need to include the control_type field in your returned content:

{
"control_type": "ee_pose",
"action": [
[0.001, 0.001, 0.001],
[1.0, 0.0, 0.0, 0.0],
[0.04, 0.04],
]
}
  • Franka with Panda Hand

    • Tuple of length 3:

      • 3D translation,
      • 4D quaternion orientation (scalar-first),
      • 2D gripper control ([0.04, 0.04] = open, [0.0, 0.0] = closed)
  • Franka with RoboTiq Hand

    • Tuple of length 3:

      • 3D translation,
      • 4D quaternion orientation (scalar-first),
      • 6D gripper control ([0.0,...] = open, [0.7853,...] = closed)
  • Aloha

    • Tuple of length 2: for left and right arms

    • Each arm’s tuple includes:

      • 3D translation,
      • 4D quaternion orientation (scalar-first),
      • 2D gripper control ([0.05, 0.05] = open, [0.0, 0.0] = closed)