Communication Protocol

As shown in the Quick Start section with client, the platform uses an HTTP-based communication mechanism to exchange structured data between the simulation controller and the inference model. You can extend the example in client to integrate your own model controller for benchmark evaluation.

Outgoing Data

At each simulation timestep, the system sends the following structured data over HTTP. In practice, you might receive additional fields; those are deprecated and will be removed in future releases.

data = {
    "camera_data": camera_data,  # dict, camera observations
    "instruction": instruction,  # str, task instruction
    "joint_position_state": joint_position_state,  # np.ndarray, current joint positions
    "ee_pose_state": ee_pose_state,  # list[np.ndarray] or list[list[np.ndarray]], end-effector pose(s)
    "timestep": step,  # int, current simulation step
    "reset": reset,  # bool, whether to reset the environment or model
}

Field Definitions

Field Name	Type	Description
`camera_data`	dict	Data captured from the robot-mounted cameras.
`instruction`	string	The natural language command or task instruction given to the agent.
`joint_position_state`	`np.ndarray` (shape `(9,)`)	Current joint angles of the robot arm (e.g., Franka).
`ee_pose_state`	list[`np.ndarray`] or list[list[`np.ndarray`]]	End-effector pose(s) of the robot in the robot frame. For single-arm robots, this is a list of two arrays: translation (3D) and orientation (4D, scalar-first quaternion). For dual-arm robots, it is a nested list with one entry each for the left and right arms, each containing its own translation and orientation.
`timestep`	int	The current simulation step index in the rollout.
`reset`	bool	Whether the environment or model should reset.
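
Because `ee_pose_state` is flat for single-arm robots and nested for dual-arm robots, client code may want to normalize it before use. The following is a minimal sketch, not part of the platform: `split_ee_pose_state` is a hypothetical helper name, and it assumes the payload has already been decoded into Python lists or NumPy arrays.

```python
import numpy as np


def split_ee_pose_state(ee_pose_state):
    """Return a list of (translation, quaternion) pairs, one per arm.

    Hypothetical helper; assumes the layout described in the table above.
    """
    # Single-arm: [translation, quaternion], so ee_pose_state[0][0] is a scalar.
    # Dual-arm: [[t, q], [t, q]], so ee_pose_state[0][0] is itself an array.
    if np.ndim(ee_pose_state[0][0]) == 0:
        arms = [ee_pose_state]
    else:
        arms = list(ee_pose_state)

    poses = []
    for translation, quaternion in arms:
        t = np.asarray(translation, dtype=float)
        q = np.asarray(quaternion, dtype=float)  # scalar-first: (w, x, y, z)
        if t.shape != (3,) or q.shape != (4,):
            raise ValueError(f"unexpected pose shapes: {t.shape}, {q.shape}")
        poses.append((t, q))
    return poses
```

With this helper, single-arm and dual-arm observations can be handled by the same loop over the returned pairs.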

The model or controller must then respond with a structured action message.

Camera Data Structure

Each entry in the camera_data dictionary contains the following fields:

Field Name	Type	Description
`p`	`np.ndarray` (shape `(3,)`)	Camera position in world coordinates.
`q`	`np.ndarray` (shape `(4,)`)	Camera orientation in world coordinates (quaternion, scalar-first).
`rgb`	`np.ndarray`	RGB image array. The resolution depends on the configuration in `configs/cameras/`.
`depth`	`np.ndarray`	Depth image array.
`intrinsics_matrix`	`np.ndarray` (shape `(3, 3)`)	Camera intrinsic matrix.
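
As an illustration of how `depth` and `intrinsics_matrix` fit together, the sketch below back-projects one pixel into a 3D point in the camera frame using the standard pinhole model. It rests on assumptions the source does not confirm: that the intrinsic matrix has the usual form `[[fx, 0, cx], [0, fy, cy], [0, 0, 1]]`, that `depth` is indexed as `[row, column]`, and that depth values are distances along the optical axis. `backproject_pixel` is a hypothetical helper name.

```python
import numpy as np


def backproject_pixel(depth, intrinsics_matrix, u, v):
    """Back-project pixel (u = column, v = row) to a camera-frame 3D point.

    Assumes a pinhole camera and depth measured along the optical axis.
    """
    K = np.asarray(intrinsics_matrix, dtype=float)
    fx, fy = K[0, 0], K[1, 1]  # focal lengths in pixels
    cx, cy = K[0, 2], K[1, 2]  # principal point in pixels
    z = float(np.asarray(depth)[v, u])
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.array([x, y, z])
```

Transforming the point into world coordinates would additionally require the camera pose (`p`, `q`) and the camera's axis convention, which is not specified here.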

Returned Action

The model must return an action dictionary, which can represent either joint positions or end-effector (ee) poses.

Joint Position Mode

Currently, only delta joint positions are supported. If your model outputs absolute joint positions, convert them to deltas before returning the action.

You need to include the control_type field in your returned content:

{
    "control_type": "joint_position",
    "action": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
}
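
If your model predicts absolute joint targets, the conversion to deltas can be sketched as follows for a Franka with Panda Hand (7 arm joints followed by 2 gripper values). This is a minimal illustration under stated assumptions: `joint_action_from_absolute` is a hypothetical helper, the first 7 entries of `joint_position_state` are taken to be the arm joints, and the gripper values are passed through unchanged as an open/close command rather than converted to deltas.

```python
import numpy as np


def joint_action_from_absolute(joint_position_state, target_arm_positions, gripper_command):
    """Build a joint_position action from absolute arm targets (Franka + Panda Hand)."""
    current = np.asarray(joint_position_state, dtype=float)
    target = np.asarray(target_arm_positions, dtype=float)
    gripper = np.asarray(gripper_command, dtype=float)
    if current.shape != (9,) or target.shape != (7,) or gripper.shape != (2,):
        raise ValueError("expected shapes (9,), (7,) and (2,)")

    # Delta for the arm joints only; assumes arm joints come first in the state.
    delta_arm = target - current[:7]
    action = np.concatenate([delta_arm, gripper])
    return {"control_type": "joint_position", "action": action.tolist()}
```

For example, calling the helper with `gripper_command=[0.04, 0.04]` requests a fully open Panda gripper while moving the arm toward the target.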

Supported Formats

Franka with Panda Hand
- Array of length 9
- First 7 elements: delta arm joint positions
- Last 2 elements: gripper control
  - [0.04, 0.04] = fully open, [0.0, 0.0] = fully closed
Franka with RoboTiq Hand
- Array of length 13
- First 7 elements: delta arm joint positions
- Last 6 elements: gripper control
  - [0.0, 0.0, 0.0, 0.0, 0.0, 0.0] = fully open
  - [0.7853, 0.7853, -0.7853, -0.7853, -0.7853, -0.7853] = fully closed
Aloha
- Array of length 16
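- Indices 0:6 and 8:14 (end-exclusive): delta joint positions of the left and right arms, respectively
- Indices 6:8 and 14:16 (end-exclusive): gripper control for the left and right arms, respectively
  - [0.05, 0.05] = fully open, [0.0, 0.0] = fully closed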
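
The layouts above can be captured in a small lookup table to check an action's length and to split it into arm and gripper parts before sending. This is only a sketch: the robot keys (`"franka_panda"`, `"franka_robotiq"`, `"aloha"`) and the function name are hypothetical labels chosen for this example, not identifiers defined by the platform.

```python
# Hypothetical layout table derived from the formats listed above.
JOINT_LAYOUTS = {
    "franka_panda": {"length": 9, "arms": [slice(0, 7)], "grippers": [slice(7, 9)]},
    "franka_robotiq": {"length": 13, "arms": [slice(0, 7)], "grippers": [slice(7, 13)]},
    "aloha": {
        "length": 16,
        "arms": [slice(0, 6), slice(8, 14)],
        "grippers": [slice(6, 8), slice(14, 16)],
    },
}


def split_joint_action(robot, action):
    """Validate a flat joint action and return (arm_parts, gripper_parts)."""
    layout = JOINT_LAYOUTS[robot]
    action = list(action)
    if len(action) != layout["length"]:
        raise ValueError(
            f"{robot} expects {layout['length']} values, got {len(action)}"
        )
    arm_parts = [action[s] for s in layout["arms"]]
    gripper_parts = [action[s] for s in layout["grippers"]]
    return arm_parts, gripper_parts
```

Running such a check on the client side surfaces length mismatches before the action reaches the simulator.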

End-Effector Pose Mode

In this mode, delta ee poses are supported. If your model outputs absolute poses, convert them into deltas accordingly.

You need to include the control_type field in your returned content:

{
    "control_type": "ee_pose",
    "action": [
        [0.001, 0.001, 0.001],
        [1.0, 0.0, 0.0, 0.0],
        [0.04, 0.04]
    ]
}
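
For models that output absolute end-effector poses, one possible conversion to a delta pose is sketched below for a single-arm robot. The source does not state how the platform composes the delta rotation, so this sketch assumes `q_target = q_delta * q_current` (the delta is applied on the left, in the robot frame) and unit, scalar-first quaternions. If the platform composes on the right, the multiplication order must be swapped. All function names here are hypothetical.

```python
import numpy as np


def quat_conjugate(q):
    """Conjugate of a scalar-first quaternion (inverse for unit quaternions)."""
    w, x, y, z = q
    return np.array([w, -x, -y, -z])


def quat_multiply(a, b):
    """Hamilton product a * b for scalar-first quaternions."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return np.array([
        aw * bw - ax * bx - ay * by - az * bz,
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
    ])


def ee_action_from_absolute(current_pose, target_pose, gripper_command):
    """Build an ee_pose action from absolute current and target poses.

    Each pose is (translation, quaternion). Assumes left-composed deltas.
    """
    cur_t, cur_q = (np.asarray(v, dtype=float) for v in current_pose)
    tgt_t, tgt_q = (np.asarray(v, dtype=float) for v in target_pose)

    delta_t = tgt_t - cur_t
    delta_q = quat_multiply(tgt_q, quat_conjugate(cur_q))
    delta_q = delta_q / np.linalg.norm(delta_q)
    if delta_q[0] < 0:
        delta_q = -delta_q  # q and -q encode the same rotation

    return {
        "control_type": "ee_pose",
        "action": [delta_t.tolist(), delta_q.tolist(), [float(g) for g in gripper_command]],
    }
```

When the target equals the current pose, the result is a zero translation and the identity quaternion `[1.0, 0.0, 0.0, 0.0]`, matching the "no rotation" value used in the example above.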

Supported Formats

Franka with Panda Hand
- Tuple of length 3:
  - 3D translation,
  - 4D quaternion orientation (scalar-first),
  - 2D gripper control ([0.04, 0.04] = open, [0.0, 0.0] = closed)
Franka with RoboTiq Hand
- Tuple of length 3:
  - 3D translation,
  - 4D quaternion orientation (scalar-first),
  - 6D gripper control ([0.0,...] = open, [0.7853,...] = closed)
Aloha
- Tuple of length 2: one entry each for the left and right arms
- Each arm’s tuple includes:
  - 3D translation,
  - 4D quaternion orientation (scalar-first),
  - 2D gripper control ([0.05, 0.05] = open, [0.0, 0.0] = closed)
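
The single-arm and dual-arm structures above can be assembled with one helper, sketched here under assumptions: `make_ee_action` is a hypothetical name, dual-arm robots are assumed to use the same `"ee_pose"` control type, and the payload is assumed to be JSON-serializable nested lists.

```python
import json


def make_ee_action(arm_commands):
    """Build an ee_pose action from per-arm (translation, quaternion, gripper) tuples.

    One tuple yields the single-arm format; two tuples yield the nested
    left/right format described for Aloha.
    """
    arms = []
    for translation, quaternion, gripper in arm_commands:
        if len(translation) != 3 or len(quaternion) != 4:
            raise ValueError("expected a 3D translation and a 4D quaternion")
        arms.append([
            [float(v) for v in translation],
            [float(v) for v in quaternion],  # scalar-first (w, x, y, z)
            [float(v) for v in gripper],
        ])
    if len(arms) not in (1, 2):
        raise ValueError("expected one or two arm commands")
    payload = arms[0] if len(arms) == 1 else arms
    return {"control_type": "ee_pose", "action": payload}


# Example: keep both Aloha arms in place with open grippers.
hold = ([0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0], [0.05, 0.05])
aloha_action = make_ee_action([hold, hold])
encoded = json.dumps(aloha_action)
```

The identity quaternion with a zero translation corresponds to a delta pose that requests no motion.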