VITRA Teleoperation Dataset

Dataset Summary

This dataset contains real-world robot teleoperation demonstrations collected using a 7-DoF robotic arm equipped with a dexterous hand and a head-mounted RGB camera. Each episode provides synchronized numerical state/action data and video recordings.

Hardware Setup

Robot Arm: Realman Arm (7-DoF)
URDF: https://github.com/RealManRobot/rm_models/tree/main/RM75/urdf/RM75-6F
Dexterous Hand: XHand (12-DoF)
Head Camera: Intel RealSense D455

Data Modalities and Files

Each episode consists of two synchronized files:

<episode_id>.h5 — numerical data including robot states, actions, kinematics, and metadata
<episode_id>.mp4 — RGB video stream recorded from the head-mounted camera

The two files correspond one-to-one and share the same episode identifier.
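
As a minimal sketch of the one-to-one pairing described above, the helper below matches each .h5 file with the .mp4 that shares its stem. It assumes both files sit side by side in one directory, which the dataset description does not state; `find_episode_pairs` is a hypothetical name, and /meta/video_path inside the .h5 file may be the more reliable pointer to the video.

```python
from pathlib import Path


def find_episode_pairs(root):
    """Return {episode_id: (h5_path, mp4_path)} for episodes that have both files.

    Assumption: <episode_id>.h5 and <episode_id>.mp4 live in the same directory.
    """
    root = Path(root)
    pairs = {}
    for h5_path in sorted(root.glob("*.h5")):
        mp4_path = h5_path.with_suffix(".mp4")
        if mp4_path.is_file():
            # The file stem is the shared episode identifier.
            pairs[h5_path.stem] = (h5_path, mp4_path)
    return pairs
```

Episodes that lack a matching video are skipped silently here; a stricter loader might raise instead.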

Coordinate Frames

The dataset uses the following coordinate frames:

arm_base
Root frame of the arm kinematic chain, defined in the URDF.
ee_urdf
End-effector frame defined in the URDF (joint7).
hand_mount
Rigid mounting frame of the dexterous hand, including flange offset.
This frame is rotationally aligned with the human hand axis illustrated in Figure 1 (identity rotation).
head_camera
Optical center of the head-mounted RGB camera.

Figure 1. The hand_mount frame axes. Axis directions follow the human hand definition illustrated in the figure.

Arm Availability and Masks

The dataset format is compatible with both right-arm-only episodes and dual-arm episodes. The currently released dataset contains only right-arm data.

Missing arms/hands are filled with zeros to keep array shapes consistent.
Availability is indicated by:
- /meta/has_left, /meta/has_right (episode-level)
- /mask/* (frame-level)
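
Because missing data is zero-filled, a padded row cannot be told apart from a genuine all-zero reading by value alone. The sketch below filters a per-frame array with its frame-level mask; `select_valid` is a hypothetical helper and the arrays are toy values, not dataset contents.

```python
import numpy as np


def select_valid(values, mask):
    """Keep only the frames whose frame-level mask entry is True."""
    values = np.asarray(values)
    mask = np.asarray(mask, dtype=bool)
    if values.shape[0] != mask.shape[0]:
        raise ValueError("values and mask must share the frame dimension T")
    return values[mask]


# Toy example: 4 frames of a 7-joint arm, with frames 1 and 2 unavailable.
joints = np.arange(28, dtype=np.float64).reshape(4, 7)
mask = np.array([True, False, False, True])
valid = select_valid(joints, mask)  # frames 0 and 3 remain
```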

HDF5 File Structure

Each .h5 file follows the structure below:

/
├── meta/
│   ├── instruction                     string
│   ├── video_path                      string
│   ├── frame_count                     int # T
│   ├── fps                             float
│   ├── has_left                        bool
│   └── has_right                       bool
│
├── kinematics/
│   ├── left_ee_urdf_to_hand_mount      (4, 4) float64
│   ├── right_ee_urdf_to_hand_mount     (4, 4) float64
│   ├── head_camera_to_left_arm_base    (4, 4) float64
│   └── head_camera_to_right_arm_base   (4, 4) float64
│
├── observation/
│   └── camera/
│       └── intrinsics                   (3, 3) float64
│
├── state/
│   ├── left_arm_joint                   (T, Na) float64  # joint positions (rad)
│   ├── right_arm_joint                  (T, Na) float64
│   ├── left_hand_mount_pose             (T, 6)  float64  # hand_mount pose in arm_base: [x,y,z,rx,ry,rz]
│   ├── right_hand_mount_pose            (T, 6)  float64  # hand_mount pose in arm_base: [x,y,z,rx,ry,rz]
│   ├── left_hand_mount_pose_in_cam      (T, 6)  float64  # hand_mount pose in head_camera: [x,y,z,rx,ry,rz]
│   ├── right_hand_mount_pose_in_cam     (T, 6)  float64  # hand_mount pose in head_camera: [x,y,z,rx,ry,rz]
│   ├── left_hand_joint                  (T, Nh) float64
│   └── right_hand_joint                 (T, Nh) float64
│
├── action/
│   ├── left_arm_joint                   (T, Na) float64  # target joint positions (rad)
│   ├── right_arm_joint                  (T, Na) float64  # target joint positions (rad)
│   ├── left_hand_joint                  (T, Nh) float64  # target joint positions (rad)
│   └── right_hand_joint                 (T, Nh) float64  # target joint positions (rad)
│
└── mask/
    ├── left_arm                         (T,) bool
    ├── right_arm                        (T,) bool
    ├── left_hand                        (T,) bool
    └── right_hand                       (T,) bool
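
The layout above can be read with h5py roughly as follows. This is a sketch under assumptions: the scalar entries under /meta are taken to be scalar datasets (not HDF5 attributes), and `read_side` and `open_episode` are hypothetical helper names rather than part of any released loader.

```python
import numpy as np


def read_side(f, side="right"):
    """Collect one arm's arrays from an open episode file.

    `f` is expected to behave like an h5py.File opened for reading; any
    nested mapping of array-likes with the same keys works as well.
    """
    episode = {
        # Assumption: frame_count and fps are stored as scalar datasets.
        "frame_count": int(np.asarray(f["meta"]["frame_count"]).item()),
        "fps": float(np.asarray(f["meta"]["fps"]).item()),
        "arm_state": np.asarray(f["state"][f"{side}_arm_joint"]),
        "hand_state": np.asarray(f["state"][f"{side}_hand_joint"]),
        "hand_mount_pose": np.asarray(f["state"][f"{side}_hand_mount_pose"]),
        "arm_action": np.asarray(f["action"][f"{side}_arm_joint"]),
        "hand_action": np.asarray(f["action"][f"{side}_hand_joint"]),
        "arm_mask": np.asarray(f["mask"][f"{side}_arm"], dtype=bool),
        "hand_mask": np.asarray(f["mask"][f"{side}_hand"], dtype=bool),
    }
    # Every per-frame array should have T = frame_count rows.
    for name, value in episode.items():
        if isinstance(value, np.ndarray) and value.shape[0] != episode["frame_count"]:
            raise ValueError(f"{name} has {value.shape[0]} frames, expected T")
    return episode


def open_episode(path):
    """Open an episode file read-only (h5py is imported lazily)."""
    import h5py

    return h5py.File(path, "r")
```

A typical call would be `with open_episode(path) as f: episode = read_side(f, "right")`, where `path` points at an <episode_id>.h5 file.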

Pose Representation

For all *_hand_mount_pose and *_hand_mount_pose_in_cam entries, poses are represented as:

[x, y, z, rx, ry, rz]

where:

(x, y, z) denotes the position of the hand_mount frame (meters), expressed in arm_base for *_hand_mount_pose and in head_camera for *_hand_mount_pose_in_cam
(rx, ry, rz) denotes the rotation vector in axis–angle representation (radians)
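
To use such a 6-vector in matrix form, it can be expanded into a 4×4 homogeneous transform with Rodrigues' formula. The sketch below assumes the usual rotation-vector convention, in which the direction of (rx, ry, rz) is the rotation axis and its norm is the angle; `pose6_to_matrix` is a hypothetical helper name.

```python
import numpy as np


def pose6_to_matrix(pose):
    """Convert [x, y, z, rx, ry, rz] into a 4x4 homogeneous transform."""
    pose = np.asarray(pose, dtype=np.float64)
    translation, rotvec = pose[:3], pose[3:]
    angle = np.linalg.norm(rotvec)
    T = np.eye(4)
    if angle > 1e-12:
        k = rotvec / angle  # unit rotation axis
        K = np.array([
            [0.0, -k[2], k[1]],
            [k[2], 0.0, -k[0]],
            [-k[1], k[0], 0.0],
        ])  # skew-symmetric cross-product matrix of k
        # Rodrigues' rotation formula.
        T[:3, :3] = np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)
    T[:3, 3] = translation
    return T


# Toy pose: 90 degree rotation about z, translated to (1, 2, 3).
T_example = pose6_to_matrix([1.0, 2.0, 3.0, 0.0, 0.0, np.pi / 2])
```

If SciPy is available, `scipy.spatial.transform.Rotation.from_rotvec` performs the same rotation conversion.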

Transformation Notation

A homogeneous transformation matrix is denoted by T (4×4).

Subscript: reference frame (the coordinate system used for expression)
Superscript: target frame (the frame being described)

All subscripts and superscripts are written on the right-hand side of T.

Example: $T^{hand_mount}_{arm_base}$ represents the pose of hand_mount expressed in the arm_base frame.
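
Read this way, $T^{hand_mount}_{arm_base}$ also maps a point given in hand_mount coordinates to arm_base coordinates. A small numeric illustration with made-up values (the transform below is hypothetical, not taken from the dataset):

```python
import numpy as np

# Hypothetical transform: hand_mount sits 0.5 m along arm_base x, no rotation.
T_hand_mount_in_arm_base = np.eye(4)
T_hand_mount_in_arm_base[:3, 3] = [0.5, 0.0, 0.0]

# A point 0.1 m along hand_mount z, in homogeneous coordinates.
p_in_hand_mount = np.array([0.0, 0.0, 0.1, 1.0])

# The same point expressed in arm_base.
p_in_arm_base = T_hand_mount_in_arm_base @ p_in_hand_mount
```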

Kinematic Relations and Episode-Specific Transforms

Different flange hardware or camera mounting configurations may be used across episodes or arms. As a result:

All kinematic and extrinsic transforms must be read from the current episode and must not be assumed constant.

The hand mounting pose expressed in arm_base is computed as:

$$ T^{hand_mount}_{arm_base} = T^{ee_urdf}_{arm_base} \cdot T^{hand_mount}_{ee_urdf} $$

where:

$T^{ee_urdf}_{arm_base}$ is obtained via forward kinematics (FK) from the arm joint positions, corresponding to the URDF end-effector frame (joint7).
$T^{hand_mount}_{ee_urdf}$ is a fixed, episode-specific transform provided under /kinematics/*_ee_urdf_to_hand_mount.
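
The composition can be sketched in NumPy as below. Forward kinematics is not implemented here: the end-effector transform is assumed to come from a URDF-based FK routine of your choice, and `hand_mount_in_arm_base` is a hypothetical helper name. The numbers are made up for illustration.

```python
import numpy as np


def hand_mount_in_arm_base(T_ee_in_base, T_hand_in_ee):
    """Compose the FK result with the fixed mount transform.

    T_ee_in_base: (4, 4) or (T, 4, 4), the ee_urdf pose in arm_base from FK.
    T_hand_in_ee: (4, 4), read from /kinematics/*_ee_urdf_to_hand_mount.
    The fixed mount transform broadcasts across a stack of frames.
    """
    return np.asarray(T_ee_in_base) @ np.asarray(T_hand_in_ee)


# Toy values: end effector 0.3 m up in arm_base, mount offset 0.05 m along ee z.
T_ee = np.eye(4)
T_ee[2, 3] = 0.3
T_mount = np.eye(4)
T_mount[2, 3] = 0.05
T_hand = hand_mount_in_arm_base(T_ee, T_mount)
```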

Camera extrinsics may also vary across episodes.
Transforms under /kinematics/head_camera_to_*_arm_base should likewise be read from the current episode and must not be assumed constant. The hand mounting pose expressed in the head_camera frame (i.e., *_hand_mount_pose_in_cam) is:

$$ T^{hand_mount}_{head_camera} = (T^{head_camera}_{arm_base})^{-1} \cdot T^{hand_mount}_{arm_base} $$

where:

$T^{head_camera}_{arm_base}$ is the episode-specific transform provided under /kinematics/head_camera_to_*_arm_base.
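
A minimal sketch of this camera-frame computation follows. It assumes both inputs are rigid transforms, so the inverse can be formed in closed form from the rotation transpose; `invert_rigid` and `hand_mount_in_camera` are hypothetical helper names and the values are made up.

```python
import numpy as np


def invert_rigid(T):
    """Invert a 4x4 rigid transform: R -> R^T, t -> -R^T t."""
    T = np.asarray(T, dtype=np.float64)
    R, t = T[:3, :3], T[:3, 3]
    T_inv = np.eye(4)
    T_inv[:3, :3] = R.T
    T_inv[:3, 3] = -R.T @ t
    return T_inv


def hand_mount_in_camera(T_cam_in_base, T_hand_in_base):
    """Express the hand_mount pose in head_camera.

    T_cam_in_base: (4, 4), read from /kinematics/head_camera_to_*_arm_base.
    T_hand_in_base: (4, 4) or (T, 4, 4), the hand_mount pose in arm_base.
    """
    return invert_rigid(T_cam_in_base) @ np.asarray(T_hand_in_base)


# Toy values: camera 1 m above the arm base, hand 0.2 m along x at the same height.
T_cam = np.eye(4)
T_cam[:3, 3] = [0.0, 0.0, 1.0]
T_hand_base = np.eye(4)
T_hand_base[:3, 3] = [0.2, 0.0, 1.0]
T_hand_cam = hand_mount_in_camera(T_cam, T_hand_base)
```

The result can be compared against the stored *_hand_mount_pose_in_cam entries as a consistency check once those 6-vectors are converted to matrices.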
