ViTacFormer: Learning Cross-Modal Representation for Visuo-Tactile Dexterous Manipulation

Hardware Setup

Our hardware setup consists of an active vision system, a bi-manual robot arm, and two high-DoF dexterous hands (SharpaWave) equipped with high-resolution tactile sensors.

To enable rich data collection, we use a teleoperation system with precision exoskeleton for fine-grained arm and finger control, plus a VR interface to control the camera and guide demonstrations.

Installation

conda create -n vitacformer python=3.8.10
conda activate vitacformer
pip install torchvision
pip install torch
pip install opencv-python
pip install matplotlib
pip install tqdm
pip install einops
pip install h5py
pip install ipython
pip install transforms3d
pip install zarr
pip install transformers
cd dataset/ha_data && pip install -e .
cd detr && pip install -e .

Example Usages

Please download and unzip the example data here. To train ViTacFormer, run:

conda activate vitacformer
bash train.sh

Citation

If you find ViTacFormer useful, please consider citing it:

@misc{heng2025vitacformerlearningcrossmodalrepresentation,
      title={ViTacFormer: Learning Cross-Modal Representation for Visuo-Tactile Dexterous Manipulation}, 
      author={Liang Heng and Haoran Geng and Kaifeng Zhang and Pieter Abbeel and Jitendra Malik},
      year={2025},
      eprint={2506.15953},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2506.15953}, 
}

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ViTacFormer: Learning Cross-Modal Representation for Visuo-Tactile Dexterous Manipulation

Hardware Setup

Installation

Example Usages

Citation

FilesExpand file tree

README.md

Latest commit

History

README.md

File metadata and controls

ViTacFormer: Learning Cross-Modal Representation for Visuo-Tactile Dexterous Manipulation

Hardware Setup

Installation

Example Usages

Citation