Multimodal task specification is essential for enhanced robotic performance, where Cross-modality Alignment enables the robot to holistically understand complex task instructions. Directly annotating multimodal instructions for model training is impractical because paired multimodal data are sparse. In this study, we demonstrate that by leveraging the unimodal instructions abundant in real data, we can effectively teach robots to learn multimodal task specifications. First, we endow the robot with strong Cross-modality Alignment capabilities by pretraining a robotic multimodal encoder on extensive out-of-domain data. Then, we employ two operations, Collapse and Corrupt, to further bridge the remaining modality gap in the learned multimodal representation. This approach projects different modalities of the same task goal into interchangeable representations, enabling accurate robotic operation within a well-aligned multimodal latent space. Experiments spanning more than 130 tasks and 4,000 evaluations on both the simulated LIBERO benchmark and real robot platforms showcase the superior capabilities of the proposed framework, demonstrating a significant advantage in overcoming data constraints in robotic learning.
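The Collapse and Corrupt operations are only named at this level of detail here; the sketch below illustrates one plausible reading of them on goal embeddings from a frozen, CLIP-style multimodal encoder. The function names, the mean-centering choice for Collapse, the Gaussian-noise choice for Corrupt, and the `modality_mean` statistic are illustrative assumptions, not the paper's implementation.

```python
import torch

def collapse(z: torch.Tensor, modality_mean: torch.Tensor) -> torch.Tensor:
    """Subtract a modality-specific mean embedding (estimated offline over the
    training data) to 'collapse' the constant offset between modalities, then
    re-normalize to the unit hypersphere."""
    z = z - modality_mean
    return z / z.norm(dim=-1, keepdim=True)

def corrupt(z: torch.Tensor, noise_std: float = 0.1) -> torch.Tensor:
    """Add isotropic Gaussian noise at training time so the downstream policy
    tolerates the residual gap between text and image goal embeddings."""
    z = z + noise_std * torch.randn_like(z)
    return z / z.norm(dim=-1, keepdim=True)

# Training (unimodal, e.g. text-only goals):
#   g = corrupt(collapse(text_encoder(instruction), text_mean))
#   loss = policy_loss(policy(obs, g), expert_action)
# Evaluation (the unseen modality, e.g. image goals; no corruption):
#   g = collapse(image_encoder(goal_image), image_mean)
#   action = policy(obs, g)
```

Under this reading, Collapse removes the systematic offset between modality clusters so that text and image goals for the same task land near each other, while Corrupt prevents the policy from overfitting to the fine-grained location of the single training modality.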
Figure 1: Real robot experimental results. We train Robo-MUTUAL on datasets in which only visual or textual goals are available, and evaluate it on goals of the other modality (textual or visual, respectively). Success rates are averaged over 10 episodes and 3 seeds.
[Figure 1 panels: each task is evaluated in both directions, i.e., trained on visual goals and evaluated on textual goals, and trained on textual goals and evaluated on visual goals. Tasks: "put the duck in the green plate", "put the duck in the pot", "move the pot from right to left", "put the red cup on the red plate", "flip the red cup upright", "fold the cloth from right to left".]
Figure 2: Simulation experimental results. We train Robo-MUTUAL on 130 tasks from the LIBERO benchmark. Robo-MUTUAL achieves the highest success rate when evaluated with the modality that does not appear in the training dataset, demonstrating its effectiveness in achieving multimodal task specification through unimodal training. Success rates are averaged over 10 episodes and 3 seeds.
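For concreteness, a rough sketch of the cross-modal evaluation loop implied by the captions (per task: success rate averaged over 10 episodes and 3 seeds, with the goal given in the modality not seen during training) is shown below. The gym-style `env` interface, the `success` flag in `info`, and the `policy(obs, goal)` signature are assumptions for illustration only.

```python
import numpy as np

def cross_modal_success_rate(policy, env, goal, n_episodes=10, seeds=(0, 1, 2)):
    """Average success of `policy` on one task whose goal embedding `goal`
    comes from the modality NOT seen during training."""
    per_seed = []
    for seed in seeds:
        env.seed(seed)
        successes = 0
        for _ in range(n_episodes):
            obs, done, solved = env.reset(), False, False
            while not done:
                obs, _, done, info = env.step(policy(obs, goal))
                solved = solved or bool(info.get("success", False))
            successes += int(solved)
        per_seed.append(successes / n_episodes)
    return float(np.mean(per_seed))
```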