Inspired by Model Cards for Model Reporting (Mitchell et al.) and Lessons from Archives (Jo & Gebru), we’re providing some accompanying information about the VIMA model.
VIMA (VisuoMotor Attention, pronounced "v-eye-ma") is a novel Transformer model that ingests multimodal prompts and autoregressively outputs robot arm control actions. VIMA was developed primarily by researchers at NVIDIA.
October 2022
The VIMA model consists of a pretrained T5 model as the prompt encoder, several tokenizers to process multimodal inputs, and a causal decoder that autoregressively predicts actions given the prompt and the interaction history.
We released 7 checkpoints covering a spectrum of model capacity from 2M to 200M parameters.
The model is intended to be used alongside VIMA-Bench to study general robot manipulation with multimodal prompts.
The primary intended users of these models are AI researchers in robotics, multimodal learning, embodied agents, foundation models, etc.
The models were trained on data generated by oracles implemented in VIMA-Bench. The dataset includes 650K successful trajectories for behavior cloning. We use 600K trajectories for training; the remaining 50K trajectories are held out for validation.
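As a rough illustration of the 600K/50K split described above, the sketch below partitions trajectory indices into training and validation sets. The function name `split_indices`, the seeded random shuffle, and the default seed are assumptions made for this example; the text does not say how the held-out trajectories were actually selected.

```python
import random


def split_indices(num_total=650_000, num_val=50_000, seed=0):
    """Return (train_indices, val_indices) for a hold-out split.

    Hypothetical helper: a seeded shuffle is assumed here only so the
    split is reproducible; it is not taken from the VIMA codebase.
    """
    if not 0 <= num_val <= num_total:
        raise ValueError("num_val must be between 0 and num_total")
    rng = random.Random(seed)
    indices = list(range(num_total))
    rng.shuffle(indices)
    # The first num_val shuffled indices are held out for validation.
    val_indices = indices[:num_val]
    train_indices = indices[num_val:]
    return train_indices, val_indices


train_idx, val_idx = split_indices()
print(len(train_idx), len(val_idx))
```

Fixing the seed keeps the validation set identical across runs, which matters when comparing checkpoints of different sizes.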
We quantify the performance of trained models using the task success percentage aggregated over multiple tasks. We evaluate models on the task suite from VIMA-Bench and follow its proposed evaluation protocol. See our paper for more details.
Our provided model checkpoints are pretrained on VIMA-Bench and may not directly generalize to other simulators or to the real world. Techniques like SECANT can be applied to enable sim2real transfer.
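One way to read "task success percentage aggregated over multiple tasks" is sketched below: compute a success percentage per task, then average those percentages across tasks. The function name `success_percentage`, the `(task_name, succeeded)` input format, and the choice of an unweighted mean over tasks are assumptions for illustration; the paper and the VIMA-Bench protocol define the exact aggregation.

```python
from collections import defaultdict


def success_percentage(episodes):
    """Aggregate episode outcomes into per-task and overall success rates.

    episodes: iterable of (task_name, succeeded) pairs (hypothetical format).
    Returns (per_task_percent, mean_percent_over_tasks).
    """
    outcomes = defaultdict(list)
    for task_name, succeeded in episodes:
        outcomes[task_name].append(bool(succeeded))
    if not outcomes:
        raise ValueError("no episodes given")
    # Success percentage within each task.
    per_task = {
        task: 100.0 * sum(results) / len(results)
        for task, results in outcomes.items()
    }
    # Unweighted mean over tasks (an assumption, not the official protocol).
    overall = sum(per_task.values()) / len(per_task)
    return per_task, overall


per_task, overall = success_percentage(
    [("task_a", True), ("task_a", False), ("task_b", True)]
)
print(per_task, overall)
```

Averaging per-task percentages weights every task equally regardless of how many episodes it has; pooling all episodes before dividing would instead favor tasks with more rollouts.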
Our paper is posted on arXiv. If you find our work useful, please consider citing us!
@article{jiang2022vima,
title = {VIMA: General Robot Manipulation with Multimodal Prompts},
author = {Yunfan Jiang and Agrim Gupta and Zichen Zhang and Guanzhi Wang and Yongqiang Dou and Yanjun Chen and Li Fei-Fei and Anima Anandkumar and Yuke Zhu and Linxi Fan},
year = {2022},
journal = {arXiv preprint arXiv:2210.03094}
}