This repository is an implementation of a vision-language-action model for robotics, including both training and inference code. It is currently under heavy development, so check back soon for more updates!
The current VLA implementation is based on PaliGemma and adds features such as masked multi-modal inputs and flexible outputs.
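To make the masked-input idea concrete, here is a minimal sketch of how image and text tokens might be packed into one padded sequence with a boolean mask. This is illustrative only; the function and variable names below are assumptions, not part of this repository's API.

```python
import jax.numpy as jnp

def build_multimodal_input(image_tokens, text_tokens, max_len):
    """Concatenate image and text embeddings and pad to a fixed length.

    image_tokens: (num_patches, d_model) visual embeddings.
    text_tokens:  (num_text, d_model) text embeddings.
    Returns (tokens, mask), where mask is True at real (non-padded) positions.
    """
    tokens = jnp.concatenate([image_tokens, text_tokens], axis=0)
    num_real = tokens.shape[0]
    pad = jnp.zeros((max_len - num_real, tokens.shape[1]), dtype=tokens.dtype)
    tokens = jnp.concatenate([tokens, pad], axis=0)
    mask = jnp.arange(max_len) < num_real
    return tokens, mask

# Example: 256 image patches plus 16 text tokens padded to a 512-token window.
img = jnp.ones((256, 2048))
txt = jnp.ones((16, 2048))
tokens, mask = build_multimodal_input(img, txt, max_len=512)
print(tokens.shape, int(mask.sum()))  # (512, 2048) 272
```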
VLAx uses the experimental grain-oxe format for robotics data.
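As a rough sketch of the kind of per-timestep record an OXE-style robotics dataset typically provides (camera image, language instruction, action), here is a dummy example. The field names and shapes are assumptions for illustration, not the grain-oxe schema.

```python
import dataclasses
import jax.numpy as jnp

@dataclasses.dataclass
class RobotStep:
    image: jnp.ndarray   # (H, W, 3) camera observation
    instruction: str     # natural-language task description
    action: jnp.ndarray  # (action_dim,) continuous robot action

def example_step() -> RobotStep:
    """Build a dummy step shaped like one record from a robotics data pipeline."""
    return RobotStep(
        image=jnp.zeros((224, 224, 3), dtype=jnp.uint8),
        instruction="pick up the red block",
        # 7 dims as an assumed example: 6-DoF end-effector delta + gripper.
        action=jnp.zeros((7,), dtype=jnp.float32),
    )

step = example_step()
print(step.image.shape, step.action.shape, step.instruction)
```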