Teaching robots in the real world to respond to natural language queries with zero human labels — using pretrained large language models (LLMs), visual language models (VLMs), and neural fields.
[Paper] [Website] [Code] [Data] [Video]
Authors: Mahi Shafiullah, Chris Paxton, Lerrel Pinto, Soumith Chintala, Arthur Szlam.
warm_up_my_lunch.mp4
Tl;dr CLIP-Field is a novel weakly supervised approach for learning a semantic robot memory that can respond to natural language queries solely from raw RGB-D and odometry data with no extra human labelling. It combines the image and language understanding capabilites of novel vision-language models (VLMs) like CLIP, large language models like sentence BERT, and open-label object detection models like Detic, and with spatial understanding capabilites of neural radiance field (NeRF) style architectures to build a spatial database that holds semantic information in it.
To properly install this repo and all the dependencies, follow these instructions.
# Clone this repo.
git clone --recursive https://github.com/notmahi/clip-fields
cd clip-fields
# Create conda environment and install the dependencies.
conda create -n cf python=3.8
conda activate cf
conda install -y pytorch torchvision torchaudio cudatoolkit=11.1 -c pytorch-lts -c nvidia
pip install -r requirements.txt
# Install the hashgrid encoder with the relevant cuda module.
cd gridencoder
# For this part, it may be necessary to find out what your nvcc path is and use that,
# For me $which nvcc gives public/apps/cuda/11.1/bin/nvcc, so I used the following part
# export CUDA_HOME=/public/apps/cuda/11.1
python setup.py install
cd ..
We have an interactive tutorial and evaluation notebook that you can use to explore the model and evaluate it on your own data. You can find them in the demo/
directory, that you can run after installing the dependencies.
Once you have the dependencies installed, you can run the training script train.py
with any .r3d files that you have! If you just want to try out a sample, download the sample data nyu.r3d
and run the following command.
python train.py dataset_path=nyu.r3d
If you want to use LSeg as an additional source of open-label annotations, you should download the LSeg demo model and place it in the path_to_LSeg/checkpoints/demo_e200.ckpt
. Then, you can run the following command.
python train.py dataset_path=nyu.r3d use_lseg=true
You can check out the config/train.yaml
for a list of possible configuration options. In particular, if you want to train with any particular set of labels, you can specify them in the custom_labels
field in config/train.yaml
.
We would like to thank the following projects for making their code and models available, which we relied upon heavily in this work.
- CLIP with MIT License
- Detic with Apache License 2.0
- Torch NGP with MIT License
- LSeg with MIT License
- Sentence BERT with Apache License 2.0