Skip to content

OpenRobotLab/Grounded_3D-LLM

Repository files navigation

Grounded 3D-LLM with Referent Tokens

This repository will release the official implementation of "Grounded 3D-LLM with Referent Token".

[Paper] [Arxiv] [Website] [Data]

Abstract

Prior studies on 3D scene comprehension have primarily developed specialized models for specific tasks or required task-specific fine-tuning. In this study, we propose Grounded 3D-LLM, which explores the potential of 3D large multi-modal models (LMMs) to consolidate various 3D visual tasks within a unified generative framework. The model utilizes a series of ``referent tokens'' to reference 3D scenes, enabling the handling of sequences that interleave 3D and textual data arbitrarily. 3D vision tasks are naturally transformed into language formats through task-specific prompts. To effectively associate the scene with text, we curate the grounded language datasets either from human-annotated sources or by bootstrapping existing object labels at the phrase level. We then employ Contrastive Language-Scene Pre-training (CLASP) to bridge the divide between 3D vision and language models, thus facilitating the use of referent tokens in subsequent language modeling tasks. Our comprehensive evaluation covers open-ended tasks like 3D visual question answering and dense captioning, as well as close-ended tasks such as object detection and language grounding.

image-20240515195822834

Grounded Scene Caption Data Visualization

Please refer to the data visualization page for detailed instructions on the minimal setup for visualizing the grounded scene caption dataset.

Model Training

Step 1: Environment setups and dataset preparation.

Grounded 3D-LLM is trained using 4 or 8 NVIDIA Tesla A100 GPUs. Please refer to the installation page for detailed installation scripts for model training.

Please download all the scene-language datasets the from HuggingFace. The datasets are listed as follows:

Dataset # for Train # for Eval
ScanRefer 36639 9503
Scan2Cap 36639 9503
ScanQA 26516 9402
Object-Description 28197 7912
GroundedSceneCaption 84301 --
EmbodiedPlanning 3500 --
EmbodiedDialogue 129799 --
GlobalSceneCaption 4065 --
3D-LLM 27627 --
Alpaca 51865 --

Please download the pretrained weights from HuggingFace and place them in $ROOT_PATH/pretrained/.

Please download the pretrained LLM weights (Tiny-Vicuna-1B) and store them in $ROOT_PATH/pretrained/llm_weight/Tiny-Vicuna-1B/

If you would like to utilize our pretrained model checkpoints, they can be obtained from HuggingFace. Please save these in the checkpoint directory located at $ROOT_PATH/saved.

Steps Model Checkpoints
1 Mask3D-CLIP
2 Mask3D-CLASP
3 Grounded 3D-LLM

After completing the downloads, the root folder should be organized as follows:

ROOT_PATH
├── data                            # data
│   ├── langdata
│   │   │── groundedscenecaption_format.json
│   │   │── scanrefer_format.json
│   │   │── scanqa_format.json
│   │   │── ...
│   ├── processed
│   │── rawscannet
│   │   │── scans
│   │   │── scannetv2-labels.combined.tsv
│── pretrained                      # pretrained weights for model training
│   │── bert-base-uncased           # bert pretrained weights
│   │── label_clip_features.pth     # clip's text features for scannet-200 class names
│   │── llm_weight
│   │   │── Tiny-Vicuna-1B          # pretrained weights from https://huggingface.co/Jiayi-Pan/Tiny-Vicuna-1B
│── saved                           # model checkpoints saved path
│   │── step1_mask3d_clip_4GPUS
│   │── step2_mask3d_lang_4GPUS
│   │── step3_mask3d_lang_4GPUS

Step 2: Pre-train the Mask3D detector:

bash final_scripts/step1_pretrain_detector.sh

Step 3: After training the detector, pre-train the detector using Contrastive Language-Scene Pre-training:

bash final_scripts/step2_pretrain_3d-clasp.sh

Step 3: After contrastive pre-training, train and inference the entire Grounded 3D-LLM:

bash final_scripts/step3_train_grounded3dllm.sh

The model checkpoints will be saved in saved/step3_mask3d_lang_4GPUS/last-checkpoint.pth, and the inference results will be stored in saved/step3_mask3d_lang_4GPUS/${TIMESTAMP}/.

Note that if you want to run the inference, simply comment out the training command in these scripts.

Model Evaluation

To evaluate all the respective results, run the following command:

bash final_scripts/eval_llm.sh ./saved/step3_mask3d_lang_4GPUS/${TIMESTAMP}/

Demo

To interact with Grounded 3D-LLM via the demo chat, first run the model inference and ensure that the scene_features are saved in saved/step3_mask3d_lang_4GPUS/scene_features. After that, launch the gradio demo chat by running the following command:

bash web_chat_demo/web_chat_demo.sh 

Please note that the visualization of the related segmentation masks is not yet supported in the Gradio demo.

ToDo List

  • Release Grouded Scene Caption data (ScanNet).
  • Release data visualizer.
  • Release data generation code.
  • Release pre-trained checkpoints.
  • Release Grounded 3D-LLM training and evaluation.
  • Demo supports mask visualization.

Acknowledgement

Many thanks to the following open-source projects: