The idea behind this demo is to combine Meta's Segment Anything Model (SAM) and BLIP/BLIP2 to identify and segment arbitrary volumes (regardless of the underlying mesh composition) in a 3D scene. The demo is built with THREE.js for rendering the scene.
- 2D pixel coordinates (currently from the pointer) are sent to the processing API along with a 2D render of the scene
- SAM creates masks of the object under the pointer
- BLIP identifies the object ⚒️ WORK IN PROGRESS ⚒️
- Optimisation (edge extraction) is applied to the mask to reduce the number of points to be rendered
- The 2D mask is projected back into 3D space and rendered as a bounding box
Basically 2D to 3D space conversions, nothing fancy.
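For context, here is a rough sketch of what the processing side does with a point prompt, using the public `segment_anything` and OpenCV APIs. This is a minimal illustration, not the demo's actual implementation; the checkpoint path and function name are assumptions.

```python
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Load SAM (checkpoint path is an assumption; see the setup section below)
sam = sam_model_registry["vit_h"](checkpoint="lib/sam/sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

def mask_edges_for_point(render_rgb: np.ndarray, x: int, y: int) -> np.ndarray:
    """Segment the object under pixel (x, y) and return its edge points."""
    predictor.set_image(render_rgb)  # HxWx3 uint8 RGB render of the scene
    masks, scores, _ = predictor.predict(
        point_coords=np.array([[x, y]]),
        point_labels=np.array([1]),  # 1 = foreground point
        multimask_output=True,
    )
    best = masks[np.argmax(scores)].astype(np.uint8)

    # "Edge extraction": keep only the mask contour so far fewer points
    # need to be projected back into 3D and rendered
    contours, _ = cv2.findContours(best, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    return np.vstack([c.reshape(-1, 2) for c in contours])  # Nx2 pixel coords
```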
- `./notebooks` — my initial prototyping of the image segmentation; contains the logic that was eventually ported into a processing API for demo purposes
- `./demo` — the demo code for the web viewer, which uses a modified three-gltf-viewer (for great scene defaults and ease of swapping 3D scenes) + the processing API
- `./demo/lib/api` — the processing API (segmentation + identification of volumes + optimisations like edge extraction)
- `./demo/lib/three` — the adapters used in THREE for 2D-to-3D projection and vice versa
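To give a sense of what the `./demo/lib/three` adapters do, the unprojection of a screen-space point back into world space boils down to the math below, shown here as a standalone numpy sketch (the demo does this with THREE.js camera helpers; the matrix and parameter names are illustrative):

```python
import numpy as np

def unproject(px: float, py: float, depth_ndc: float,
              view: np.ndarray, projection: np.ndarray,
              width: int, height: int) -> np.ndarray:
    """Convert a pixel (px, py) plus an NDC depth value into world-space XYZ."""
    # Pixel -> normalised device coordinates in [-1, 1] (y is flipped)
    ndc = np.array([
        (px / width) * 2.0 - 1.0,
        1.0 - (py / height) * 2.0,
        depth_ndc,
        1.0,
    ])
    # Invert the combined camera transform and undo the perspective divide
    world = np.linalg.inv(projection @ view) @ ndc
    return world[:3] / world[3]
```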
The code requires `python>=3.8`, as well as `pytorch>=1.7` and `torchvision>=0.8`. Please follow the instructions here to install the PyTorch and TorchVision dependencies. Installing both with CUDA support is strongly recommended.
I also recommend using a package manager for the Python environment; I use Mamba (a fast conda clone).
- Download the pretrained weights
cd lib/sam
# Windows / Linux
wget https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth
# Mac
curl https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth -o sam_vit_h_4b8939.pth
- Install SAM
From the root directory
pip install git+https://github.com/facebookresearch/segment-anything.git
- Install dependencies
# Conda
conda install --file requirements.txt
# Mamba
mamba install --file requirements.txt
# Pip
pip install -r requirements.txt
HINT
You can test the installation by running notebooks/generate_mask.ipynb
While the demo works on CPU, running it on a GPU is strongly recommended. The demo automatically detects whether a GPU is available and uses it for processing.
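For reference, the detection amounts to the standard PyTorch check below (a minimal sketch, not the demo's exact code):

```python
import torch

# Pick the GPU when one is available, otherwise fall back to CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Segmentation/identification will run on: {device}")
```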
From the root directory
- Start the processing server
flask --app demo/lib/api/app run
- In another shell, start the dev server
npm run dev --prefix demo
- THREE.js for rendering the scene
- three-gltf-viewer fork for demo scene loading and playground
- Segment Anything Model (SAM) for segmentation of scene renders
- BLIP/BLIP2 for identification/captioning of volumes in the scene
- Use BLIP for image identification/captioning (see the sketch after this list)
- Send smaller renders to the processing API and convert them back post-processing
- Generate a 3D grid from multiple 2D masks produced by SAM and use that for projection
- Use a BVH for improved raycasting performance
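As a starting point for the BLIP item above, captioning a cropped render of the segmented object could look roughly like this with the Hugging Face `transformers` BLIP pipeline. The model choice, file name, and function name are assumptions, not what the demo currently ships.

```python
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

def caption(image: Image.Image) -> str:
    """Return a short caption for a cropped render of the segmented object."""
    inputs = processor(images=image, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=20)
    return processor.decode(out[0], skip_special_tokens=True)

# Hypothetical usage on a crop produced from the SAM mask
print(caption(Image.open("segmented_crop.png").convert("RGB")))
```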
- Slow on CPU (most of the time is spent on segmentation/identification)
- The bounding box is not always accurate (especially when the camera is not directly facing the object at segmentation time)
- To improve accuracy, you would need to take multiple renders of the scene from different angles and then combine the masks to get a more accurate bounding box (whether in processing using a virtual grid, or client side with an offscreen canvas?); see the sketch below
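One cheap way to combine per-view results, assuming each render already yields an axis-aligned box in world space (an assumption about the available data, not something the demo produces today), is simply to intersect the boxes:

```python
import numpy as np

def intersect_aabbs(boxes):
    """Intersect world-space axis-aligned boxes (min_xyz, max_xyz), one per view.

    The result is the tightest box consistent with every angle's mask.
    """
    mins = np.max([lo for lo, _ in boxes], axis=0)
    maxs = np.min([hi for _, hi in boxes], axis=0)
    if np.any(mins > maxs):
        raise ValueError("The per-view boxes do not overlap")
    return mins, maxs
```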