Note: This project is in its very early stages and will change drastically in the near future. Things may break.
A simple integration of the Segment Anything Model (SAM), Molmo, and Whisper to segment objects using voice and natural language.
Capabilities:
- Segment objects with SAM2.1 using point prompts.
- Points can be obtained by prompting Molmo with natural language. Molmo accepts input either from the text box (typing) or through Whisper via the microphone (speech to text).
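Molmo answers pointing prompts with coordinates embedded in its text output as `<point>`/`<points>` tags, with x/y expressed as percentages of the image size (0-100). Below is a minimal sketch of turning such a reply into pixel-space points for SAM2; the helper name and regex are illustrative, not this repo's code, so verify the tag format against the Molmo checkpoint you load:

```python
import re

def parse_molmo_points(text, width, height):
    """Extract (x, y) pixel coordinates from Molmo's <point>/<points> tags.

    Molmo emits coordinates as percentages (0-100) of the image size,
    so each value is scaled by width/100 or height/100. The regex matches
    both single-point attributes (x=, y=) and numbered ones (x1=, y1=, ...).
    """
    points = []
    for m in re.finditer(r'x\d*="([\d.]+)"\s+y\d*="([\d.]+)"', text):
        x, y = float(m.group(1)), float(m.group(2))
        points.append((x * width / 100.0, y * height / 100.0))
    return points
```

The resulting pixel coordinates can then be used as point prompts for the SAM2 predictor.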
Run the Gradio demo with:

```
python app.py
```
*(Demo clip: sam2_molmo_whisper-2024-10-11_07.09.47.mp4)*
- Added a tabbed interface for video segmentation. The process remains the same: prompt via text or voice, upload a video, and get segmentation maps for the objects.
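In both tabs, the points obtained from Molmo end up as SAM2 point prompts, which the predictor expects as a list of (x, y) coordinates plus matching foreground/background labels (1 = foreground, 0 = background). A small sketch of that conversion, assuming every Molmo point marks a foreground object (`to_sam_prompts` is illustrative, not this repo's code):

```python
def to_sam_prompts(points):
    """Convert a list of (x, y) tuples into SAM2-style prompt lists:
    point_coords as [[x, y], ...] and point_labels as [1, ...],
    treating every point as foreground (label 1).
    """
    coords = [[float(x), float(y)] for x, y in points]
    labels = [1] * len(coords)
    return coords, labels
```

In practice these lists would be wrapped in NumPy arrays before being passed to the SAM2 predictor's `predict` call.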
```
git clone https://github.com/sovit-123/SAM_Molmo_Whisper.git
cd SAM_Molmo_Whisper
```
Install PyTorch, Hugging Face Transformers, and the rest of the base requirements:

```
pip install -r requirements.txt
```
It is highly recommended to clone SAM2 into a separate directory, outside this project directory, and run the installation commands there:

```
git clone https://github.com/facebookresearch/sam2.git && cd sam2
pip install -e .
```
After installing the requirements, install spaCy's en_core_web_sm model:

```
spacy download en_core_web_sm
```
Finally, run the demo:

```
python app.py
```