Added whisperX support #125
Love it! Will test this when I can. |
I tested it and was able to get it working. Great work! Please take a look at the changes in 1.2 and fix merge conflicts and I will approve it. 👍 |
I updated the code to resolve the merge conflicts. Since the documentation was moved out of the Readme, I need to update the documentation accordingly, so this PR is not yet ready to be merged.
I'm trying to test on my end, cloned your repo and failed to build on my Macbook. Will try on another machine shortly.
|
You need gcc from xcode or homebrew. Also, poetry is a huge pain in the ass. |
Not sure why this failed on my Macbook and you two aren't seeing it...
@@ -7,6 +7,7 @@ RUN export DEBIAN_FRONTEND=noninteractive \
    && apt-get -qq update \
    && apt-get -qq install --no-install-recommends \
    ffmpeg \
    git \
add gcc and python3-dev packages here
gcc \
python3-dev \
Dockerfile.gpu
Outdated
@@ -11,6 +11,7 @@ RUN export DEBIAN_FRONTEND=noninteractive \
    python${PYTHON_VERSION}-venv \
    python3-pip \
    ffmpeg \
    git \
And here as well...
gcc \
python3-dev \
This was within the docker container. I haven't tried running it natively. I added notes and changes. Thanks for this. |
Oh, apologies. I'll look into this for you and try testing again. I don't know how it built for me without gcc. |
I can replicate your issue when building the Docker image on Apple's M1. This is probably related to a missing precompiled Python wheel, which forces the ARM architecture to use a compiler at build time. While your suggested fix solves the build issue for me, I still run into problems when trying to transcribe an MP3, which crashes the Docker container:
[2023-10-09 10:44:49 +0000] [31] [INFO] Started server process [31]
[2023-10-09 10:44:49 +0000] [31] [INFO] Waiting for application startup.
[2023-10-09 10:44:49 +0000] [31] [INFO] Application startup complete.
[2023-10-09 10:45:29 +0000] [1] [WARNING] Worker with pid 31 was terminated due to signal 11
[2023-10-09 10:45:29 +0000] [55] [INFO] Booting worker with pid: 55
/app/.venv/lib/python3.10/site-packages/pyannote/audio/core/io.py:43: UserWarning: torchaudio._backend.set_audio_backend has been deprecated. With dispatcher enabled, this function is no-op. You can remove the function call.
  torchaudio.set_audio_backend("soundfile")
/app/.venv/lib/python3.10/site-packages/torch_audiomentations/utils/io.py:27: UserWarning: torchaudio._backend.set_audio_backend has been deprecated. With dispatcher enabled, this function is no-op. You can remove the function call.
  torchaudio.set_audio_backend("soundfile")
torchvision is not available - cannot save figures
[2023-10-09 10:45:32 +0000] [55] [INFO] Started server process [55]
[2023-10-09 10:45:32 +0000] [55] [INFO] Waiting for application startup.
[2023-10-09 10:45:32 +0000] [55] [INFO] Application startup complete
This issue persists even when setting ASR_ENGINE to openai_whisper, but not when using onerahmet/openai-whisper-asr-webservice:latest as the base image.
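One way to narrow down a signal 11 (segmentation fault) like the one reported above is to turn on Python's built-in `faulthandler` in the worker process, so the interpreter dumps the Python stack before the process dies. A minimal sketch, assuming a few lines can be added near the top of the app's entry module; the helper name `enable_crash_traces` and its placement are hypothetical and not part of this PR:

```python
import faulthandler
import sys


def enable_crash_traces(stream=sys.stderr) -> bool:
    """Ask Python to dump tracebacks of all threads on fatal signals such as SIGSEGV."""
    if not faulthandler.is_enabled():
        # The stream must be a real file object with a file descriptor.
        faulthandler.enable(file=stream, all_threads=True)
    return faulthandler.is_enabled()


if __name__ == "__main__":
    print("faulthandler enabled:", enable_crash_traces())
```

Setting the environment variable `PYTHONFAULTHANDLER=1` on the container should have the same effect without touching the code. The traceback only shows which Python call was active when the native crash happened; it does not by itself explain the cause.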
I have successfully run previous versions of the ASR engine in Docker containers, on both the M1 and WSL with CUDA. Last night, on my WSL box, I tried running the DennisTheD:main image and was able to use the Swagger interface to render a test file with the whisperx engine. Diarization tests using txt output rendered the transcript without diarization notations. It used the CPU rather than CUDA. Trying diarization with other file formats caused an exception in the SRT/VTT export; I don't recall which one. What do you need me to validate? M1 native or Docker?
Tested with Docker with GPU. Standard transcriptions work without diarization (diarize=false).
Testing Prep
# Working Dir
WORKING_DIRECTORY="/mnt/user/docker/whisper-asr-webservice"
mkdir -p "${WORKING_DIRECTORY}"
cd "${WORKING_DIRECTORY}"
# Make Folders & Files
mkdir -p ./cache/{pip,poetry,whisper,faster-whisper}
ls -alt "${WORKING_DIRECTORY}/cache"
# Clone Repository
git clone https://github.com/DennisTheD/whisper-asr-webservice.git whisper-asr-webservice_DennisTheD
# https://github.com/ahmetoner/whisper-asr-webservice/pull/125
# NOTE: The engine can be activated by setting the ASR_ENGINE to "whisperx". In order to use the diarization pipeline, a Huggingface access token needs to be supplied, using the "HF_TOKEN" variable. You also need to accept some user agreements (see https://github.com/m-bain/whisperX for further details). If you do not need diarization, the token is not required.
cd whisper-asr-webservice_DennisTheD/
# git clean -fd
# git reset --hard
git pull
cd ..
Docker Compose File
Docker Build & Run
Error Message:
|
@AustinSaintAubin Did you provide the HF_TOKEN? That's required for diarization. |
testing:
additional thoughts:
|
I haven't seen anything in the whisper community that can run on the M1 with anything other than the CPU.
Again, the default engine runs fine with CUDA using the current Docker image on my Win10 machine, although now I'm starting to question whether I pulled that in WSL or not. I've also been able to run https://github.com/MahmoudAshraf97/whisper-diarization in WSL with GPU support, but I remember I had some issues getting it going because of dependencies. So I guess what I'm asking is whether this is a whisperx issue or something with my setup.
I am able to get WhisperX on CUDA working with WSL. Can you post which GPU drivers you have? It could be a weird issue with driver and CUDA incompatibility. I am running 537.13 on a Geforce RTX 2080 Ti. |
(thanks for notifying me @ayancey) I built the docker gpu image. And I had some problems related to the HF_TOKEN, where it likely wouldn't get recognized from the docker-compose.yml. Or maybe there was a delay with the accepted user conditions. The container exited with:
the env of my docker-compose.yml:
So I pasted it in ./app/mbain_whisperx/core.py and that got it to work. I will need to retest with the env again. Is batched inference being used so far? The large model used almost 10 GB of VRAM on my 3060 and it wasn't more performant or faster than the normal faster-whisper implementation. Will test more and update when I have the time 👍
To be honest, I don't know. I'll do some benchmarks comparing the speed of all three backends. I'm most excited for the increased accuracy of timestamps and diarization. |
@Deathproof76 I'm not sure if it's the same, but the Whisper reqs noted 3 HF gated models that I needed to clear. There was one additional. See the note in your error: |
@ayancey I'm using 537.58. I originally ran the readme's cmd where you clone the image from docker hub. That's the one that uses CUDA. I'm going to go back to square one and see if I can pull it from source and have it run the same. Right now I have this PR as a separate remote and I'm not comparing apples to apples. |
I have tested again with the environment variable set, and checked that the dependent pipeline pyannote/speaker-diarization-3.0 is not gated for me. I am still not able to download the 'pyannote/speaker-diarization-3.0' pipeline. Not sure if HF_TOKEN is being passed or handled correctly.
https://huggingface.co/pyannote/speaker-diarization-3.0
|
Next step for me will probably be looking at the whisperx repo directly and see if I can get that to work anywhere first. |
Try making a new token. This took a couple of tries for me to get working on the original WhisperX repo. I don't think it's related to this PR.
https://huggingface.co/pyannote/speaker-diarization https://huggingface.co/pyannote/segmentation Sorry, I had not mentioned it: I had already visited all three repos and accepted the EULAs.
Fixed Docker caching issues
Updated Readme
I think pyannote released a newer segmentation model, v3 (https://huggingface.co/pyannote/segmentation-3.0). After accepting the EULA it should work fine (at least it does for me on Windows + WSL).
That was it, at least for the CPU version (the GPU version is still having the same issues as before). I accepted the EULA for segmentation-3.0 and it is now working as expected.
app/mbain_whisperx/core.py
Outdated
if torch.cuda.is_available():
    device = "cuda"
    model = whisper.load_model(model_name).cuda()
Why not use whisperx model for transcription?
Thank you for your feedback! I fixed this issue, so whisperx should also be used for transcription now.
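For readers following the review exchange above, device selection can be kept separate from model loading so that the same values feed whichever engine is configured. This is only a sketch and not the code from this PR: the helper names `pick_device` and `pick_compute_type` are hypothetical, and the float16/int8 split is an assumed default rather than something stated in the thread.

```python
def pick_device() -> str:
    """Return "cuda" when PyTorch is installed and sees a GPU, otherwise "cpu"."""
    try:
        import torch  # imported lazily so the sketch also runs without PyTorch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


def pick_compute_type(device: str) -> str:
    # Assumption: float16 on GPU and int8 on CPU. Both are valid CTranslate2
    # compute types, but they are not values taken from this PR.
    return "float16" if device == "cuda" else "int8"


if __name__ == "__main__":
    device = pick_device()
    print(device, pick_compute_type(device))
    # The whisperX README loads its model roughly like this:
    # model = whisperx.load_model(model_name, device, compute_type=pick_compute_type(device))
```

Logging the chosen device at startup would also make reports such as "it did not use CUDA" easier to confirm from the container logs.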
Hi! I would really like to see the WhisperX support in the project, is it possible to somehow speed up the code review procedure? Maybe you need some help? |
Dear @m-bain, I have concerns regarding the |
Whisperx has a pretty fair license: https://github.com/m-bain/whisperX/blob/main/LICENSE |
So I can confirm that I was able to update the docker file by adding
And I can confirm that it's offloading to my gpu. That said, I still can't confirm the output yet, TXT files are no good and I get |
I am having the same or similar issue, where diarization does not seem to work. |
@ahmetoner that's fine, yeah. Sorry, some of the formats are not fully tested with diarize.
So which format actually works? |
|
I think that should be a requirement before this gets merged. (How are you using the JSON?) I use the ASR through Obsidian transcription plugin and it dumps the text directly, so I'll see what I can do with that and maybe figure a way to fix the others as well. |
Looks like the result writers for whisperX (WriteVTT / WriteSRT) already do that through the SubtitlesWriter class, so the output should technically include the speakers already. I haven't run it myself to confirm yet.
# add [$SPEAKER_ID]: to each subtitle if speaker is available
prefix = ""
if speaker is not None:
    prefix = f"[{speaker}]: "
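To make the effect of that prefix concrete, here is a small standalone sketch that renders one SRT cue with an optional speaker label. It does not use whisperX's SubtitlesWriter; the functions `format_timestamp` and `srt_entry` are hypothetical helpers written for illustration, and the speaker label `SPEAKER_00` is just an example value.

```python
from typing import Optional


def format_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def srt_entry(index: int, start: float, end: float, text: str,
              speaker: Optional[str] = None) -> str:
    """Render one SRT cue, adding "[SPEAKER]: " only when a speaker is known."""
    prefix = f"[{speaker}]: " if speaker is not None else ""
    return (
        f"{index}\n"
        f"{format_timestamp(start)} --> {format_timestamp(end)}\n"
        f"{prefix}{text.strip()}\n"
    )


if __name__ == "__main__":
    print(srt_entry(1, 0.0, 2.5, "Hello there", speaker="SPEAKER_00"))
```

When diarization is off or fails, `speaker` stays `None` and the cue falls back to plain text, which matches the reports of transcripts rendering without speaker notations.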
Add Speakers to WriteVTT & WriteSRT
FYI, The commit If you want to get it working then edit Dockerfile or Dockerfile.gpu and change the clone of whisperX to the following to use whisperX before the breaking change
|
I accidentally built and deployed the buggy image file to docker, so if you need to patch after deploy you'll need to make sure you use Does anyone know what the underlying issue is? |
I created a PR to update the dockerfiles in this PR. DennisTheD#2 |
Hello @DennisTheD, |
@DennisTheD any chance you can help to resolve the conflicts so we can get this merged. Whisperx support for this project would be awesome. |
If I had access to the project I'd resolve it myself. If Dennis doesn't show up to finish it then we'll have to fork the fork and send a PR back upstream. Honestly though @hlevring you can run this fork just fine on its own. I've had it deployed in my lab for months and even had it running in the cloud at one point. |
Hey ;) Sorry for the late reply. I currently don't have access to my dev machine and cannot make or test the required changes. |
I added support for the whisperX engine.
The engine can be activated by setting the ASR_ENGINE to "whisperx". In order to use the diarization pipeline, a Huggingface access token needs to be supplied, using the "HF_TOKEN" variable. You also need to accept some user agreements (see https://github.com/m-bain/whisperX for further details). If you do not need diarization, the token is not required.
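Several reports in this thread came down to whether HF_TOKEN actually reached the process, so a startup check along these lines may help. This is a sketch under stated assumptions rather than the PR's implementation: `EngineConfig` and `load_engine_config` are hypothetical names, and the fallback engine value `openai_whisper` is assumed for illustration.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class EngineConfig:
    engine: str
    hf_token: Optional[str]

    @property
    def can_diarize(self) -> bool:
        # Diarization needs the whisperx engine and a non-empty token.
        return self.engine == "whisperx" and bool(self.hf_token)


def load_engine_config(env: Mapping[str, str] = os.environ) -> EngineConfig:
    """Read ASR_ENGINE and HF_TOKEN from the environment (or any mapping)."""
    engine = env.get("ASR_ENGINE", "openai_whisper")  # fallback value is an assumption
    token = env.get("HF_TOKEN", "").strip() or None   # treat blank as missing
    return EngineConfig(engine=engine, hf_token=token)


if __name__ == "__main__":
    cfg = load_engine_config()
    print(f"engine={cfg.engine} diarization_possible={cfg.can_diarize}")
```

Printing only whether a token is present, never the token itself, keeps the secret out of the container logs. A valid token is still not sufficient on its own: the gated pyannote models mentioned in the thread also require accepting their user agreements.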