Feature Request + my code: audio cleanup #108
Comments
Have you tried using the |
AFAIK vad_filter is a "voice activity detection" filter, so its job is to crop only the start of the audio, not silence in the middle or at the end. Torchaudio even recommends applying VAD, then reverse + VAD + reverse, to crop both ends. But it cannot remove silence anywhere else.
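For illustration, here is a minimal sketch of the reverse + trim + reverse trick described above. It assumes a plain amplitude threshold as a stand-in for a real VAD; the function names and the threshold value are hypothetical, not torchaudio's API.

```python
def trim_leading_silence(samples, threshold=0.02):
    """Drop samples from the start until one exceeds the amplitude threshold."""
    for i, s in enumerate(samples):
        if abs(s) > threshold:
            return samples[i:]
    return []  # nothing above the threshold: the whole clip counts as silence


def trim_both_ends(samples, threshold=0.02):
    """Trim the start, reverse, trim the (former) end, then reverse back."""
    head_trimmed = trim_leading_silence(samples, threshold)
    tail_trimmed = trim_leading_silence(head_trimmed[::-1], threshold)
    return tail_trimmed[::-1]


# Toy clip: silence at both ends and a silent gap in the middle.
clip = [0.0, 0.0, 0.5, -0.4, 0.0, 0.0, 0.3, 0.0, 0.0]
trimmed = trim_both_ends(clip)
print(trimmed)
```

Both ends are cropped, but the silent gap between the two non-silent parts is left untouched, which is the limitation being discussed.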
I think everyone who uses a generic or overly simple implementation of VAD is wrong, because it detects energy (based on amplitude): "Energy in audio processing is typically calculated as the square of the amplitude values." Here is a better implementation for Whisper: https://github.com/wavey-ai/mel-spec?tab=readme-ov-file Basically, you want to implement VAD based on the frequency spectrum; there you can recognize speech much better than by just separating noise from quiet! This implementation can also detect quiet between audio parts, not just at the start or end, and it supports streaming for Whisper.
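To make the "energy" definition quoted above concrete, here is a rough sketch of a naive energy-based detector. The frame length, threshold and function names are assumptions chosen for the toy signal; this is not the mel-spec project's code and not faster-whisper's VAD.

```python
def frame_energies(samples, frame_len):
    """Mean squared amplitude of each full, non-overlapping frame."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(sum(s * s for s in frame) / frame_len)
    return energies


def energy_vad(samples, frame_len, threshold):
    """Flag a frame as active (True) when its energy exceeds the threshold."""
    return [e > threshold for e in frame_energies(samples, frame_len)]


# Toy signal: quiet, speech-like, quiet, then a loud non-speech click.
signal = [0.0, 0.01, 0.0, 0.01,
          0.3, -0.4, 0.35, -0.3,
          0.0, 0.0, 0.01, 0.0,
          0.9, -0.9, 0.9, -0.9]
flags = energy_vad(signal, frame_len=4, threshold=0.01)
print(flags)
```

The detector does find the quiet frame in the middle, but it also marks the loud click as active, because energy alone cannot tell speech from any other loud sound. That is the weakness a spectrum-based VAD is meant to address.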
That mel-spec implementation seems to require significantly more work to set up than just using sox via torch, which may even already be a dependency. But I'm not the owner, and maybe his skills make it easy enough. In any case, I personally have never noticed a lack of precision in the "naive" amplitude-based approach.
@thiswillbeyourgithub What? Please, I did not want to offend anyone.
STFT, Mel-Spec, Quantization: all of that is already included in PyTorch (torch, torchaudio). All of this is already included in faster-whisper, which is a dependency. You can check it here: https://github.com/SYSTRAN/faster-whisper/blob/master/requirements.txt This is how to do it in PyTorch:
They have an example of real-time streaming from a microphone, and they even run it in JavaScript in the browser: https://github.com/wavey-ai/mel-spec/blob/main/examples/browser/worker.js ...but they say they tested it on a MacBook M2. I'm not sure mobile devices could handle that.
Why do you apply a low-pass and high-pass filter from 50 to 5000 Hz? That is not the frequency range of the human voice:
Most low-quality speech compression codecs work at around 8 kHz. But Whisper was trained on 16 kHz audio, so I think we should not lower the input quality on purpose, because the prediction can lose precision. I did not test this, though; it is just my observation. Anyway, I'm just speaking from my experience as a developer with machine learning and audio processing skills. I will fork it and tune it for the best performance and quality for myself. You guys can decide whether you find it feasible.
Also, if you want to use VAD, you should apply this commit from faster-whisper, which fixes a few things and has not been merged into the main branch: SYSTRAN/faster-whisper@2f6790a
Hi. I'm really sorry; it appears I answered without proper knowledge and wasted your time.
I only know Python, and 98% of this repo is in Python, so at first glance adding this other code seemed like a more significant endeavor than a couple of lines of Python. But again, I'm not the owner, so please don't assume they're as incompetent as I am!
Whoops, indeed, that's totally a mistake on my part. I don't recall exactly where I got those figures. But IIRC it was also because I preferred to degrade the quality of the recording in order to be sure to remove other noises that are also in that range.
Thank you for taking the time to explain those aspects to me. Also, I was just passing by and am just a user of this repo, not affiliated with the owner, who might completely disagree with my thoughts on this! And thanks for linking the other PR.
Hi,
I'm a happy user of faster-whisper-server. I mainly use it as a Whisper backend for open-web-ui, and I recently opened an issue there to share my code for high-quality audio cleaning that removes silences INSIDE the audio (whereas VAD only handles the start of the audio). As I ran into issues several times with faster-whisper-server where a sentence would get repeated ad nauseam because I had stopped talking for more than 30 s in a recording, I thought about requesting this feature here and sharing my code to avoid duplicated effort.
Here's the link to the issue: open-webui/open-webui#5972
Here's the content: