Asynchronously transcribe audio files by splitting them into chunks, transcribing the chunks in parallel, and intelligently joining the results, yielding transcriptions nearly identical to transcribing the full audio at once, but in a fraction of the time.
Simply install the package via pip. Additionally, `ffmpeg` or `libav` must be installed on the system.

```bash
pip install async-whisper
```

To set up `ffmpeg`, follow the `pydub` setup instructions.
The development installation requires `git` and `conda` to already be available on the system. Contributions are welcome: please create a fork, make your changes there, and then open a Pull Request against the main GitHub repository.
```bash
# Clone the code
git clone git@github.com:DamianB-BitFlipper/async-whisper.git
cd async-whisper

# Install development dependencies
conda env create -f environment.yml
conda activate async-whisper

# Run a pip editable install
pip install -e .
```
The most useful API entry point is the `AsyncWhisper.transcribe_audio` async method. It accepts a `pydub.AudioSegment` audio object and transcribes it asynchronously. The `pydub` library enables easy audio loading and manipulation and is already a dependency of this project -- no additional installation is required.
```python
import asyncio

from pydub import AudioSegment

from async_whisper import AsyncWhisper

OPENAI_API_KEY = "<your-openai-api-key>"
AUDIO_FILEPATH = "/path/to/audio/file.mp3"

async def main():
    # Load the audio file with pydub
    audio_data = AudioSegment.from_mp3(AUDIO_FILEPATH)

    # Transcribe the audio asynchronously
    whisper_client = AsyncWhisper(OPENAI_API_KEY)
    transcription = await whisper_client.transcribe_audio(audio_data)

    print(f"Transcription: {transcription}")

if __name__ == "__main__":
    asyncio.run(main())
```
It's as simple as that!
More information on how to fine-tune the configuration of the `AsyncWhisper` class can be found in the DOCUMENTATION.md file.
At a high level, how this library works is conceptually simple. The audio recording is split into chunks of 60 seconds, where adjacent chunks share overlapping portions of 10 seconds. That is, a 3-minute audio recording (180s) will be split into 4 chunks whose boundaries are: [0 - 60, 50 - 110, 100 - 160, 150 - 180] seconds.
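For illustration, a rough sketch of this overlapping split using `pydub` might look like the following. This is not the library's internal code; the 60-second chunk length and 10-second overlap simply mirror the defaults described above.

```python
from pydub import AudioSegment

CHUNK_MS = 60_000    # 60-second chunks
OVERLAP_MS = 10_000  # adjacent chunks share 10 seconds of audio

def split_with_overlap(audio: AudioSegment) -> list[AudioSegment]:
    """Split `audio` into CHUNK_MS-long chunks that overlap by OVERLAP_MS."""
    step = CHUNK_MS - OVERLAP_MS  # each chunk starts 50s after the previous one
    chunks = []
    for start in range(0, len(audio), step):  # len(audio) is in milliseconds
        chunks.append(audio[start:start + CHUNK_MS])
        if start + CHUNK_MS >= len(audio):
            break  # this chunk already reaches the end of the recording
    return chunks
```

For a 180-second recording this yields exactly the four chunks [0 - 60, 50 - 110, 100 - 160, 150 - 180] from the example above.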
Each audio chunk is then transcribed asynchronously with OpenAI's Whisper, all chunks in parallel.
Then, using the overlapping portions, adjacent transcribed chunks are intelligently joined. The joining process works by minimizing the Levenshtein distance between the transcriptions of the overlapping audio segments. Since Whisper transcribes the same audio data in these overlapping sections, it yields similar, though not always identical, transcription outputs. Minimizing the Levenshtein distance is a good heuristic for merging two similar, but not necessarily identical, strings, as sketched below.
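As a loose illustration of that heuristic, one could look for the word boundary where the tail of one chunk's transcription best matches the head of the next chunk's transcription and drop the duplicated words. This is a simplified sketch, not the library's exact merging algorithm, and it assumes the third-party `Levenshtein` package for the distance computation.

```python
import Levenshtein  # assumed helper package; any Levenshtein implementation works

def merge_transcriptions(left: str, right: str, max_overlap_words: int = 30) -> str:
    """Join two chunk transcriptions whose underlying audio overlapped."""
    left_words, right_words = left.split(), right.split()

    best_drop, best_score = 0, float("inf")
    # Compare the last `i` words of `left` against the first `i` words of `right`
    # and keep the alignment with the smallest (length-normalized) distance.
    for i in range(3, min(max_overlap_words, len(left_words), len(right_words)) + 1):
        tail = " ".join(left_words[-i:])
        head = " ".join(right_words[:i])
        score = Levenshtein.distance(tail, head) / max(len(tail), len(head))
        if score < best_score:
            best_drop, best_score = i, score

    # Keep `left` intact and drop the duplicated words from the start of `right`.
    return " ".join(left_words + right_words[best_drop:])
```

The real joining logic is more careful than this, but the core idea is the same: the overlap gives each pair of adjacent chunks a shared region of text to align on.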
In the performance evaluation, `async-whisper` is compared against simply transcribing the entire audio in a single request to Whisper (the baseline). Execution time, as well as the output similarity of `async-whisper` to the baseline measured with the Levenshtein distance, are presented.
Similarity scores of ≥ 97% indicate essentially semantically identical transcriptions, with differences only in punctuation.
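The exact formula behind the similarity scores below is not reproduced here, but a common way to turn a Levenshtein distance into a percentage similarity (shown purely for orientation, again assuming the `Levenshtein` package) is:

```python
import Levenshtein  # assumed helper package for the distance computation

def similarity_pct(a: str, b: str) -> float:
    """Percentage similarity: 100% means identical strings."""
    if not a and not b:
        return 100.0
    return 100.0 * (1.0 - Levenshtein.distance(a, b) / max(len(a), len(b)))
```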
| Baseline | async-whisper | Similarity Score |
| --- | --- | --- |
| 10.99s | 10.02s | 100% |

| Baseline | async-whisper | Similarity Score |
| --- | --- | --- |
| 32.49s | 7.71s | 97% |

| Baseline | async-whisper | Similarity Score |
| --- | --- | --- |
| 56.31s | 11.03s | 97% |