Commit

Merge pull request #37 from linto-ai/next

Next
Jeronymous authored Feb 19, 2024
2 parents 8efde69 + 3de194a commit 399c98d
Showing 15 changed files with 156 additions and 62 deletions.
6 changes: 4 additions & 2 deletions README.md
@@ -1,7 +1,9 @@
# LinTO-STT

LinTO-STT is the transcription service within the [LinTO stack](https://github.com/linto-ai/linto-platform-stack),
which can currently work with Speech-To-Text (STT) models.
LinTO-STT is an API for Automatic Speech Recognition (ASR).

LinTO-STT can either be used as a standalone transcription service or deployed within a micro-services infrastructure using a message broker connector.

The following families of STT models are currently supported (please refer to the respective documentation for more details):
* [Kaldi models](kaldi/README.md)
* [Whisper models](whisper/README.md)
2 changes: 1 addition & 1 deletion kaldi/Dockerfile
@@ -1,5 +1,5 @@
FROM python:3.9
LABEL maintainer="irebai@linagora.com, rbaraglia@linagora.com"
LABEL maintainer="[email protected], jlouradour@linagora.com, dgaynullin@linagora.com"

ARG KALDI_MKL

3 changes: 1 addition & 2 deletions kaldi/README.md
@@ -1,7 +1,6 @@
# LinTO-STT-Kaldi

LinTO-STT-Kaldi is the transcription service within the [LinTO stack](https://github.com/linto-ai/linto-platform-stack)
based on Speech-To-Text (STT) models trained with [Kaldi](https://github.com/kaldi-asr/kaldi).
LinTO-STT-Kaldi is an API for Automatic Speech Recognition (ASR) based on models trained with [Kaldi](https://github.com/kaldi-asr/kaldi).

LinTO-STT-Kaldi can either be used as a standalone transcription service or deployed within a micro-services infrastructure using a message broker connector.

2 changes: 1 addition & 1 deletion kaldi/requirements.txt
@@ -2,7 +2,7 @@ celery[redis,auth,msgpack]>=4.4.7
numpy>=1.18.5
flask>=1.1.2
flask-cors>=3.0.10
flask-swagger-ui>=3.36.0
flask-swagger-ui==3.36.0
flask-sock
gevent
gunicorn
4 changes: 3 additions & 1 deletion wait-for-it.sh
@@ -67,6 +67,8 @@ wait_for_wrapper()
return $WAITFORIT_RESULT
}

echo "NOCOMMIT wait-for-it $*"

# process arguments
while [[ $# -gt 0 ]]
do
@@ -173,7 +175,7 @@ fi

if [[ $WAITFORIT_CLI != "" ]]; then
if [[ $WAITFORIT_RESULT -ne 0 && $WAITFORIT_STRICT -eq 1 ]]; then
echoerr "$WAITFORIT_cmdname: strict mode, refusing to execute subprocess"
echoerr "$WAITFORIT_cmdname returns $WAITFORIT_CLI: strict mode, refusing to execute subprocess"
exit $WAITFORIT_RESULT
fi
exec "${WAITFORIT_CLI[@]}"
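
For context, the strict-mode branch above refuses to `exec` the wrapped command when the wait failed. A minimal invocation sketch (the host, port, and wrapped command are illustrative; `--strict` and `--timeout` are the script's standard flags):

```bash
# Wait up to 30 seconds for a TCP endpoint; with --strict, the wrapped
# command after "--" is not executed if the wait times out.
./wait-for-it.sh my-broker:6379 --strict --timeout=30 -- echo "broker is up"
```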
13 changes: 10 additions & 3 deletions whisper/.envdefault
@@ -13,13 +13,18 @@ BROKER_PASS=
# STT MODELING PARAMETERS
############################################

# The model can be a path to a model, or a model name ("tiny", "base", "small", "medium", "large-v1", "large-v2" or "large-v3")
MODEL=medium
# The model can be a path to a model (e.g. "/root/.cache/whisper/large-v3.pt", "/root/.cache/huggingface/hub/models--openai--whisper-large-v3"),
# or a model size ("tiny", "base", "small", "medium", "large-v1", "large-v2" or "large-v3")
# or a HuggingFace model name (e.g. "distil-whisper/distil-large-v2")
MODEL=large-v3

# The language can be in different formats: "en", "en-US", "English", ...
# If not set or set to "*", the language will be detected automatically.
LANGUAGE=*

# Prompt to use for the model. This can be used to provide context to the model, to encourage disfluencies or a special behaviour regarding punctuation and capitalization.
PROMPT=

# An alignment wav2vec model can be used to get word timestamps.
# It can be a path to a model, a language code (fr, en, ...), or "wav2vec" to automatically choose a model for the language
# This option is experimental (and not implemented with ctranslate2).
@@ -30,7 +35,9 @@ LANGUAGE=*
############################################

# Device to use. It can be "cuda" to force/check GPU, "cpu" to force computation on CPU, or a specific GPU ("cuda:0", "cuda:1", ...)
# DEVICE=cuda:0
# DEVICE=cuda
# CUDA_DEVICE_ORDER=PCI_BUS_ID
# CUDA_VISIBLE_DEVICES=0

# Number of threads per worker when running on CPU
OMP_NUM_THREADS=4
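
Taken together, the variables above can be combined like this (a hypothetical `.env` for a single-GPU machine; all values are illustrative):

```bash
# Illustrative whisper/.env excerpt (values are examples only)
MODEL=large-v3
LANGUAGE=fr
PROMPT=
DEVICE=cuda
CUDA_DEVICE_ORDER=PCI_BUS_ID
CUDA_VISIBLE_DEVICES=0
OMP_NUM_THREADS=4
```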
2 changes: 1 addition & 1 deletion whisper/Dockerfile.ctranslate2
@@ -1,5 +1,5 @@
FROM ghcr.io/opennmt/ctranslate2:latest-ubuntu20.04-cuda11.2
LABEL maintainer="[email protected]"
LABEL maintainer="[email protected], jlouradour@linagora.com, dgaynullin@linagora.com"

RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends ffmpeg git

2 changes: 1 addition & 1 deletion whisper/Dockerfile.ctranslate2.cpu
@@ -1,5 +1,5 @@
FROM python:3.9
LABEL maintainer="[email protected]"
LABEL maintainer="[email protected], jlouradour@linagora.com, dgaynullin@linagora.com"

RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends ffmpeg git

2 changes: 1 addition & 1 deletion whisper/Dockerfile.torch
@@ -1,5 +1,5 @@
FROM python:3.9
LABEL maintainer="[email protected]"
LABEL maintainer="[email protected], jlouradour@linagora.com, dgaynullin@linagora.com"

RUN apt-get update && apt-get install -y --no-install-recommends ffmpeg

2 changes: 1 addition & 1 deletion whisper/Dockerfile.torch.cpu
@@ -1,5 +1,5 @@
FROM python:3.9
LABEL maintainer="[email protected]"
LABEL maintainer="[email protected], jlouradour@linagora.com, dgaynullin@linagora.com"

RUN apt-get update && apt-get install -y --no-install-recommends ffmpeg

131 changes: 95 additions & 36 deletions whisper/README.md
@@ -1,17 +1,71 @@
# LinTO-STT-Whisper

LinTO-STT-Whisper is the transcription service within the [LinTO stack](https://github.com/linto-ai/linto-platform-stack)
based on Speech-To-Text (STT) [Whisper models](https://openai.com/research/whisper).
LinTO-STT-Whisper is an API for Automatic Speech Recognition (ASR) based on [Whisper models](https://openai.com/research/whisper).

LinTO-STT-Whisper can either be used as a standalone transcription service or deployed within a micro-services infrastructure using a message broker connector.

## Pre-requisites

### Requirements

The transcription service requires [docker](https://www.docker.com/products/docker-desktop/) to be up and running.

For GPU capabilities, you also need to install
[nvidia-container-toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html).

### Hardware

To run the transcription models you'll need:
* At least 8Go of disk space to build the docker image.
* At least 8GB of disk space to build the docker image
and models can occupy several GB of disk space depending on the model size (it can be up to 5GB).
* Up to 7GB of RAM depending on the model used.
* One CPU per worker. Inference time scales with CPU performance.

On GPU, approximate peak VRAM usage is indicated in the following table
for some model sizes, depending on the backend
(note that the lowest precision supported by the GPU card is automatically chosen when loading the model).
<table border="0">
<tr>
<td rowspan="3"><b>Model size</b></td>
<td colspan="4"><b>Backend and precision</b></td>
</tr>
<tr>
<td colspan="3"><b> [ct2/faster_whisper](whisper/Dockerfile.ctranslate2) </b></td>
<td><b> [torch/whisper_timestamped](whisper/Dockerfile.torch) </b></td>
</tr>
<tr>
<td><b>int8</b></td>
<td><b>float16</b></td>
<td><b>float32</b></td>
<td><b>float32</b></td>
</tr>
<tr>
<td>tiny</td>
<td colspan="3">1.5G</td>
<td>1.5G</td>
</tr>
<!-- <tr>
<td>bofenghuang/whisper-large-v3-french-distil-dec2</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr> -->
<tr>
<td>distil-whisper/distil-large-v2</td>
<td>2.2G</td>
<td>3.2G</td>
<td>4.8G</td>
<td>4.4G</td>
</tr>
<tr>
<td>large (large-v3, ...)</td>
<td>2.8G</td>
<td>4.8G</td>
<td>8.2G</td>
<td>10.4G</td>
</tr>
</table>

### Model(s)

@@ -23,17 +77,15 @@ and can occupy several GB of disk space.

LinTO-STT-Whisper also has the option to work with a wav2vec model to perform word alignment.
The wav2vec model can be specified either
* (TorchAudio) with a string corresponding to a `torchaudio` pipeline (e.g. "WAV2VEC2_ASR_BASE_960H") or
* (HuggingFace's Transformers) with a string corresponding to a HuggingFace repository of a wav2vec model (e.g. "jonatasgrosman/wav2vec2-large-xlsr-53-english"), or
* (TorchAudio) with a string corresponding to a `torchaudio` pipeline (e.g. `WAV2VEC2_ASR_BASE_960H`) or
* (HuggingFace's Transformers) with a string corresponding to a HuggingFace repository of a wav2vec model (e.g. `jonatasgrosman/wav2vec2-large-xlsr-53-english`), or
* (SpeechBrain) with a path corresponding to a folder with a SpeechBrain model

Default wav2vec models are provided for French (fr), English (en), Spanish (es), German (de), Dutch (nl), Japanese (ja), Chinese (zh).

But we advise against using a companion wav2vec alignment model.
It is neither needed nor tested anymore.

### Docker
The transcription service requires docker up and running.

### (micro-service) Service broker and shared folder
In task mode, the only entry point to the STT service is tasks posted on a message broker. Supported message brokers are RabbitMQ, Redis, and Amazon SQS.
@@ -63,14 +115,16 @@ cp whisper/.envdefault whisper/.env
| PARAMETER | DESCRIPTION | EXAMPLE |
|---|---|---|
| SERVICE_MODE | STT serving mode see [Serving mode](#serving-mode) | `http` \| `task` |
| MODEL | Path to a Whisper model, type of Whisper model used, or HuggingFace identifier of a Whisper model. | \<ASR_PATH\> \| `large-v3` \| `distil-whisper/distil-large-v2` \| ... |
| MODEL | Path to a Whisper model, type of Whisper model used, or HuggingFace identifier of a Whisper model. | `large-v3` \| `distil-whisper/distil-large-v2` \| \<ASR_PATH\> \| ... |
| LANGUAGE | (Optional) Language to recognize | `*` \| `fr` \| `fr-FR` \| `French` \| `en` \| `en-US` \| `English` \| ... |
| PROMPT | (Optional) Prompt to use for the Whisper model | `some free text to encourage a certain transcription style (disfluencies, no punctuation, ...)` |
| ALIGNMENT_MODEL | (Optional) Path to the wav2vec model for word alignment, or name of HuggingFace repository or torchaudio pipeline | \<WAV2VEC_PATH\> \| `WAV2VEC2_ASR_BASE_960H` \| `jonatasgrosman/wav2vec2-large-xlsr-53-english` \| ... |
| CONCURRENCY | Maximum number of parallel requests | `3` |
| ALIGNMENT_MODEL | (Optional and deprecated) Path to the wav2vec model for word alignment, or name of HuggingFace repository or torchaudio pipeline | `WAV2VEC2_ASR_BASE_960H` \| `jonatasgrosman/wav2vec2-large-xlsr-53-english` \| \<WAV2VEC_PATH\> \| ... |
| DEVICE | (Optional) Device to use for the model | `cpu` \| `cuda` ... |
| CUDA_VISIBLE_DEVICES | (Optional) GPU device index to use, if several. We also recommend setting `CUDA_DEVICE_ORDER=PCI_BUS_ID` on multi-GPU machines | `0` \| `1` \| `2` \| ... |
| CONCURRENCY | Maximum number of parallel requests | `2` |
| SERVICE_NAME | (For the task mode) queue's name for task processing | `my-stt` |
| SERVICE_BROKER | (For the task mode) URL of the message broker | `redis://my-broker:6379` |
| BROKER_PASS | (For the task mode only) broker password | `my-password` |
| BROKER_PASS | (For the task mode only) broker password | `my-password` \| (empty) |
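
As an illustration, a task-mode configuration combining these variables might look as follows (all values are placeholders taken from the examples above):

```bash
# Illustrative whisper/.env for task mode (values are examples only)
SERVICE_MODE=task
MODEL=large-v3
LANGUAGE=*
CONCURRENCY=2
SERVICE_NAME=my-stt
SERVICE_BROKER=redis://my-broker:6379
BROKER_PASS=my-password
```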

#### MODEL environment variable

@@ -79,7 +133,7 @@ The model will be (downloaded if required and) loaded in memory when calling the
When using a Whisper model from Hugging Face (transformers) along with ctranslate2 (faster_whisper),
it will also download the torch library to perform the conversion from torch to ctranslate2.

If you want to preload the model (and later specify a path `ASR_PATH` as `MODEL`),
If you want to preload the model (and later specify a path `<ASR_PATH>` as `MODEL`),
you may want to download one of the OpenAI Whisper models:
* Multi-lingual Whisper models can be downloaded with the following links:
* [tiny](https://openaipublic.azureedge.net/main/whisper/models/65147644a518d12f04e32d6f3b26facc3f8dd46e5390956a9424a650c0ce22b9/tiny.pt)
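
For instance, a minimal sketch of preloading a checkpoint (the URL is the "tiny" link above; the local path is illustrative):

```bash
# Download the multilingual "tiny" checkpoint listed above (local path is illustrative)
mkdir -p ~/models/whisper
curl -L -o ~/models/whisper/tiny.pt \
  "https://openaipublic.azureedge.net/main/whisper/models/65147644a518d12f04e32d6f3b26facc3f8dd46e5390956a9424a650c0ce22b9/tiny.pt"
# The file can then be mounted into the container and referenced with MODEL=/opt/model.pt
```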
@@ -144,26 +198,28 @@ The SERVICE_MODE value in the .env should be set to ```http```.
```bash
docker run --rm \
-p HOST_SERVING_PORT:80 \
-v ASR_PATH:/opt/model.pt \
--env-file whisper/.env \
linto-stt-whisper:latest
```

This will run a container providing an [HTTP API](#http-api) bound to port HOST_SERVING_PORT on the host.

You may also want to mount your cache folder CACHE_PATH (e.g. "~/.cache") ```-v CACHE_PATH:/root/.cache```
in order to avoid downloading models each time.

Also if you want to specify a custom alignment model already downloaded in a folder WAV2VEC_PATH,
you can add option ```-v WAV2VEC_PATH:/opt/wav2vec``` and environment variable ```ALIGNMENT_MODEL=/opt/wav2vec```.
You may also want to add specific options (see the combined example after this list):
* To enable GPU capabilities, add ```--gpus all```.
Note that you can use the environment variable `DEVICE=cuda` to make sure the GPU is used (and maybe set `CUDA_VISIBLE_DEVICES` if several GPU cards are available).
* To mount a local cache folder `<CACHE_PATH>` (e.g. "`$HOME/.cache`") and avoid downloading models each time,
use ```-v <CACHE_PATH>:/root/.cache```.
If you use the `MODEL=/opt/model.pt` environment variable, you may want to mount the model file (or folder) with option ```-v <ASR_PATH>:/opt/model.pt```.
* If you want to specify a custom alignment model already downloaded in a folder `<WAV2VEC_PATH>`,
you can add option ```-v <WAV2VEC_PATH>:/opt/wav2vec``` and environment variable ```ALIGNMENT_MODEL=/opt/wav2vec```.
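
Putting these options together, a hypothetical GPU launch with a persistent model cache could look like this (the port and paths are placeholders):

```bash
# Illustrative: HTTP mode on GPU, with a mounted cache folder
docker run --rm \
    -p 8080:80 \
    --gpus all \
    -v $HOME/.cache:/root/.cache \
    --env-file whisper/.env \
    linto-stt-whisper:latest
```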

**Parameters:**
| Variables | Description | Example |
|:-|:-|:-|
| HOST_SERVING_PORT | Host serving port | 8080 |
| ASR_PATH | Path to the Whisper model on the host machine mounted to /opt/model.pt | /my/path/to/models/medium.pt |
| CACHE_PATH | (Optional) Path to a folder to download wav2vec alignment models when relevant | /home/username/.cache |
| WAV2VEC_PATH | (Optional) Path to a folder to a custom wav2vec alignment model | /my/path/to/models/wav2vec |
| `HOST_SERVING_PORT` | Host serving port | 8080 |
| `<CACHE_PATH>` | (Optional) Path to a folder to download wav2vec alignment models when relevant | /home/username/.cache |
| `<ASR_PATH>` | Path to the Whisper model on the host machine mounted to /opt/model.pt | /my/path/to/models/medium.pt |
| `<WAV2VEC_PATH>` | (Optional) Path to a folder containing a custom wav2vec alignment model | /my/path/to/models/wav2vec |

### Micro-service within LinTO-Platform stack
The TASK serving mode connects a celery worker to a message broker.
Expand All @@ -174,25 +230,27 @@ You need a message broker up and running at MY_SERVICE_BROKER.

```bash
docker run --rm \
-v ASR_PATH:/opt/model.pt \
-v SHARED_AUDIO_FOLDER:/opt/audio \
--env-file whisper/.env \
linto-stt-whisper:latest
```

You may also want to mount your cache folder CACHE_PATH (e.g. "~/.cache") ```-v CACHE_PATH:/root/.cache```
in order to avoid downloading models each time.

Also if you want to specify a custom alignment model already downloaded in a folder WAV2VEC_PATH,
you can add option ```-v WAV2VEC_PATH:/opt/wav2vec``` and environment variable ```ALIGNMENT_MODEL=/opt/wav2vec```.
You may also want to add specific options (see the combined example after this list):
* To enable GPU capabilities, add ```--gpus all```.
Note that you can use the environment variable `DEVICE=cuda` to make sure the GPU is used (and maybe set `CUDA_VISIBLE_DEVICES` if several GPU cards are available).
* To mount a local cache folder `<CACHE_PATH>` (e.g. "`$HOME/.cache`") and avoid downloading models each time,
use ```-v <CACHE_PATH>:/root/.cache```.
If you use the `MODEL=/opt/model.pt` environment variable, you may want to mount the model file (or folder) with option ```-v <ASR_PATH>:/opt/model.pt```.
* If you want to specify a custom alignment model already downloaded in a folder `<WAV2VEC_PATH>`,
you can add option ```-v <WAV2VEC_PATH>:/opt/wav2vec``` and environment variable ```ALIGNMENT_MODEL=/opt/wav2vec```.
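
Putting these options together, a hypothetical task-mode launch could look like this (paths are placeholders):

```bash
# Illustrative: task mode on GPU, with a shared audio folder and a mounted cache
docker run --rm \
    --gpus all \
    -v /my/path/to/audio:/opt/audio \
    -v $HOME/.cache:/root/.cache \
    --env-file whisper/.env \
    linto-stt-whisper:latest
```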

**Parameters:**
| Variables | Description | Example |
|:-|:-|:-|
| SHARED_AUDIO_FOLDER | Shared audio folder mounted to /opt/audio | /my/path/to/models/vosk-model |
| ASR_PATH | Path to the Whisper model on the host machine mounted to /opt/model.pt | /my/path/to/models/medium.pt |
| CACHE_PATH | (Optional) Path to a folder to download wav2vec alignment models when relevant | /home/username/.cache |
| WAV2VEC_PATH | (Optional) Path to a folder to a custom wav2vec alignment model | /my/path/to/models/wav2vec |
| `<SHARED_AUDIO_FOLDER>` | Shared audio folder mounted to /opt/audio | /my/path/to/audio |
| `<CACHE_PATH>` | (Optional) Path to a folder to download wav2vec alignment models when relevant | /home/username/.cache |
| `<ASR_PATH>` | Path to the Whisper model on the host machine mounted to /opt/model.pt | /my/path/to/models/medium.pt |
| `<WAV2VEC_PATH>` | (Optional) Path to a folder containing a custom wav2vec alignment model | /my/path/to/models/wav2vec |


## Usages
@@ -274,9 +332,10 @@ This project is developed under the AGPLv3 License (see LICENSE).

## Acknowledgment.

* [Faster Whisper](https://github.com/SYSTRAN/faster-whisper)
* [OpenAI Whisper](https://github.com/openai/whisper)
* [Ctranslate2](https://github.com/OpenNMT/CTranslate2)
* [Faster-Whisper](https://github.com/SYSTRAN/faster-whisper)
* [OpenAI Whisper](https://github.com/openai/whisper)
* [Whisper-Timestamped](https://github.com/linto-ai/whisper-timestamped)
* [HuggingFace Transformers](https://github.com/huggingface/transformers)
* [SpeechBrain](https://github.com/speechbrain/speechbrain)
* [TorchAudio](https://github.com/pytorch/audio)
* [HuggingFace Transformers](https://github.com/huggingface/transformers)
4 changes: 4 additions & 0 deletions whisper/RELEASE.md
@@ -1,3 +1,7 @@
# 1.0.1
- support of model.safetensors
- ct2/faster_whisper: Information about the precision used is added in the logs

# 1.0.0
- First build of linto-stt-whisper
- Based on 4.0.5 of linto-stt https://github.com/linto-ai/linto-stt/blob/a54b7b7ac2bc491a1795bb6dfb318a39c8b76d63/RELEASE.md
2 changes: 1 addition & 1 deletion whisper/requirements.ctranslate2.txt
@@ -2,7 +2,7 @@ celery[redis,auth,msgpack]>=4.4.7
flask>=1.1.2
flask-cors>=3.0.10
flask-sock
flask-swagger-ui>=3.36.0
flask-swagger-ui==3.36.0
gevent
gunicorn
lockfile
5 changes: 2 additions & 3 deletions whisper/requirements.torch.txt
@@ -2,7 +2,7 @@ celery[redis,auth,msgpack]>=4.4.7
flask>=1.1.2
flask-cors>=3.0.10
flask-sock
flask-swagger-ui>=3.36.0
flask-swagger-ui==3.36.0
gevent
gunicorn
lockfile
@@ -13,7 +13,6 @@ speechbrain
transformers
wavio>=0.0.4
websockets
# openai-whisper
git+https://github.com/linto-ai/whisper-timestamped.git
whisper-timestamped
onnxruntime
torchaudio