Merge branch 'next'

linto-ai · Nov 3, 2022 · 07cb3cf
2 parents 97b23f5 + b92e934
commit 07cb3cf
Showing 20 changed files with 795 additions and 427 deletions.
diff --git a/.env_default_http b/.env_default_http
@@ -0,0 +1,8 @@
+# SERVING PARAMETERS
+SERVICE_MODE=http
+
+# SERVICE DISCOVERY
+SERVICE_NAME=MY_PUNCTUATION_SERVICE
+
+# CONCURRENCY
+CONCURRENCY=2
diff --git a/.env_default_task b/.env_default_task
@@ -0,0 +1,15 @@
+# SERVING PARAMETERS
+SERVICE_MODE=task
+
+# SERVICE PARAMETERS
+SERVICES_BROKER=redis://192.168.0.1:6379
+BROKER_PASS=password
+
+# SERVICE DISCOVERY
+SERVICE_NAME=my-diarization-service
+LANGUAGE=en-US/fr-FR/*
+QUEUE_NAME=(Optional)
+MODEL_INFO=This model does something
+
+# CONCURRENCY
+CONCURRENCY=2
diff --git a/Dockerfile b/Dockerfile
@@ -1,5 +1,5 @@
-FROM python:3.9
-LABEL maintainer="[email protected], [email protected], [email protected]"
+FROM python:3.10
+LABEL maintainer="[email protected], [email protected]"
 
 RUN apt-get update &&\
     apt-get install -y \
@@ -31,6 +31,10 @@ COPY document /usr/src/app/document
 COPY pyBK/diarizationFunctions.py pyBK/diarizationFunctions.py
 COPY docker-entrypoint.sh wait-for-it.sh healthcheck.sh ./
 
+# Grep CURRENT VERSION
+COPY RELEASE.md ./
+RUN export VERSION=$(awk -v RS='' '/#/ {print; exit}' RELEASE.md | head -1 | sed 's/#//' | sed 's/ //')
+
 ENV PYTHONPATH="${PYTHONPATH}:/usr/src/app/diarization"
 
 # Limits on OPENBLAS number of thread prevent SEGFAULT on machine with a large number of cpus
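
The version-extraction pipeline added to the Dockerfile above (`awk` in paragraph mode, then `head -1` and two `sed` substitutions) can be hard to read at a glance. The following is a minimal Python sketch of what that pipeline appears to compute, assuming RELEASE.md starts with a heading such as `# 1.1.2`; the function name `first_release_version` is hypothetical and not part of the repository. Note also that a variable exported inside a `RUN` instruction is only visible within that instruction's shell, so the value would need to be persisted some other way if later build steps or the running container rely on it.

```python
import re


def first_release_version(release_notes: str) -> str:
    """Return the version found on the first heading line of the release notes.

    Mirrors, step by step, the shell pipeline from the Dockerfile hunk.
    """
    # awk -v RS='' splits the input into records separated by blank lines.
    paragraphs = [p for p in re.split(r"\n{2,}", release_notes.strip("\n")) if p]
    for paragraph in paragraphs:
        if "#" in paragraph:  # awk pattern /#/ followed by {print; exit}
            first_line = paragraph.splitlines()[0]       # head -1
            first_line = first_line.replace("#", "", 1)  # sed 's/#//'
            return first_line.replace(" ", "", 1)        # sed 's/ //'
    return ""  # no heading found: the pipeline would print nothing


sample = "# 1.1.2\n- Added service registration.\n# 1.1.1\n- Fixed something\n\n# 1.1.0\n- Changed something\n"
version = first_release_version(sample)
```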

diff --git a/Makefile b/Makefile
@@ -0,0 +1,13 @@
+.DEFAULT_GOAL := help
+
+target_dirs := http_server pyBK diarization celery_app
+
+help:
+	@grep -E '^[a-zA-Z_-]+:.*?## .*$$' $(MAKEFILE_LIST) | sort | awk 'BEGIN {FS = ":.*?## "}; {printf "\033[36m%-30s\033[0m %s\n", $$1, $$2}'
+
+style:	## update code style.
+	black ${target_dirs}
+	isort ${target_dirs}
+
+lint:	## run pylint linter.
+	pylint ${target_dirs}
diff --git a/README.md b/README.md
@@ -1,19 +1,33 @@
 # LINTO-PLATFORM-DIARIZATION
-LinTO-platform-diarization is the speaker diarization service within the [LinTO stack](https://github.com/linto-ai/linto-platform-stack).
+LinTO-platform-diarization is the [LinTO](https://linto.ai/) service for speaker diarization.
 
-LinTO-platform-diarization can either be used as a standalone diarization service or deployed within a micro-services infrastructure using a message broker connector.
+LinTO-platform-diarization can either be used as a standalone diarization service or deployed as a micro-service.
+
+* [Prerequisites](#pre-requisites)
+* [Deploy](#deploy)
+  * [HTTP](#http)
+  * [MicroService](#micro-service)
+* [Usage](#usages)
+  * [HTTP API](#http-api)
+    * [/healthcheck](#healthcheck)
+    * [/diarization](#diarization)
+    * [/docs](#docs)
+  * [Using celery](#using-celery)
+
+* [License](#license)
+***
 
 ## Pre-requisites
 
 ### Docker
 The transcription service requires docker up and running.
 
 ### (micro-service) Service broker and shared folder
-The diarization only entry point in job mode are tasks posted on a message broker. Supported message broker are RabbitMQ, Redis, Amazon SQS.
-On addition, as to prevent large audio from transiting through the message broker, lp-diarization use a shared storage folder.
+In task mode, the only entry point of the diarization service is tasks posted on a Redis message broker.
+Furthermore, to prevent large audio files from transiting through the message broker, diarization uses a shared storage folder mounted on /opt/audio.
 
-## Deploy linto-platform-diarization
-linto-platform-stt can be deployed three ways:
+## Deploy
+linto-platform-diarization can be deployed:
 * As a standalone diarization service through an HTTP API.
 * As a micro-service connected to a message broker.
 
@@ -22,17 +36,31 @@ linto-platform-stt can be deployed three ways:
 ```bash
 git clone https://github.com/linto-ai/linto-platform-diarization.git
 cd linto-platform-diarization
-git submodule init
-git submodule update
 docker build . -t linto-platform-diarization:latest
 ```
 
-### HTTP API
+### HTTP
+
+**1- Fill the .env**
+```bash
+cp .env_default_http .env
+```
+
+Fill the .env with your values.
+
+**Parameters:**
+| Variables | Description | Example |
+|:-|:-|:-|
+| SERVICE_MODE | Specify launch mode | http |
+| CONCURRENCY | Number of HTTP workers* | 1+ |
+
+**2- Run the container**
 
 ```bash
 docker run --rm \
+-v SHARED_FOLDER:/opt/audio \
 -p HOST_SERVING_PORT:80 \
---env SERVICE_MODE=http \
+--env-file .env \
 linto-platform-diarization:latest
 ```
 
@@ -42,37 +70,88 @@ This will run a container providing an http API binded on the host HOST_SERVING_
 | Variables | Description | Example |
 |:-|:-|:-|
 | HOST_SERVING_PORT | Host serving port | 80 |
-| CONCURRENCY | Number of HTTP worker* | 1+ |
 
 > *diarization uses all CPU available, adding workers will share the available CPU thus decreasing processing speed for concurrent requests
 
-### Micro-service within LinTO-Platform stack
->LinTO-platform-diarization can be deployed within the linto-platform-stack through the use of linto-platform-services-manager. Used this way, the container spawn celery worker waiting for diarization task on a message broker.
->LinTO-platform-diarization in task mode is not intended to be launch manually.
->However, if you intent to connect it to your custom message's broker here are the parameters:
+### Using celery
+>LinTO-platform-diarization can be deployed as a micro-service using celery. Used this way, the container spawns celery workers waiting for diarization tasks on a message broker.
 
-You need a message broker up and running at MY_SERVICE_BROKER.
+You need a message broker up and running at SERVICES_BROKER.
 
+**1- Fill the .env**
 ```bash
-docker run --rm \
--v AM_PATH:/opt/models/AM \
--v LM_PATH:/opt/models/LM \
--v SHARED_AUDIO_FOLDER:/opt/audio \
---env SERVICES_BROKER=MY_SERVICE_BROKER \
---env BROKER_PASS=MY_BROKER_PASS \
---env SERVICE_MODE=task \
---env CONCURRENCY=1 \
-linto-platform-diarization:latest
+cp .env_default_task .env
 ```
 
+Fill the .env with your values.
+
 **Parameters:**
 | Variables | Description | Example |
 |:-|:-|:-|
+| SERVICE_MODE | Specify launch mode | task |
 | SERVICES_BROKER | Service broker uri | redis://my_redis_broker:6379 |
 | BROKER_PASS | Service broker password (Leave empty if there is no password) | my_password |
-| CONCURRENCY | Number of celery worker* | 1+ |
+| QUEUE_NAME | (Optional) override the generated queue name (see Queue name below) | my_queue |
+| SERVICE_NAME | Service's name | diarization-ml |
+| LANGUAGE | Language code as a BCP-47 code | en-US or * or languages separated by "\|" |
+| MODEL_INFO | Human-readable description of the model | Multilingual diarization model |
+| CONCURRENCY | Number of workers (1 worker = 1 cpu) | >1 |
+
+**2- Fill the docker-compose.yml**
+
+`#docker-compose.yml`
+```yaml
+version: '3.7'
+
+services:
+  diarization-service:
+    image: linto-platform-diarization:latest
+    volumes:
+      - /path/to/shared/folder:/opt/audio
+    env_file: .env
+    deploy:
+      replicas: 1
+    networks:
+      - your-net
+
+networks:
+  your-net:
+    external: true
+```
+
+**3- Deploy with docker stack**
+
+```bash
+docker stack deploy --resolve-image always --compose-file docker-compose.yml your_stack
+```
+
+**Queue name:**
+
+By default the service queue name is generated using SERVICE_NAME and LANGUAGE: `diarization_{LANGUAGE}_{SERVICE_NAME}`.
+
+The queue name can be overridden using the QUEUE_NAME env variable.
+
+**Service discovery:**
+
+As a micro-service, the instance will register itself in the service registry for discovery. The service information is stored as a JSON object in Redis db0 under the key `service:{HOST_NAME}`.
+
+The following information is registered:
+
+```json
+{
+  "service_name": $SERVICE_NAME,
+  "host_name": $HOST_NAME,
+  "service_type": "diarization",
+  "service_language": $LANGUAGE,
+  "queue_name": $QUEUE_NAME,
+  "version": "1.2.0", # This repository's version
+  "info": "Multilingual diarization model",
+  "last_alive": 65478213,
+  "concurrency": 1
+}
+```
+
 
-> *diarization uses all CPU available, adding workers will share the available CPU thus decreasing processing speed for concurrent requests
 
 ## Usages
 
@@ -92,9 +171,9 @@ Diarization API
 
 * Method: POST
 * Response content: application/json
-* File: An Wave file
-* spk_number: (integer - optional) Number of speakers. If empty, diarization will guess.
-* max_speaker: (interger - optional) Max number of speakers if spk_number is empty. 
+* File: A Wave file
+* spk_number: (integer - optional) Number of speakers. If empty, the number of speakers is determined automatically by clustering.
+* max_speaker: (integer - optional) Max number of speakers if spk_number is unknown.
 
 Return a json object when using structured as followed:
 ```json
@@ -116,7 +195,7 @@ The /docs route offers a OpenAPI/swagger interface.
 ### Through the message broker
 
 STT-Worker accepts requests with the following arguments:
-```file_path: str, with_metadata: bool```
+```file_path: str, speaker_count: int (None), max_speaker: int (None)```
 
 * <ins>file_path</ins>: (str) Is the location of the file within the shared_folder. /.../SHARED_FOLDER/{file_path}
 * <ins>speaker_count</ins>: (int default None) Fixed number of speakers.
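
The queue-naming and service-discovery rules described in the README hunk above can be sketched in a few lines of Python. This is only an illustration of the documented behaviour (`diarization_{LANGUAGE}_{SERVICE_NAME}` unless QUEUE_NAME is set, and a registry key of the form `service:{HOST_NAME}`); the helper names `resolve_queue_name` and `registry_key` are hypothetical, and treating an empty QUEUE_NAME as "not set" is an assumption rather than something the diff confirms.

```python
def resolve_queue_name(env: dict) -> str:
    """Return the queue name following the rule documented in the README."""
    override = (env.get("QUEUE_NAME") or "").strip()
    if override:
        # QUEUE_NAME takes precedence over the generated name.
        return override
    # Default: diarization_{LANGUAGE}_{SERVICE_NAME}
    return f"diarization_{env['LANGUAGE']}_{env['SERVICE_NAME']}"


def registry_key(host_name: str) -> str:
    """Return the key under which the service record is documented to be stored."""
    return f"service:{host_name}"


generated = resolve_queue_name({"LANGUAGE": "fr-FR", "SERVICE_NAME": "diarization-ml"})
overridden = resolve_queue_name(
    {"LANGUAGE": "fr-FR", "SERVICE_NAME": "diarization-ml", "QUEUE_NAME": "my_queue"}
)
key = registry_key("worker-01")
```

In a real deployment the `env` argument would typically be `os.environ`, and the values would come from the `.env` file passed to the container.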

diff --git a/RELEASE.md b/RELEASE.md
@@ -1,3 +1,21 @@
+# 1.1.2
+- Added service registration.
+- Updated healthcheck to add heartbeat.
+- Added possibility to override generated queue name.
+# 1.1.1
+- Fixed: silences (and short occurrences <1 sec between silences) occurring inside a speaker turn were postponed at the end of the speaker turn (and could be arbitrarily assigned to next speaker)
+- Fixed: make diarization deterministic (random seed is fixed)
+- Tune length of short occurrences to consider as silences (0.3 sec)
+
+# 1.1.0
+- Changed: loading audio file by AudioSegment toolbox. 
+- Changed: mfcc are extracted by python_speech_features toolbox.
+- Fixed windowRate =< maximumKBMWindowRate.
+- Likelihood table is only calculated for the top five Gaussians, which reduces computation time.
+- Similarity matrix is calculated by Binary keys and cumulative vectors.
+- Removed: unused AHC.
+- Code formatted to PEP 8.
+
 # 1.0.3
 - Fixed: diarization failing on short audio when n_speaker > 1
 - Fixed (TBT): diarization returning segfault on machine with a lot of CPU

diff --git a/celery_app/celeryapp.py b/celery_app/celeryapp.py
@@ -1,24 +1,24 @@
 import os
+
 from celery import Celery
 
 from diarization import logger
 
-celery = Celery(__name__, include=['celery_app.tasks'])
+celery = Celery(__name__, include=["celery_app.tasks"])
 service_name = os.environ.get("SERVICE_NAME")
 broker_url = os.environ.get("SERVICES_BROKER")
 if os.environ.get("BROKER_PASS", False):
-    components = broker_url.split('//')
+    components = broker_url.split("//")
     broker_url = f'{components[0]}//:{os.environ.get("BROKER_PASS")}@{components[1]}'
 celery.conf.broker_url = "{}/0".format(broker_url)
 celery.conf.result_backend = "{}/1".format(broker_url)
-celery.conf.update(
-    result_expires=3600,
-    task_acks_late=True,
-    task_track_started=True)
+celery.conf.update(result_expires=3600, task_acks_late=True, task_track_started=True)
 
 # Queues
 celery.conf.update(
-    {'task_routes': {
-        'diarization_task': {'queue': 'diarization'}, }
-     }
+    {
+        "task_routes": {
+            "diarization_task": {"queue": "diarization"},
+        }
+    }
 )
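
The broker URL handling in `celery_app/celeryapp.py` above can be isolated into a small pure function, which makes the resulting URLs easy to check without a running broker. The sketch below reproduces the same string logic as the diff (split on `//`, inject `:password@`, then append `/0` for the broker and `/1` for the result backend); the function name `build_celery_urls` is hypothetical. As in the original, the password is inserted verbatim, so a password containing characters such as `@` or `/` would likely need percent-encoding first.

```python
from typing import Optional, Tuple


def build_celery_urls(broker_url: str, broker_pass: Optional[str] = None) -> Tuple[str, str]:
    """Return (broker_url, result_backend) built the same way as celeryapp.py."""
    if broker_pass:
        # "redis://host:6379" -> ["redis:", "host:6379"]
        components = broker_url.split("//")
        broker_url = f"{components[0]}//:{broker_pass}@{components[1]}"
    # Database 0 holds the task queue, database 1 the task results.
    return f"{broker_url}/0", f"{broker_url}/1"


with_pass = build_celery_urls("redis://192.168.0.1:6379", "password")
without_pass = build_celery_urls("redis://192.168.0.1:6379")
```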