Introduction

The goal of the framework is to provide easy access to and use of something called a classifier and the ecosystem of artifacts that surround it. A classifier can be thought of as a black box that analyzes a piece of data and assigns one or more classes (i.e. categories, labels, types, etc.) to it. This is useful in many domains including manufacturing, security, healthcare, finance, agriculture and retail, to name just a few. Some classifiers are implemented with a set of fixed, user-defined rules that determine which category a piece of data is assigned; for example, an image with a high proportion of red pixels might indicate a fire situation. Other classifiers, and the ones this library focuses on, automatically learn the rules from a set of exemplar (i.e. training) data. The training data is generally a large set of datum instances, each associated with one or more labels (we use the term label instead of category as that is the term primarily used in this domain). For example, in image recognition, each image would be labeled with the set of objects it contains, and the classifier would learn the mapping of images to objects. After training, the classifier can analyze new data it has not been trained on and produce the set of labels (e.g. objects in the image) that should be associated with the given data.

Data is a key concept when considering classifiers. The type (i.e. integer, double, complex, etc.) and shape (i.e. vector, matrix, etc.) of the data, and how it is labeled and stored, are all key aspects that must be considered. The framework is designed to allow generality of type and shape, but the implementations focus on arrays of double values (i.e. double[]). This allows broad support of arbitrary data sets and lets us focus on segments (i.e. time windows) of scalar (i.e. sound or other sensor) data. Storage and organization are particularly important because the training data sets are ideally very large. Given these large data sizes, streaming and/or distributed computation over the data during training is important, if not required. The framework is designed to enable and encourage streaming and distributed computing where possible.
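As a rough illustration of this data representation (the class and field names below are hypothetical, not the framework's API), a labeled segment of scalar sensor data can be pictured as a window of double values paired with its sampling rate and labels:

```java
import java.util.Map;

/**
 * Hypothetical illustration of a labeled segment of scalar sensor data:
 * a window of double samples plus the label(s) attached to it.
 * Names and structure are illustrative only, not the framework's API.
 */
public class LabeledSegment {
    private final double[] samples;           // one time window of audio or sensor samples
    private final double samplingRate;        // samples per second, e.g. 44100
    private final Map<String, String> labels; // e.g. {"status" -> "normal"}

    public LabeledSegment(double[] samples, double samplingRate, Map<String, String> labels) {
        this.samples = samples;
        this.samplingRate = samplingRate;
        this.labels = labels;
    }

    public double[] getSamples()           { return samples; }
    public double getSamplingRate()        { return samplingRate; }
    public Map<String, String> getLabels() { return labels; }
}
```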

Data Pipeline

Generally speaking, each model/classifier is trained and performs classifications on features extracted from the data stream.
To accommodate audio and other high-sampling-rate data streams, the framework implements a feature extraction and train/classify pipeline as outlined in the steps below.

The following steps are performed (a simplified sketch of the pipeline follows the list):

  1. Normalization of the signal segment to a scaled range of -1 to 1.
  2. If this is a training invocation of the pipeline, the signals can optionally be augmented to produce zero or more additional training samples.
  3. Sub-windows are extracted from the signal. For audio at 44 kHz sampling rates, sub-window sizes of 25 to 100 msec are typical. Note that the framework uses time, not the number of samples, to specify the sub-window size.
  4. One or more features are extracted from each sub-window. Features such as Mel-Frequency Filter Bank (MFFB), Mel-Frequency Cepstral Coefficients (MFCC) and LogMel are typically used (see references below).
    Each feature typically produces a vector of values.
  5. The feature vectors are concatenated through time to produce a matrix of features vs. time. This is often referred to as a spectrogram (or featuregram in the framework's terminology).
  6. The featuregram(s) is/are then optionally processed with a feature processor that operates on the full featuregram. Normalization/scaling and time derivatives are common feature processors (see reference 1 below).
  7. The final featuregram(s) is/are passed to the algorithm for either training or inference, depending on the operation. In the case of training, the featuregrams include the original labels attached to the signal segment. In the case of inference/classification, the label(s) are produced by the trained model for the unlabeled segment.
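The following sketch illustrates, in plain Java, how such a pipeline might be wired together: normalize a segment, cut it into time-based sub-windows, extract a feature vector per sub-window, and stack the vectors into a featuregram. It is a simplified illustration only, not the framework's API, and the feature extraction step is a placeholder for MFCC/MFFB/LogMel.

```java
/**
 * Simplified sketch of the feature extraction pipeline (not the framework's API).
 * A real implementation would plug in MFCC/MFFB/LogMel extraction and feature processors.
 */
public class PipelineSketch {

    /** Scale the segment into the range [-1, 1] by its maximum absolute value. */
    static double[] normalize(double[] segment) {
        double max = 0;
        for (double s : segment)
            max = Math.max(max, Math.abs(s));
        double[] out = new double[segment.length];
        for (int i = 0; i < segment.length; i++)
            out[i] = max == 0 ? 0 : segment[i] / max;
        return out;
    }

    /** Cut the segment into sub-windows of the given duration (specified in time, not samples). */
    static double[][] subWindows(double[] segment, double samplingRate, double windowMsec) {
        int windowSamples = (int) Math.round(samplingRate * windowMsec / 1000.0);
        int count = segment.length / windowSamples;
        double[][] windows = new double[count][windowSamples];
        for (int w = 0; w < count; w++)
            System.arraycopy(segment, w * windowSamples, windows[w], 0, windowSamples);
        return windows;
    }

    /** Placeholder feature: stands in for MFCC/MFFB/LogMel extraction over one sub-window. */
    static double[] extractFeature(double[] window) {
        double sum = 0;
        for (double s : window)
            sum += s * s;
        return new double[] { sum / window.length };  // mean power of the sub-window
    }

    /** Build a featuregram: one feature vector per sub-window, concatenated through time. */
    static double[][] featuregram(double[] segment, double samplingRate, double windowMsec) {
        double[] normalized = normalize(segment);
        double[][] windows = subWindows(normalized, samplingRate, windowMsec);
        double[][] gram = new double[windows.length][];
        for (int w = 0; w < windows.length; w++)
            gram[w] = extractFeature(windows[w]);
        return gram;  // shape: [time][feature]
    }
}
```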

The framework currently includes both shallow (KNN and GMM) and deep (CNN, DCASE, MVI) classifiers. An anomaly detection framework is also available.

Classifiers

Classifiers are trained to learn the mapping of featuregrams to label values. Once trained, a classifier can produce a label value from an unlabeled featuregram. Classifiers can be categorized as either shallow or deep. Shallow models are generally fast to train (less than 1% of the total training data length for a 44 kHz sample rate) and do not build a deep understanding of the relationships between the features. Deep models generally take anywhere from 25% to 200% of the training data length to train and perform much more feature analysis. These are generally neural-network-based models that can benefit from GPU acceleration. Implementations are discussed in the following sections, and details of the default configurations can be seen with the ls-models CLI (e.g. ls-models gmm). Note that some classifiers (non-anomaly detectors) will produce the empty string ("") as the classification value if the model identifies the sample as out-of-distribution relative to the training data; the KNN models support this.
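As a purely hypothetical illustration of consuming such a classification result (nothing below is the framework's documented API):

```java
// Hypothetical handling of a classification result; how the label is obtained
// from a trained model is not shown here and is not the framework's documented API.
static void handleResult(String label) {
    if (label.isEmpty()) {
        // Empty string: the model judged the sample out-of-distribution
        // relative to its training data (supported by the KNN models).
        System.out.println("Sample not recognized (out-of-distribution).");
    } else {
        System.out.println("Sample classified as: " + label);
    }
}
```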

Shallow Models

Shallow models are quick to train and can provide acceptable accuracy.

Nearest Neighbor (*knn)

The KNN implementations accumulate the training featuregrams and, at inference/classification time, find the training featuregram that is closest to the sample featuregram and apply the associated label. Various distance metrics (Lp, L1 and Euclidean) are available. The lpknn averages featuregrams through time so that distances are computed between feature vectors; this avoids issues with time shifting. The KNN implementations also have the ability to declare outliers using the empty string ("") classification value.
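A minimal sketch of this approach, assuming the featuregrams have already been computed (illustrative only, not the framework's KNN implementation):

```java
import java.util.List;

/**
 * Sketch of nearest-neighbor classification over time-averaged featuregrams
 * (illustrative only; not the framework's KNN implementation).
 */
public class KnnSketch {

    /** Average a featuregram (shape [time][feature]) over time into a single feature vector. */
    static double[] averageOverTime(double[][] featuregram) {
        double[] avg = new double[featuregram[0].length];
        for (double[] frame : featuregram)
            for (int i = 0; i < frame.length; i++)
                avg[i] += frame[i] / featuregram.length;
        return avg;
    }

    /** Euclidean (L2) distance between two feature vectors. */
    static double euclidean(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++)
            sum += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(sum);
    }

    /**
     * 1-nearest-neighbor over time-averaged training featuregrams. Returns the label of the
     * closest training vector, or "" when the closest distance exceeds an outlier threshold
     * (mirroring the empty-string outlier convention described above).
     */
    static String classify(double[][] sampleFeaturegram,
                           List<double[]> trainingVectors, List<String> trainingLabels,
                           double outlierThreshold) {
        double[] sample = averageOverTime(sampleFeaturegram);
        String bestLabel = "";
        double bestDist = Double.MAX_VALUE;
        for (int i = 0; i < trainingVectors.size(); i++) {
            double d = euclidean(sample, trainingVectors.get(i));
            if (d < bestDist) {
                bestDist = d;
                bestLabel = trainingLabels.get(i);
            }
        }
        return bestDist <= outlierThreshold ? bestLabel : "";
    }
}
```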

Gaussian Mixture Model (gmm)
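As general background rather than a description of this framework's specific implementation, a GMM classifier typically fits a mixture of Gaussian distributions to each label's feature vectors and, at classification time, assigns the label whose mixture yields the highest likelihood. A minimal scoring sketch, assuming diagonal covariances and already-fitted mixture parameters:

```java
import java.util.Map;

/**
 * Minimal GMM scoring sketch (general background only, not this framework's implementation).
 * Assumes per-label mixtures with diagonal covariances have already been fit.
 */
public class GmmSketch {

    /** One mixture component with diagonal covariance. */
    static class Component {
        double weight;     // mixing weight
        double[] mean;     // per-dimension mean
        double[] variance; // per-dimension variance
    }

    /** Log-likelihood of x under one weighted diagonal-covariance Gaussian component. */
    static double componentLogLikelihood(double[] x, Component c) {
        double ll = Math.log(c.weight);
        for (int i = 0; i < x.length; i++) {
            double diff = x[i] - c.mean[i];
            ll += -0.5 * (Math.log(2 * Math.PI * c.variance[i]) + diff * diff / c.variance[i]);
        }
        return ll;
    }

    /** Log-sum-exp over the components gives the mixture log-likelihood. */
    static double mixtureLogLikelihood(double[] x, Component[] mixture) {
        double max = Double.NEGATIVE_INFINITY;
        double[] lls = new double[mixture.length];
        for (int k = 0; k < mixture.length; k++) {
            lls[k] = componentLogLikelihood(x, mixture[k]);
            max = Math.max(max, lls[k]);
        }
        double sum = 0;
        for (double ll : lls)
            sum += Math.exp(ll - max);
        return max + Math.log(sum);
    }

    /** Assign the label whose mixture gives the highest likelihood for the feature vector. */
    static String classify(double[] featureVector, Map<String, Component[]> mixturesByLabel) {
        String best = null;
        double bestLl = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Component[]> e : mixturesByLabel.entrySet()) {
            double ll = mixtureLogLikelihood(featureVector, e.getValue());
            if (ll > bestLl) {
                bestLl = ll;
                best = e.getKey();
            }
        }
        return best;
    }
}
```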

Deep Models

Deep models are generally neural network based and require 10x or more training time. Classification times of the current implementations are comparable to those of the shallow models.

Convolutional Neural Network (cnn)

This is a Deeplearning4j (DL4J)-based algorithm that applies convolutions across the featuregram to learn and classify samples. It can benefit from GPU acceleration during training.
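The configuration below is a generic DL4J sketch that treats a featuregram as a single-channel image, purely to illustrate the idea; the layer structure and hyperparameters are assumptions and not the framework's actual network:

```java
import org.deeplearning4j.nn.conf.MultiLayerConfiguration;
import org.deeplearning4j.nn.conf.NeuralNetConfiguration;
import org.deeplearning4j.nn.conf.inputs.InputType;
import org.deeplearning4j.nn.conf.layers.ConvolutionLayer;
import org.deeplearning4j.nn.conf.layers.OutputLayer;
import org.deeplearning4j.nn.conf.layers.SubsamplingLayer;
import org.deeplearning4j.nn.multilayer.MultiLayerNetwork;
import org.nd4j.linalg.activations.Activation;
import org.nd4j.linalg.lossfunctions.LossFunctions;

/**
 * Generic DL4J configuration treating a featuregram as a 1-channel image
 * (illustrative only; not the framework's actual network architecture).
 */
public class CnnSketch {

    static MultiLayerNetwork build(int numFeatures, int numSubWindows, int numLabels) {
        MultiLayerConfiguration conf = new NeuralNetConfiguration.Builder()
                .list()
                // Convolve a small kernel across the feature-vs-time matrix.
                .layer(new ConvolutionLayer.Builder(3, 3)
                        .nOut(16)
                        .activation(Activation.RELU)
                        .build())
                // Down-sample with max pooling.
                .layer(new SubsamplingLayer.Builder(SubsamplingLayer.PoolingType.MAX)
                        .kernelSize(2, 2)
                        .stride(2, 2)
                        .build())
                // Softmax output over the label values.
                .layer(new OutputLayer.Builder(LossFunctions.LossFunction.NEGATIVELOGLIKELIHOOD)
                        .nOut(numLabels)
                        .activation(Activation.SOFTMAX)
                        .build())
                // Featuregram treated as height x width x 1 channel.
                .setInputType(InputType.convolutional(numFeatures, numSubWindows, 1))
                .build();
        MultiLayerNetwork net = new MultiLayerNetwork(conf);
        net.init();
        return net;
    }
}
```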

DCASE (DCASE)

This is an extension of the CNN model that implements the network that won the DCASE 2018 Task 5 competition.

Maximo Visual Inspection (mvi)

This implementation provides a bridge to a Maximo Visual Inspection server to upload featuregrams as images, train a GoogLeNet model and perform classifications. See the MVI page for how to connect to the MVI server.

Anomaly Detectors

Anomaly detection is a special case of classification in that these models produce

  1. An anomaly score between 0 and 1, with a greater value indicating a higher likelihood of being an anomaly.
  2. Optionally, a state label with values normal and abnormal.

Algorithms may or may not require abnormal training samples. When no abnormal training data is provided, the model will often only be able to produce an anomaly score.
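As a hypothetical illustration of consuming this output (the type, field names and threshold below are assumptions, not the framework's API):

```java
/** Hypothetical anomaly result; the type, fields and threshold are illustrative assumptions. */
public class AnomalyResultSketch {
    double score;  // anomaly score in [0, 1]; higher means more likely anomalous
    String state;  // optional: "normal" or "abnormal", if the model can produce it

    /** Derive a normal/abnormal state from the score when the model only supplies a score. */
    static String stateFromScore(double score, double threshold) {
        return score >= threshold ? "abnormal" : "normal";
    }
}
```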

Normal Distribution Anomaly Detector (normal-dist-anomaly)

The Normal Distribution Anomaly Detector is an experimental model that learns a normal distribution for each feature vector element (double value) across a spectrogram.
It can be trained on only normal data or both normal and abnormal data. It uses voting across the vector of normal distributions to declare an anomaly.
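A rough sketch of that idea (illustrative only, not the actual implementation): fit a mean and standard deviation for each featuregram element from normal training data, flag elements that fall more than some number of standard deviations from their mean, and use the fraction of flagged elements as the score.

```java
/**
 * Rough sketch of per-element normal-distribution anomaly scoring
 * (illustrative only; not the framework's actual implementation).
 */
public class NormalDistSketch {
    double[] mean;   // per-element mean learned from normal training featuregrams (flattened)
    double[] stddev; // per-element standard deviation

    /** Fraction of elements lying more than k standard deviations from their mean. */
    double anomalyScore(double[] flattenedFeaturegram, double k) {
        int votes = 0;
        for (int i = 0; i < flattenedFeaturegram.length; i++) {
            double dev = Math.abs(flattenedFeaturegram[i] - mean[i]);
            if (stddev[i] > 0 && dev > k * stddev[i])
                votes++;
        }
        return (double) votes / flattenedFeaturegram.length;  // score in [0, 1]
    }
}
```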

References

  1. Mel Frequency Cepstral Coefficient (MFCC) tutorial
  2. Mel-frequency cepstrum
  3. Speech Processing for Machine Learning: Filter banks, Mel-Frequency Cepstral Coefficients (MFCCs) and What's In-Between