
Feature extractors

  1. Feature extractors
  2. Best practices
  3. Pre-processing
  4. Post-processing
  5. Serialization
  6. Time-domain feature extractor
  7. Spectral feature extractor
  8. Pitch
  9. MPEG-7 feature extractor
  10. Filterbank extractor
  11. Chroma feature extractor
  12. MFCC/PNCC
  13. LPC/LPCC
  14. PLP
  15. Wavelet extractor
  16. AMS
  17. Creating your own feature extractor

Feature extractors

All feature extractors inherit from the abstract class FeatureExtractor and provide two methods for computations (with overloaded versions): 1) ComputeFrom(), which takes a float[] array or DiscreteSignal as input (and, optionally, starting and ending positions for analysis), and 2) ProcessFrame(), which takes a float[] array representing one frame as input and fills a float[] array with the features computed for this frame. Extractors can and must be reused: create an extractor object once and call ComputeFrom() (in most cases) or ProcessFrame() every time a new portion of data arrives and new features should be calculated.

Properties:

  • public int FeatureCount
  • abstract List<string> FeatureDescriptions
  • virtual List<string> DeltaFeatureDescriptions
  • virtual List<string> DeltaDeltaFeatureDescriptions
  • double FrameDuration (in seconds)
  • double HopDuration (in seconds)
  • int FrameSize (in samples)
  • int HopSize (in samples)
  • int SamplingRate

Main methods:

  • abstract void ProcessFrame(float[] frame, float[] features)
  • virtual List<float[]> ComputeFrom(float[] samples)
  • virtual List<float[]> ParallelComputeFrom(float[] samples)
  • virtual void Reset()

Feature extractors must be configured before usage. The FeatureExtractorOptions class and its subclasses are responsible for the configuration. These classes simply contain public serializable properties for each parameter of the extractor, plus a virtual Errors property returning the list of strings that describe particular validation problems. The base feature extractor accepts an options object as a constructor parameter and throws an exception if the options are invalid, with all errors merged into the exception message.

Example:

var mfccOptions = new MfccOptions
{
    SamplingRate = 16000,
    FeatureCount = 13,
    FrameDuration = 0.025/*sec*/,
    HopDuration = 0.010/*sec*/,
    PreEmphasis = 0.97,
    Window = WindowTypes.Hamming
};

var mfccExtractor = new MfccExtractor(mfccOptions);
List<float[]> mfccVectors = mfccExtractor.ComputeFrom(signal);


// process only one frame:

var features = new float[13];
mfccExtractor.ProcessFrame(block, features);


// let's say we got some more samples from external source;
// process them with same extractor
// (for demo, let's also ignore first 100 samples):

var newVectors = mfccExtractor.ComputeFrom(samples, 100, samples.Length);

// properties:

var count = mfccExtractor.FeatureCount;          // 13
var names = mfccExtractor.FeatureDescriptions;   // { "mfcc0", "mfcc1", ..., "mfcc12" }
var frameDuration = mfccExtractor.FrameDuration; // 0.025 (seconds)
var hopDuration = mfccExtractor.HopDuration;     // 0.010 (seconds)
var sr = mfccExtractor.SamplingRate;             // 16000 (Hz)
var frameSize = mfccExtractor.FrameSize;         // 400 = 0.025 * 16000
var hopSize = mfccExtractor.HopSize;             // 160 = 0.010 * 16000

All base properties are enumerated below:

  1. SamplingRate (must be specified explicitly, must be > 0)
  2. FrameDuration (0.025 seconds by default, must be > 0)
  3. HopDuration (0.010 seconds by default, must be > 0)
  4. FrameSize (number of samples, auto-computed from frame duration, if not specified explicitly)
  5. HopSize (number of samples, auto-computed from hop duration, if not specified explicitly)
  6. Window (WindowTypes.Rectangular by default (i.e. there's no windowing of frames))
  7. PreEmphasis (0 by default (i.e. no pre-emphasis filter is applied))
  8. FeatureCount (0 by default, since there may be different ways to derive number of features in different feature extractors).

FrameSize-FrameDuration and HopSize-HopDuration are fully interdependent parameters. FrameSize and HopSize have priority over durations. If they are set explicitly then (Frame|Hop)Duration = (Frame|Hop)Size / SamplingRate. Otherwise, FrameSize and HopSize will be derived from FrameDuration and HopDuration.
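
For example (the durations here simply follow the rule above):

var opts = new MfccOptions
{
    SamplingRate = 16000,
    FeatureCount = 13,
    FrameSize = 512,   // FrameDuration becomes 512 / 16000 = 0.032 sec
    HopSize = 160      // HopDuration becomes 160 / 16000 = 0.010 sec
};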

The current configuration can be saved to a JSON file and later loaded back from it. This is especially useful for MFCC/PNCC/PLP extractors that allow customizing a lot of parameters. You may prepare several different configs for permanent usage. Different extractors accept different subclasses of the FeatureExtractorOptions class, but one subtype of options can be cast to another subtype. In this case identical properties will be copied and other properties will simply be ignored.

Example:

// serialize current MFCC config to JSON file:

using (var config = new FileStream("file.json", FileMode.Create))
{
    config.SaveOptions(mfccOptions);
}

// open config from JSON file (cast to PLP config):
PlpOptions options;
using (var config = new FileStream("file.json", FileMode.Open))
{
    // identical fields will be copied, other fields will be ignored
    options = config.LoadOptions<PlpOptions>();
}

// cast (identical fields will be copied, other fields will be ignored)
options = mfccOptions.Cast<MfccOptions, PlpOptions>();

Feature extractors return collection of feature vectors. Prior to ver.0.9.3 it was of type List<FeatureVector>. Each feature vector contained time position (in seconds) and array of features (double TimePosition and float[] Features). As of ver.0.9.3 each feature vector is just an array of floats. Time positions (time markers) can be easily obtained if necessary like this:

var mfccs = mfccExtractor.ComputeFrom(data);
List<double> timeMarkers = mfccExtractor.TimeMarkers(mfccs.Count);

// continue computations on new data:

var lastMarker = timeMarkers.Last();
var nextMarker = lastMarker + mfccExtractor.HopDuration;

mfccs = mfccExtractor.ComputeFrom(newData);
timeMarkers = mfccExtractor.TimeMarkers(mfccs.Count, startFrom: nextMarker);

Also, there's an extension method Statistics() returning a Dictionary with the following keys:

var stats = mfccs[0].Statistics();

var min = stats["min"];
var max = stats["max"];
var avg = stats["mean"];
var variance = stats["var"];

Feature extractors can be parallelizable, i.e. they can create internal clones that simultaneously process different parts of the signal. In this case IsParallelizable() returns true. All available extractors in NWaves are parallelizable (except PnccExtractor, SpnccExtractor and AmsExtractor). For example:

var lpcExtractor = new LpcExtractor(lpcOptions);
var lpcVectors = lpcExtractor.ParallelComputeFrom(signal);

The ParallelComputeFrom method has one additional parameter, parallelThreads. By default parallelThreads = 0, which means that all CPU cores will be involved in computations. Under the hood, ParallelComputeFrom creates (N-1) copies of the current extractor and splits the data into N chunks, so that each copy processes its own chunk (N = number of parallel threads). The method was introduced mainly for processing long signals; for small arrays the overhead outweighs the benefit.
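
For example, a small sketch (assuming the DiscreteSignal overload also exposes the parallelThreads parameter):

var vectors = lpcExtractor.ParallelComputeFrom(signal, parallelThreads: 4);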

Important note

Feature extractors are not thread-safe! The ParallelComputeFrom() method takes this into account (so it's safe), but if you're going to run your own parallel computations, don't use the same extractor instance in different threads.

Best practices

In case of offline processing everything is simple: just call ComputeFrom() or ParallelComputeFrom() method. Say, you need to extract MFCC features from the set of files and save them to csv files. You can do it like this:

var i = 1;
foreach (var filename in filenames)
{
    WaveFile waveFile;

    using (var stream = new FileStream(filename, FileMode.Open))
    {
         waveFile = new WaveFile(stream);
    }

    var signal = waveFile[Channels.Left];

    // if files are big, let's parallelize computations:
    var mfccVectors = mfccExtractor.ParallelComputeFrom(signal);
    var header = mfccExtractor.FeatureDescriptions;

    using (var csvFile = new FileStream($"{i}.csv", FileMode.Create))
    {
        var serializer = new CsvFeatureSerializer(mfccVectors, header);
        await serializer.SerializeAsync(csvFile);
    }

    i++;
}

In case of online processing, different options arise depending on how exactly the audio data gets to your program. You can still use the ComputeFrom() method. If portions of incoming data are relatively small, or if it makes sense to process each frame separately, then call the ProcessFrame() method. Note that this method ignores the time position of a frame, since it returns only the array of computed features.

However, the most recommended approach for online feature extraction is using the wrapper around FeatureExtractor called OnlineFeatureExtractor (new in ver.0.9.5):

var mfccExtractor = new OnlineFeatureExtractor(new MfccExtractor(options));

//
// In general, there are 2 ways we can use OnlineFeatureExtractor:
//

// 1) accumulate all feature vectors during processing (if we need to store them all)

var mfccVectors = new List<float[]>();

//while (there's data in buffers)
{
     // function responsible for getting new portion of online data
     var block = /*GetDataFromBuffers()*/;

     var newVectors = mfccExtractor.ComputeFrom(block);

     // block can have various lengths, and the online feature extractor
     // will handle all possible cases nicely

     // including the case when the block is very small
     // and not sufficient to compute any new feature vector.

     // in this case check for null:

     if (newVectors != null)
     {
         mfccVectors.AddRange(newVectors);
     }
}

mfccExtractor.Reset();


// 2) don't accumulate feature vectors during processing 
//    (we reuse the same memory for output feature vectors,
//    do something with them and forget about them)

var count = mfccExtractor.VectorCountFromSeconds(1/*sec*/);

// prepare memory for output temporary vectors:

var tempVectors = new List<float[]>(count);
for (var i = 0; i < count; i++)
{
    tempVectors.Add(new float[options.FeatureCount]);
}


//while (there's data in buffers)
{
     // function responsible for getting new portion of online data
     var block = /*GetDataFromBuffers()*/;

     // fill tempVectors
     int computedCount = mfccExtractor.ComputeFrom(block, tempVectors);

     // block can have various lengths, and the online feature extractor
     // will handle all possible cases nicely

     // including the case when the block is very small
     // and not sufficient to compute any new feature vector.

     if (computedCount > 0)
     {
         // do something with tempVectors[0] .. tempVectors[computedCount - 1]
     }
}

mfccExtractor.Reset();

As we can see in the code above, there are two methods with name ComputeFrom:

  • List<float[]> ComputeFrom(float[] data)
  • int ComputeFrom(float[] data, List<float[]> vectors)

The former version allocates memory for each new list of feature vectors (or returns null if there was not enough data to compute even one feature vector) at each processing stage. The latter version fills a pre-allocated list of feature vectors and returns the number of feature vectors filled at the current processing stage. The latter approach is usually preferable in online feature extraction, since it's memory-efficient and often there's no need to accumulate all vectors (you work with just the current ones).

Another usage example

The constructor of OnlineFeatureExtractor has two additional parameters:

  • ignoreLastSamples (false, by default)
  • maxDataSize (0, by default)

In order to understand the first parameter, keep in mind that, in general, not all data will be processed by the extractor at each stage. Sometimes that's not a problem - the ending part can simply be skipped (lost). In this scenario set ignoreLastSamples = true. But if every sample should be accounted for, then the online extractor will have to save the ending part to a temporary array and prepend it to the next portion of data. In this (default) scenario set ignoreLastSamples = false. Let's illustrate:

frameSize: 12
hopSize:   6

signal: ssssssssssssssssssssssssxxxx    (portion 1)

frames: ssssssssssss                    (frame 1 is processed)
              ssssssssssss              (frame 2 is processed)
                    ssssssssssss        (frame 3 is processed)
                          ssssssxxxx??  <- this frame will not be processed
                                           because it needs 2 non-existing samples

So there are 4 samples (xxxx) which were ignored in portion 1 of data.

signal: ssssssssssssssssssssssssssss     (portion 2)
 
Now we can process this new portion as it is or prepend it with xxxx:

signal: xxxxssssssssssssssssssssssssssss (portion 2)

Parameter maxDataSize defines the size of an intermediate buffer that accumulates online portions of data. By default it's 0, i.e. the size will be auto-computed as the number of samples corresponding to one second of the signal (given the sampling rate specified in Options). If the number of samples in some portion of online data exceeds this parameter value, an exception will be thrown during processing.
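
For example, both parameters can be set in the constructor (a sketch, assuming named arguments; the maxDataSize value here is chosen arbitrarily):

var onlineExtractor = new OnlineFeatureExtractor(
    new MfccExtractor(options),
    ignoreLastSamples: false,
    maxDataSize: 32000);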

You can change the size at any time:

onlineExtractor.EnsureSize(50000/*samples*/);
onlineExtractor.EnsureSizeFromSeconds(1.5/*sec*/);

Important note

If you're going to do online processing manually by calling the ProcessFrame() method (not a recommended approach), keep in mind that extractors don't do pre-emphasis and windowing in ProcessFrame() even if you've specified the corresponding parameters in the extractor's constructor! Pre-processing of each frame (if it's needed) must be done manually in this case:

var window = Window.OfType(WindowTypes.Hamming, frameSize);
var preFilter = new PreEmphasisFilter(0.97);

void GotNewAudioData(float[] data)
{
    preFilter.Process(data, data);
    data.ApplyWindow(window);

    mfccExtractor.ProcessFrame(data, features);
    // ...
}

Just prefer OnlineFeatureExtractor for online processing. It handles all nuances nicely.

Note. OnlineFeatureExtractor doesn't handle case when HopSize > FrameSize (since this is quite weird and unusual setting for online feature extraction).

Also, some extractors have state (PNCC/SPNCC, and PLP with RASTA coefficient > 0), so don't forget to reset them when you start processing a new sequence of data. All other extractors are stateless, so there's no need to worry about resetting them.

Pre-processing

In speech processing, pre-emphasis filters are often applied to signal before main processing.

There are 3 options to perform pre-emphasis filtering:

  1. set pre-emphasis coefficient in constructor of a feature extractor
  2. apply filter before processing and process filtered signal
  3. filter signal in-place and process it

The first option is slightly slower, but it won't allocate extra memory and it won't mutate the input signal (so, perhaps, it should be the default choice). If preserving the input signal is not required, then the third option is the best. If the input signal must be preserved and extra memory is not an issue, then the second approach is preferable (it'll be faster).

// option 1:

var opts = new MfccOptions
{
    SamplingRate = signal.SamplingRate,
    FeatureCount = 13,
    PreEmphasis = 0.95
};
var mfccExtractor = new MfccExtractor(opts);
var mfccVectors = mfccExtractor.ComputeFrom(signal);

// option 2:
// ApplyTo() will create new signal (allocate new memory)

opts.PreEmphasis = 0;
mfccExtractor = new MfccExtractor(opts);
var pre = new PreEmphasisFilter(0.95);
var filtered = pre.ApplyTo(signal);
mfccVectors = mfccExtractor.ComputeFrom(filtered);

// option 3:
// process array or DiscreteSignal samples in-place:

for (var i = 0; i < signal.Length; i++)
{
    signal[i] = pre.Process(signal[i]);
}
// or simply:
// pre.Process(signal.Samples, signal.Samples);

mfccVectors = mfccExtractor.ComputeFrom(signal);

Post-processing

  • Mean subtraction
  • Variance normalization
  • Adding deltas and delta-deltas to existing feature vector
  • Joining (merging feature vectors into one longer vector)

Example:

var mfccs = mfccExtractor.ComputeFrom(signal);

FeaturePostProcessing.NormalizeMean(mfccs);
FeaturePostProcessing.NormalizeVariance(mfccs, bias: 0);

FeaturePostProcessing.AddDeltas(mfccs, includeDeltaDelta: false);

var totalVectors = FeaturePostProcessing.Join(mfccs, lpcs, lpccs);

The bias parameter in the NormalizeVariance() method is by default equal to 1 (so the estimate of variance is unbiased). It appears in the following formula:

var = sum((x[i] - mean)^2) / (N - bias)

Method AddDeltas() extends each feature vector in the list. Deltas are computed according to the formula:

delta[t] = sum(n = 1..N) { n * (c[t+n] - c[t-n]) } / (2 * sum(n = 1..N) { n^2 })

N can be passed as a parameter. By default N = 2, so the formula reduces to:

delta[t] = ((c[t+1] - c[t-1]) + 2 * (c[t+2] - c[t-2])) / 10

As can be seen, we need to handle edge cases (at the beginning and at the end of the list). By default, the AddDeltas() method adds two zero vectors at the beginning and two zero vectors at the end of the list (and that's perfectly fine in most cases). You can also prepend and append your own collections of feature vectors to specify the 'previous' and 'next' sets explicitly (there must be at least two vectors in each set; the method will concatenate all lists and calculate deltas in the combined list, starting from the third vector and ending with the third from the end):

List<float[]> previous = new List<float[]>(); // for N previous vectors
List<float[]> next = new List<float[]>();     // for N vectors after the last one

// ... fill 'previous' and 'next' with values ...

FeaturePostProcessing.AddDeltas(mfccs, previous, next);

Serialization

var extractor = new MfccExtractor(opts);
var mfccs = extractor.ComputeFrom(signal);

// simplest format:

using (var csvFile = new FileStream("mfccs.csv", FileMode.Create))
{
    var serializer = new CsvFeatureSerializer(mfccs);
    await serializer.SerializeAsync(csvFile);
}


// more detailed format:

var timeMarkers = extractor.TimeMarkers(mfccs.Count);
var header = extractor.FeatureDescriptions;

using (var csvFile = new FileStream("mfccs.csv", FileMode.Create))
{
    var serializer = new CsvFeatureSerializer(mfccs, timeMarkers, header);
    await serializer.SerializeAsync(csvFile);
}

Time-domain feature extractor

TimeDomainFeaturesExtractor class is the first representative of the multi-extractors family:

  • TimeDomainFeaturesExtractor
  • SpectralFeaturesExtractor
  • Mpeg7SpectralFeaturesExtractor

These extractors compute several features at once in each frame using different formulae/routines. They accept the string containing the list of feature names enumerated with any separator (',', '+', '-', ';', ':'). If the string "all" or "full" is specified then multi-extractor will compute all pre-defined features (returned by FeatureSet public property).

For configuring multi-extractors MultiFeatureOptions class is used. It adds the following properties to base options (all of them are optional, they may be used in one particular extractor and ignored in another extractor):

  1. string FeatureList (by default, "all");
  2. int FftSize (0 by default, i.e. it will be auto-derived)
  3. float[] Frequencies (null by default, i.e. spectral frequencies will be derived automatically)
  4. (double, double, double)[] FrequencyBands (null by default, i.e. bands will be auto-computed)
  5. Dictionary<string, object> Parameters (specific for particular extractor)

TimeDomainFeaturesExtractor, as the name suggests, computes time-domain features, such as: energy, RMS, ZCR and entropy (all these methods are contained in DiscreteSignal class):

var opts = new MultiFeatureOptions
{
    SamplingRate = signal.SamplingRate,
    FrameDuration = 0.032,
    HopDuration = 0.02
};
var tdExtractor = new TimeDomainFeaturesExtractor(opts);
var tdVectors = tdExtractor.ComputeFrom(signal);

// compute only RMS and ZCR at each step:

opts.FeatureList = "rms, zcr";

tdExtractor = new TimeDomainFeaturesExtractor(opts);
tdVectors = tdExtractor.ComputeFrom(signal);

// let's examine what is available:

var names = tdExtractor.FeatureSet;  // { "energy, rms, zcr, entropy" }

Recognized keywords are:

  • Energy: "e", "en", "energy"
  • RMS: "rms"
  • ZCR: "zcr", "zero-crossing-rate"
  • Entropy: "entropy"

Keywords are case-insensitive.

You can also add your own feature with corresponding function for its calculation. This function must accept three parameters: 1) signal, 2) start position, 3) end position (exclusive). Code example:

var opts = new MultiFeatureOptions
{
    SamplingRate = signal.SamplingRate,
    FeatureList = "rms"
};
var tdExtractor = new TimeDomainFeaturesExtractor(opts);

// let's add two features:
// 1) "positives": percentage of samples with positive values
// 2) "avgStartEnd": just the average of start and end sample

tdExtractor.AddFeature("positives", CountPositives);
tdExtractor.AddFeature("avgStartEnd", (s, start, end) => (s[start] + s[end - 1]) / 2);

var count = tdExtractor.FeatureCount;        // 3
var names = tdExtractor.FeatureDescriptions; // { "rms", "positives", "avgStartEnd" }

// ...
float CountPositives(DiscreteSignal signal, int start, int end)
{
    var count = 0;
    for (var i = start; i < end; i++)
    {
        if (signal[i] >= 0) count++;
    }
    return (float)count / (end - start);
}

Spectral feature extractor

SpectralFeaturesExtractor computes spectral features, such as: centroid, spread, flatness, etc. All these methods are taken from static class Spectral and can be calculated separately for one particular spectrum, without creating any extractor:

using NWaves.Features;

// prepare array of frequencies
// (simply spectral frequencies: 0, resolution, 2*resolution, ...)

var resolution = (float)samplingRate / fftSize;

var frequencies = Enumerable.Range(0, fftSize / 2 + 1)
                            .Select(f => f * resolution)
                            .ToArray();

var spectrum = new Fft(fftSize).MagnitudeSpectrum(signal);

// compute various spectral features
// (spectrum has length fftSize/2+1)

var centroid = Spectral.Centroid(spectrum, frequencies);
var spread = Spectral.Spread(spectrum, frequencies);
var flatness = Spectral.Flatness(spectrum, minLevel);
var noiseness = Spectral.Noiseness(spectrum, frequencies, noiseFreq);
var rolloff = Spectral.Rolloff(spectrum, frequencies, rolloffPercent);
var crest = Spectral.Crest(spectrum);
var decrease = Spectral.Decrease(spectrum);
var entropy = Spectral.Entropy(spectrum);
var contrast1 = Spectral.Contrast(spectrum, frequencies, 1);
//...
var contrast6 = Spectral.Contrast(spectrum, frequencies, 6);

Usually, the spectral frequencies are involved in calculations, but you can specify any frequency array you want:

var freqs = new float[] { 200, 300, 500, 800, 1200, 1600, 2500, 5000/*Hz*/ };

var centroid = Spectral.Centroid(spectrum, freqs);
var spread = Spectral.Spread(spectrum, freqs);

SpectralFeaturesExtractor usage example:

var opts = new MultiFeatureOptions
{
    SamplingRate = signal.SamplingRate,
    Frequencies = freqs
};
var extractor = new SpectralFeaturesExtractor(opts);
var vectors = extractor.ComputeFrom(signal);

// let's examine what is available:

var names = extractor.FeatureSet;
// { "centroid, spread, flatness, noiseness, rolloff, crest, entropy, decrease, c1+c2+c3+c4+c5+c6" }

Recognized keywords are:

  • Spectral Centroid: "sc", "centroid"
  • Spectral Spread: "ss", "spread"
  • Spectral Flatness: "sfm", "flatness"
  • Spectral Noiseness: "sn", "noiseness"
  • Spectral Rolloff: "rolloff"
  • Spectral Crest: "crest"
  • Spectral Entropy: "ent", "entropy"
  • Spectral Decrease: "sd", "decrease"
  • Spectral Contrast in band 1,2,...: "c1", "c2", ...

Keywords are case-insensitive.

You can also add your own spectral feature with corresponding function for its calculation. This function must accept two parameters: 1) array of samples, 2) array of frequencies. Code example:

opts.FeatureList = "sc";

var extractor = new SpectralFeaturesExtractor(opts);

// let's add new feature: relative position of the first peak
extractor.AddFeature("peakPos", FirstPeakPosition);

// ...
// in our case 'frequencies' array will be ignored

float FirstPeakPosition(float[] spectrum, float[] frequencies)
{
    for (var i = 2; i < spectrum.Length - 2; i++)
    {
        if (spectrum[i] > spectrum[i - 2] && spectrum[i] > spectrum[i - 1] && 
            spectrum[i] > spectrum[i + 2] && spectrum[i] > spectrum[i + 1]) 
        {
            return (float) i / spectrum.Length;
        }
    }
    return 0;
}

Dictionary of parameters may contain the following keys (see the example after this list):

  • "noiseFrequency" (used for computing Spectral.Noiseness; by default 3000)
  • "rolloffPercent" (used for computing Spectral.Rolloff; by default 0.85f)

Note. Spectral noiseness is an unconventional parameter; it is calculated as the ratio of spectral energy in the high-frequency region to the total spectral energy:

noiseness = sum[f > noiseFrequency] X(f) / sum[f] X(f)

Pitch

There are several pitch detection (estimation) techniques. They broadly fall into two groups:

  • Time-domain techniques (auto-correlation, YIN, ZCR-based)
  • Frequency-domain techniques (Harmonic product/sum spectrum, cepstrum)

Static class Pitch provides the following methods for pitch evaluation (estimation):

  • FromAutoCorrelation
  • FromYin
  • FromZeroCrossingsSchmitt
  • FromSpectralPeaks
  • FromHps
  • FromHss
  • FromCepstrum

All of these methods have overloaded versions that accept either DiscreteSignal or float[] as the first parameter. In the case of time-domain estimators the float[] array represents signal samples; in the case of frequency-domain estimators it represents the spectrum.

All methods, except FromZeroCrossingsSchmitt, accept the lower and upper frequencies of the range in which to search for pitch (by default 80 and 400 Hz, respectively):

var pitch = Pitch.FromAutoCorrelation(signal, start, end, 100, 500);

The method based on the number of zero crossings and a Schmitt trigger is good for estimating pitch in periodic sounds (e.g. a guitar string):

var pitch = Pitch.FromZeroCrossingsSchmitt(signal, start, end, -0.2f, 0.2f);

The last two parameters are optional thresholds for the Schmitt trigger. You can try tweaking them or just leave them unset - by default the thresholds will be estimated from the signal.

The YIN algorithm is implemented as described in:

De Cheveigné, A., Kawahara, H. YIN, a fundamental frequency estimator for speech and music. The Journal of the Acoustical Society of America, 111(4). - 2002.

var pitch = Pitch.FromYin(signal, start, end, 80, 400, 0.2f);

The last parameter is the YIN-specific "threshold for the cumulative mean-difference function". By default it's 0.25f.

The first frequency-domain method is FromSpectralPeaks. It's a very simple and surprisingly decent technique: it finds the position of the first peak in the spectrum (a value is a peak if it's greater than its two left and two right neighbours):

var pitch = Pitch.FromSpectralPeaks(spectrum, sr, 80, 500 /*Hz*/);
pitch = Pitch.FromSpectralPeaks(signal, start, end, 80, 500 /*Hz*/);

The first overload accepts the spectrum array. The second accepts the signal and (optionally) the FFT size (since it will compute the spectrum itself). If you're not sure, ignore the fftSize parameter - the method will derive it automatically.

The same goes for two similar methods: Harmonic Sum Spectrum (HSS) and Harmonic Product Spectrum (HPS):

var pitch = Pitch.FromHss(spectrum, sr, 80, 500 /*Hz*/);
pitch = Pitch.FromHss(signal, start, end, 80, 500 /*Hz*/);

pitch = Pitch.FromHps(spectrum, sr, 80, 500 /*Hz*/);
pitch = Pitch.FromHps(signal, start, end, 80, 500 /*Hz*/);

Example of how we can add any of these methods to the SpectralFeaturesExtractor feature set:

var extractor = new SpectralFeaturesExtractor(opts);
extractor.AddFeature("pitch_hss", (spectrum, fs) => Pitch.FromHss(spectrum, signal.SamplingRate, 80, 500));

var vectors = extractor.ComputeFrom(signal);

Pitch extractor

There's also a PitchExtractor class inherited from FeatureExtractor. Currently it's based only on the auto-correlation technique (since this method is universal and works reasonably well). Each feature vector in the list contains one component: "pitch".

var opts = new PitchOptions
{
    SamplingRate = signal.SamplingRate,
    FrameDuration = 0.04,
    LowFrequency = 80/*Hz*/,
    HighFrequency = 500/*Hz*/
};
var extractor = new PitchExtractor(opts);
var pitches = extractor.ComputeFrom(signal);

The two additional properties in the PitchOptions class are the lower and upper frequencies of the range in which to search for pitch.

If you need a pitch extractor based on another time-domain method (YIN or ZCR/Schmitt), then the TimeDomainFeaturesExtractor class can be used. Likewise, if you need a pitch extractor based on a certain spectral method (HSS or HPS), then SpectralFeaturesExtractor can be used. Example:

var opts = new MultiFeatureOptions
{
    SamplingRate = signal.SamplingRate,
    FeatureList = "en"
};
var extractor = new TimeDomainFeaturesExtractor(opts);
extractor.AddFeature("yin", (s, start, end) => Pitch.FromYin(s, start, end, 80, 500));

var pitches = extractor.ComputeFrom(signal);

MPEG-7 feature extractor

Mpeg7SpectralFeaturesExtractor follows MPEG-7 recommendations to evaluate the following features:

  • Spectral features (MPEG-7)
  • Harmonic features
  • Perceptual features

It's a flexible extractor that allows varying almost everything. The difference between Mpeg7SpectralFeaturesExtractor and SpectralFeaturesExtractor is that the former calculates spectral features from the total energy in frequency bands, while the latter analyzes signal energy at particular frequencies (spectral bins). It also optionally allows computing harmonic features in addition to spectral features.

Hence, the configurations of these two classes are basically the same, except that the MPEG-7 extractor accepts an array of frequency bands (double, double, double)[] instead of an array of frequencies. By default, 6 octave bands are used according to the MPEG-7 standard.
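
For example, custom bands can be passed via the FrequencyBands property (a sketch; the FilterBanks.OctaveBands() helper and its signature are an assumption here):

var octaveBands = FilterBanks.OctaveBands(6, signal.SamplingRate);  // assumed helper

var opts = new MultiFeatureOptions
{
    SamplingRate = signal.SamplingRate,
    FrequencyBands = octaveBands
};
var mpeg7Extractor = new Mpeg7SpectralFeaturesExtractor(opts);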

Recognized keywords for spectral and perceptual features are:

  • Spectral Centroid: "sc", "centroid"
  • Spectral Spread: "ss", "spread"
  • Spectral Flatness: "sfm", "flatness"
  • Spectral Noiseness: "sn", "noiseness"
  • Spectral Rolloff: "rolloff"
  • Spectral Crest: "crest"
  • Spectral Entropy: "ent", "entropy"
  • Spectral Decrease: "sd", "decrease"
  • Perceptual Loudness: "loudness"
  • Perceptual Sharpness: "sharpness"

Recognized keywords for harmonic features are:

  • Harmonic centroid: "hc", "hcentroid"
  • Harmonic Spread: "hs", "hspread"
  • Inharmonicity: "inh", "inharmonicity"
  • Odd-to-Even Ratio: "oer", "oddevenratio"
  • Tristimulus1: "t1"
  • Tristimulus2: "t2"
  • Tristimulus3: "t3"

Harmonic features can be calculated separately, using the corresponding methods of static class Harmonic.

Harmonic features rely on pitch and harmonic peaks of the spectrum. The pitch track (a float[] array of pitch values) can be precomputed with PitchExtractor. In this case you can call extractor.SetPitchTrack(pitchTrack), so that the extractor will use these pre-computed values at each processing step. Note that if you set the pitch track, the extractor will stop being parallelizable. The second option is to calculate pitch at each step in the MPEG-7 extractor itself. By default, the simplest and fastest method, Pitch.FromSpectralPeaks(), is used for pitch evaluation, but you can set your own pitch estimating function.

Also, the method for harmonic peak detection must be set. It has quite a long signature and, by default, simply calls the static method Harmonic.Peaks(float[] spectrum, int[] peaks, float[] peakFrequencies, int samplingRate, float pitch = -1). Once again, you can set your own method that fills the arrays of peak indices and peak frequencies.

Phew! That was not easy. Let's take a look at the example:

var mpeg7Extractor = new Mpeg7SpectralFeaturesExtractor(opts);
mpeg7Extractor.IncludeHarmonicFeatures("all", 12, GetPitch, GetPeaks, 80, 500 /*Hz*/);

// ...

float GetPitch(float[] spectrum)
{
    return Pitch.FromHps(spectrum, signal.SamplingRate, 80, 600/*Hz*/);

    // or any other user-defined algorithm
}

void GetPeaks(float[] spectrum, int[] peaks, float[] peakFrequencies, int samplingRate, float pitch = -1)
{
    if (pitch < 0)
    {
        pitch = GetPitch(spectrum);
    }
    // fill peaks array
    // fill peakFrequencies array
}

So, the IncludeHarmonicFeatures() method allows setting the functions for pitch and spectral peak detection. The second parameter (in this case 12) is the number of harmonic peaks to evaluate.

In the following example spectral and perceptual features are calculated in 12 mel frequency bands; harmonic features are included as well and calculated based on pitch values precomputed with PitchExtractor (so the things related to pitch estimation are much simpler in this case):

var sr = signal.SamplingRate;

// 12 overlapping mel bands in frequency range [0, 4200] Hz
var melBands = FilterBanks.MelBands(12, sr, 0, 4200, true);

var pitchOptions = new PitchOptions
{
    SamplingRate = sr,
    FrameDuration = 0.04,
    HopDuration = 0.015,
    HighFrequency = 700
};
var pitchExtractor = new PitchExtractor(pitchOptions);
var pitchTrack = pitchExtractor.ComputeFrom(signal)
                               .Select(p => p[0])
                               .ToArray();

var opts = pitchOptions.Cast<PitchOptions, MultiFeatureOptions>();
opts.FrequencyBands = melBands;

var mpeg7Extractor = new Mpeg7SpectralFeaturesExtractor(opts);
mpeg7Extractor.IncludeHarmonicFeatures("all");
mpeg7Extractor.SetPitchTrack(pitchTrack);

var mpeg7Vectors = mpeg7Extractor.ComputeFrom(signal);

var harmonicFeatures = mpeg7Extractor.HarmonicSet;
// ""hcentroid, hspread, inharmonicity, oer, t1+t2+t3";"

MFCC/PNCC

Since so many variations of MFCC have been developed since the 1980s, the MfccExtractor class is very general and allows customizing pretty much everything:

  • filterbank (by default it's MFCC-FB24 HTK/Kaldi-style)
  • non-linearity type (logE, log10, decibel (Librosa power_to_db analog), cubic root)
  • spectrum calculation type (power/magnitude normalized/not normalized)
  • DCT type (1,2,3,4 normalized or not): "1", "1N", "2", "2N", etc.
  • floor value for LOG-calculations (usually it's float.Epsilon; HTK default seems to be 1.0 and in librosa 1e-10 is used)

For configuration use MfccOptions (<- FilterbankOptions <- FeatureExtractorOptions):

The MfccOptions class has a lot of properties, and there are broad possibilities for customization: you can pass your own filter bank, for instance a bark bank (then the algorithm effectively becomes BFCC) or a gammatone bank (then it becomes GFCC), etc.

List of additional properties:

  • int FeatureCount (number of MFCC coefficients)
  • int FilterBankSize (24 by default)
  • double LowFrequency (0 by default, filter bank lower frequency)
  • double HighFrequency (samplingRate / 2 by default, filter bank upper frequency)
  • int FftSize (by default 0, i.e. it will be calculated automatically as the closest power of 2 to FrameSize)
  • float[][] FilterBank (by default null, i.e. filterbank will be generated from parameters above)
  • int LifterSize (22 by default)
  • string DctType ("1", "1N", "2", "2N", "3", "3N", "4", "4N")
  • NonLinearityType NonLinearity (0 - LogE, 1 - Log10, 2 - ToDecibel, 3 - CubicRoot, 4 - None)
  • SpectrumType SpectrumType (0 - Power, 1 - Magnitude, 2 - PowerNormalized, 3 - MagnitudeNormalized)
  • WindowTypes Window (by default, WindowTypes.Hamming)
  • float LogFloor (float.Epsilon by default)
  • bool IncludeEnergy (replace first coefficient with frame energy, false by default)
  • float LogEnergyFloor (by default float.Epsilon, floor to avoid -∞ in log-energy)

PnccOptions class used for PNCC and SPNCC extractors is similar and has one additional parameter:

  • int Power (by default 15)

If Power is set to 0, then the Log(x) operation will be applied to the spectrum before the DCT-II. Otherwise the operation Pow(x, 1/Power) will be applied.

Since PNCC involves temporal causal filtering, the first M (i.e. 2 by default) vectors will be zero.
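
For example, a minimal PNCC configuration might look like this (Power = 15 is the default value):

var pnccOptions = new PnccOptions
{
    SamplingRate = signal.SamplingRate,
    FeatureCount = 13,
    Power = 15   // set Power = 0 to apply Log(x) instead of Pow(x, 1/15)
};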

PnccExtractor doesn't perform mean normalization in the end, but it can be done manually:

var sr = signal.SamplingRate;

var pnccExtractor = new PnccExtractor(pnccOptions);
var pnccVectors = pnccExtractor.ComputeFrom(signal, /*from*/1000, /*to*/10000 /*sample*/);
FeaturePostProcessing.NormalizeMean(pnccVectors);

The default filter bank in MFCC consists of triangular overlapping mel-bands. If you specify your own filter bank, then the parameters LowFrequency, HighFrequency and FilterBankSize will be ignored. The same goes for PLP/PNCC/SPNCC. For example, let's change mel-bands to bark-bands and actually obtain a BFCC extractor:

var barkbands = FilterBanks.BarkBands(16, sr, 100/*Hz*/, 6500/*Hz*/, overlap: true);
var barkbank = FilterBanks.Triangular(512, sr, barkbands);

var bfccOptions = new MfccOptions
{
    SamplingRate = signal.SamplingRate,
    FeatureCount = 13,
    FilterBank = barkbank
};
var bfccExtractor = new MfccExtractor(bfccOptions);
var bfccVectors = bfccExtractor.ParallelComputeFrom(signal);

Still, manual configuration of filter banks can be very annoying and error-prone. There are two basic standards in MFCC computations: MFCC-FB24 (HTK, Kaldi) and MFCC-FB40 (Slaney's Auditory Toolbox, Librosa). Luckily, there are two MfccOptions subclasses in NWaves created just for that: MfccHtkOptions and MfccSlaneyOptions. The extractors based on these configs were thoroughly tested against HTK and librosa, and they gave similar results (there are some minor discrepancies, though).

For constructing both of these objects you only need to specify the basic filterbank parameters; the filterbank weights will then be created automatically. Due to this special initialization scheme, the parameters must be given in the constructor (although they can still be changed later using properties):

// possible configurations:

var mfccHtkOptions = new MfccHtkOptions(16000, 13, 0.025, 200, 6800);
var mfccSlaneyOptions = new MfccSlaneyOptions(16000, 13, 0.025, 200, 6800);
// parameters are: sampling rate, number of coeffs, frame duration, lower and upper frequency in Hz

var mfccExtractor1 = new MfccExtractor(mfccHtkOptions);
var mfccExtractor2 = new MfccExtractor(mfccSlaneyOptions);

The following configs are close analogs of Librosa extractors (2 modes):

// === The difference is in filter banks: ========================================================

// filterbank 1) librosa: Htk = False (i.e. Slaney-style)

var melBank = FilterBanks.MelBankSlaney(filterbankSize, fftSize, samplingRate, lowFreq, highFreq);

// filterbank 2) librosa: Htk = True

var melBands = FilterBanks.MelBands(filterbankSize, samplingRate, lowFreq, highFreq);
var melBank = FilterBanks.Triangular(fftSize, samplingRate, melBands, null, Scale.HerzToMel);

// ================================================================================================

// and the extractor is:

var opts = new MfccOptions
{
    SamplingRate = samplingRate,
    FeatureCount = featureCount,
    FilterBank = melBank,
    NonLinearity = NonLinearityType.ToDecibel,
    LogFloor = 1e-10f,
    //...
};
var e = new MfccExtractor(opts);

More examples and illustrations: here and here

Additional notes

  • dctType: "2N" is applied by default in basically all standards of MFCC. Some authors mention type 3, and it might be the source of confusion because they mean inverse DCT-3N which is essentially the direct DCT-2N. So actually direct DCT-2N is used in MFCC algorithms.
  • includeEnergy: true means that the first MFCC coefficient will be replaced with Log(energy_of_the_frame) (or Log(logEnergyFloor) if the energy is less than options.LogEnergyFloor).

FilterbankExtractor

MFCC and PNCC extractors process the signal using filter banks, post-process the resulting spectra and compress them using DCT. Sometimes this last step is unnecessary and we're interested only in the processed spectra. FilterbankExtractor was added to NWaves exactly for this purpose: it's essentially an MfccExtractor without the DCT step.

For configuration use FilterbankOptions ( <- FeatureExtractorOptions).
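
A short usage sketch (the property set below mirrors the filterbank-related MFCC options listed above and should be treated as an assumption):

var opts = new FilterbankOptions
{
    SamplingRate = signal.SamplingRate,
    FilterBankSize = 24,
    NonLinearity = NonLinearityType.LogE
};
var extractor = new FilterbankExtractor(opts);
var vectors = extractor.ComputeFrom(signal);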

Chroma feature extractor

ChromaExtractor allows obtaining the chromagram of signal.

Usually there's no need to tweak any parameters of extractor, other than the parameters in base FeatureExtractorOptions. However, for compliance with librosa chroma_stft() function, the analogous parameters were added:

var options = new ChromaOptions
{
     // basic parameters:

     SamplingRate = 16000,
     FrameSize = 2048,
     HopSize = 512,
     
     // additional parameters (usually no need to change their default values):

     FeatureCount = 12,
     Tuning = 0,
     CenterOctave = 5.0,
     OctaveWidth = 2,
     Norm = 2,
     BaseC = true
};

var chromaExtractor = new ChromaExtractor(options);

Note for librosa users. Parameter Norm is related to norm in function chroma() (method for constructing chroma filterbank), not the norm in the chroma_stft() function. In librosa the latter hides the former. So in librosa set norm = None.

# this is equivalent to settings in C# code above:
c = librosa.feature.chroma_stft(y, 16000, n_fft=2048, hop_length=512, center=False, tuning=0.0, norm=None)

# n_chroma = 12 corresponds to FeatureCount in C# code

If you want to change norm for chroma filterbank construction in librosa, then call this function explicitly and write your own code instead of chroma_stft. For example, if you want to change norm to None (in NWaves it's equivalent to options.Norm = 0):

S = np.abs(librosa.stft(
        y=y,
        n_fft=2048,
        hop_length=512,
        window='hann',
        center=False))

S = S**2 # power spectrogram

chromafb = librosa.filters.chroma(sr, 2048, tuning=0.0, n_chroma=12, norm=None)  # <- here
c = np.dot(chromafb, S)

LPC/LPCC

For LPC configuration use LpcOptions ( <- FeatureExtractorOptions):

  • int LpcOrder (LPC order, the number of coefficients will be order+1)
  • FeatureCount (ignored)

For LPCC configuration use LpccOptions (<- LpcOptions <- FeatureExtractorOptions):

  • int FeatureCount (number of LPCC coefficients, in general it's not equal to LpcOrder)
  • int LifterSize (by default 22)

Example:

var opts = new LpccOptions
{
    SamplingRate = signal.SamplingRate,
    LpcOrder = 10,
    FeatureCount = 14,
    LifterSize = 20
};
var lpcExtractor = new LpcExtractor(opts);
var lpccExtractor = new LpccExtractor(opts);

There's also a public static class Lpc with several useful methods:

Lpc.EstimateOrder()  // estimate optimal LPC order for a given sampling rate
Lpc.ToCepstrum()     // convert LPC to LPCC
Lpc.FromCepstrum()   // convert LPCC to LPC
Lpc.LevinsonDurbin() // Levinson-Durbin recursion
Lpc.ToLsf()          // convert LPC to Line Spectral Frequencies
Lpc.FromLsf()        // convert Line Spectral Frequencies to LPC
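
For example (a small sketch; the assumption here is that EstimateOrder() takes the sampling rate):

var order = Lpc.EstimateOrder(signal.SamplingRate);   // assumed signature

var lpcOpts = new LpcOptions
{
    SamplingRate = signal.SamplingRate,
    LpcOrder = order
};
var lpcExtractor = new LpcExtractor(lpcOpts);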

PlpExtractor

This extractor computes Perceptual Linear Prediction coefficients (with optional RASTA filtering). The extractor produces results quite close to HTK version.

For configuration use PlpOptions (<- FilterbankOptions <- FeatureExtractorOptions):

  • int FeatureCount (number of PLP coefficients)
  • int LpcOrder (order of LPC)
  • double Rasta (coefficient of RASTA filter, 0 by default (no filtering))
  • int FilterBankSize (24 by default)
  • double LowFrequency (0 by default, filter bank lower frequency)
  • double HighFrequency (samplingRate / 2 by default, filter bank upper frequency)
  • int FftSize (0 by default, i.e. it will be calculated automatically as the closest power of 2 to FrameSize)
  • int LifterSize (0 by default, i.e. no liftering)
  • WindowTypes Window (by default, WindowTypes.Hamming)
  • float[][] FilterBank (by default null, i.e. filterbank will be generated from parameters above)
  • double[] CenterFrequencies (by default null, i.e. they will be autocomputed)
  • IncludeEnergy (by default false, i.e. the first coefficient is not replaced with log-energy)
  • LogEnergyFloor (by default float.Epsilon, floor to avoid -∞ in log-energy)

If rasta coefficient > 0, then the extractor becomes stateful and non-parallelizable.

Filterbank center frequencies are needed by algorithm in order to obtain the equal loudness curve weights.

By default, the bark filterbank suggested by M. Slaney is auto-generated, since the bark scale was used in the original PLP paper (H. Hermansky). Kaldi and HTK apply a mel filterbank, so for compliance with their results pass a mel filterbank to the PLP options:

var melBands = FilterBanks.MelBands(filterbankSize, samplingRate, lowFreq, highFreq);
var melBank = FilterBanks.Triangular(fftSize, samplingRate, melBands, null, Scale.HerzToMel);

var opts = new PlpOptions { SamplingRate = samplingRate, FilterBank = melBank };
var plpExtractor = new PlpExtractor(opts);

WaveletExtractor

This extractor simply computes wavelet coefficients in each frame.

For configuration use WaveletOptions (<- FeatureExtractorOptions):

  1. string WaveletName (by default "haar")
  2. int FeatureCount (the number of first coefficients that should form the resulting feature vector)
  3. FwtLevel (by default 0, i.e. the maximum possible level is auto-computed)
  4. FwtSize (by default 0, i.e. the nearest power of 2 to the frame size is auto-computed)

Example:

var opts = new WaveletOptions
{
    SamplingRate = signal.SamplingRate,
    FeatureCount = 64,
    FrameDuration = 512.0 / signal.SamplingRate, // so that frame size = 512 samples
    HopDuration = 512.0 / signal.SamplingRate,   // so that hop size = 512 samples
    WaveletName = "db5",
    FwtLevel = 3
};

var extractor = new WaveletExtractor(opts);
var vectors = extractor.ParallelComputeFrom(signal);

AMS

Amplitude Modulation Spectra extractor.

For configuration use AmsOptions (<- FeatureExtractorOptions):

  • int ModulationFftSize (64 by default)
  • int ModulationHopSize (4 by default)
  • int FftSize (by default 0, i.e. it will be calculated automatically as the closest power of 2 to FrameSize)
  • IEnumerable<float[]> Featuregram (null by default)
  • float[][] FilterBank (null by default)

If the filterbank is specified, then it will be used in calculations:

var opts = new AmsOptions
{
    SamplingRate = signal.SamplingRate,
    FrameDuration = 0.0625,
    HopDuration = 0.02,
    ModulationFftSize = 64,
    ModulationHopSize = 16,
    FilterBank = filterbank
};
var extractor = new AmsExtractor(opts);
var features = extractor.ComputeFrom(signal);

You can also specify featuregram (i.e. compute AMS for various featuregrams, not only spectrograms):

var mfccExtractor = new MfccExtractor(mfccOptions);
var vectors = mfccExtractor.ComputeFrom(signal);
FeaturePostProcessing.NormalizeMean(vectors);

var opts = new AmsOptions
{
    SamplingRate = signal.SamplingRate,
    Featuregram = vectors
};
var extractor = new AmsExtractor(opts);

If neither a filterbank nor a featuregram is specified, then the filterbank is generated automatically in AmsExtractor as 12 overlapping mel-bands covering the frequency range from 100 to 3200 Hz.

Creating your own feature extractor

In order to create a new feature extractor, take the following steps:

  • Create new class and inherit it from FeatureExtractor base class
  • If your extractor needs more configuration parameters than provided in FeatureExtractorOptions, then create your own options class, inherit it from FeatureExtractorOptions and add the additional parameters there
  • In the constructor of the newly created class, call the base constructor accepting options
  • In the constructor, set the FeatureCount property (required)
  • Override the ProcessFrame() method and the FeatureDescriptions property (you can also optionally override the ComputeFrom() method for some exotic processing of frame sequences)
  • If the feature extractor can be parallelized, then override IsParallelizable() (return true) and the virtual ParallelCopy() method. ParallelCopy() must return a clone of the current extractor (with the full set of identical parameters) for parallel computations.
  • Also, you can inherit from any of the existing feature extractors (they're not sealed). Moreover, all internal fields in extractor classes are intentionally made protected instead of private so that you could reuse them.

For demo purposes, let's write a new extractor that computes two values in each frame: the minimum and maximum absolute differences between two neighbouring samples. Let's also assume we need an additional parameter - the distance between the two samples - and it must be not less than 1.

public class DeltaStatsExtractor : FeatureExtractor
{
    private readonly int _distance;

    public override List<string> FeatureDescriptions => new List<string> { "minDelta", "maxDelta" };

    public DeltaStatsExtractor(DeltaStatsOptions opts) : base(opts)
    {
        FeatureCount = 2;
        _distance = opts.Distance;
    }

    public override void ProcessFrame(float[] samples, float[] features)
    {
        var minDelta = Math.Abs(samples[_distance] - samples[0]);
        var maxDelta = minDelta;

        for (var k = 1; k < FrameSize - _distance; k++)
        {
            var delta = Math.Abs(samples[k + _distance] - samples[k]);

            if (delta < minDelta) minDelta = delta;
            if (delta > maxDelta) maxDelta = delta;
        }

        features[0] = minDelta;
        features[1] = maxDelta;
    }

    public override bool IsParallelizable() => true;

    public override FeatureExtractor ParallelCopy() =>
              new DeltaStatsExtractor(
                   new DeltaStatsOptions
                   {
                       SamplingRate = SamplingRate,
                       FrameDuration = FrameDuration,
                       HopDuration = HopDuration,
                       PreEmphasis = _preEmphasis,
                       Window = _window,
                       Distance = _distance
                   });
}

[DataContract]
public class DeltaStatsOptions : FeatureExtractorOptions
{
    [DataMember]
    public int Distance { get; set; } = 1;

    public override List<string> Errors
    {
        get
        {
            var errors = base.Errors;
            if (Distance < 1) errors.Add("Distance must be not less than 1");
            return errors;
        }
    }
}

And once again:

Extractors don't do pre-emphasis and windowing in ProcessFrame() method even if you've specified corresponding parameters in extractor's constructor.

This is because in the base FeatureExtractor class pre-emphasis and windowing are performed only during sequential processing of overlapping frames in the ComputeFrom()/ParallelComputeFrom() methods (the reason is that the code looks cleaner and less convoluted this way).

So if you add pre-emphasis and windowing code to your ProcessFrame() implementation and you're only ever going to call this method, that's completely fine. However, if you call ComputeFrom(), it will first apply the base pre-processing and then call ProcessFrame() with your user-defined pre-processing on top (thus, the samples will be pre-processed twice). This is not a big deal and there are several ways to write correct code here, but the nuance should be accounted for.

To sum up:

  1. don't add pre-emphasis and windowing code in ProcessFrame() of your FeatureExtractor subclass
  2. prefer OnlineFeatureExtractor.ComputeFrom() over FeatureExtractor.ProcessFrame()