Evaluation
Once a data set has been labeled (see [Labeling](Labeling)), one or more models may be evaluated for accuracy using the evaluate tool. Evaluation uses K-fold cross validation, in which the data set is randomly divided into K folds (partitions): K-1 folds are used to train the model, and the held-out fold is used as a test set to measure accuracy. This is repeated for each of the K choices of held-out fold, and the results from the held-out folds are averaged to produce a single set of accuracy numbers. Model accuracy is reported as a confusion matrix together with three key metrics: precision, recall, and F1 score. The evaluate tool provides each of these.
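Conceptually, the K-fold procedure looks like the following. This is a minimal, self-contained sketch in plain Python (not the aisp API), assuming generic train and test callbacks and a single accuracy number per fold; the evaluate tool reports precision, recall, and F1 per label value instead, but the fold structure and averaging are the same.

```python
import random

def k_fold_evaluate(samples, labels, k, train_fn, test_fn, seed=0):
    """Randomly partition the data into k folds, train on k-1 of them,
    test on the held-out fold, and average the per-fold accuracy."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)               # random assignment to folds
    folds = [indices[i::k] for i in range(k)]          # k roughly equal partitions

    scores = []
    for held_out in range(k):
        train_idx = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        test_idx = folds[held_out]

        model = train_fn([samples[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        scores.append(test_fn(model,
                              [samples[i] for i in test_idx],
                              [labels[i] for i in test_idx]))

    return sum(scores) / k                             # metric averaged over the k folds
```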
We consider a set of data as follows:
% sound-info -sounds metadata.csv
Loading aisp properties from file:/c:/dev/aisp/aisp.properties
Loading sounds from [metadata.csv]
Total: 103 samples, 1 label(s), 00:01:38.188 hr:min:sec
Label: class, 103 samples, 3 value(s), 00:01:38.188 hr:min:sec
Value: ambient, 61 samples, 00:01:11.960 hr:min:sec
Value: click, 23 samples, 00:00:5.076 hr:min:sec
Value: voice, 19 samples, 00:00:21.152 hr:min:sec
For our example, we will use the lpknn model. Models typically require a fixed-length clip of sound both for training and for classification. The appropriate clip length generally depends on the target sounds; a general rule of thumb is that the labeled sound should occupy about 75% of the clip. So for very short events, say 500-1000 milliseconds long, you should probably not use a clip length larger than 1 second. Our sounds contain a very short event, the click, which calls for a short clip length of 250 msec. Using this clip length, we can now see the inventory of sounds:
% sound-info -sounds metadata.csv -clipLen 250
Loading aisp properties from file:/c:/dev/aisp/aisp.properties
Sounds will be clipped every 250 msec into 250 msec clips (padding=NoPad)
Total: 342 samples, 1 label(s), 00:01:25.500 hr:min:sec
Label: class, 342 samples, 3 value(s), 00:01:25.500 hr:min:sec
Value: ambient, 261 samples, 00:01:5.250 hr:min:sec
Value: click, 5 samples, 00:00:1.250 hr:min:sec
Value: voice, 76 samples, 00:00:19.000 hr:min:sec
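The counts above come from cutting each labeled recording into consecutive, non-overlapping (rolling) windows of the requested length. With padding=NoPad, a recording or trailing remainder shorter than the clip length yields no clip at all, which is largely why the 23 click recordings (about 220 msec each on average) produce only 5 clips at 250 msec. A minimal sketch of this clipping, operating on plain sample arrays (the aisp implementation details may differ):

```python
def clip_rolling(signal, clip_len):
    """Cut a sample array into consecutive (rolling) windows of clip_len samples.
    With padding=NoPad, any trailing remainder shorter than clip_len is dropped,
    and a recording shorter than clip_len produces no clip at all."""
    return [signal[start:start + clip_len]
            for start in range(0, len(signal), clip_len)
            if start + clip_len <= len(signal)]

# Example: a 90-sample recording cut into 40-sample clips -> 2 clips, 10 samples dropped.
print(len(clip_rolling(list(range(90)), 40)))   # 2
```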
Note that we have only 5 click samples, which is relatively few compared to the other label values. This can be partially addressed by padding clips that are too short with duplicated data, as follows:
% sound-info -sounds metadata.csv -clipLen 250 -pad duplicate
Loading aisp properties from file:/c:/dev/aisp/aisp.properties
Loading sounds from [metadata.csv]
Sounds will be clipped every 250 msec into 250 msec clips (padding=DuplicatePad)
Total: 395 samples, 1 label(s), 00:01:38.750 hr:min:sec
Label: class, 395 samples, 3 value(s), 00:01:38.750 hr:min:sec
Value: ambient, 288 samples, 00:01:12.000 hr:min:sec
Value: click, 24 samples, 00:00:6.000 hr:min:sec
Value: voice, 83 samples, 00:00:20.750 hr:min:sec
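Duplicate padding extends clips that would otherwise be too short up to the requested length by repeating their own data, which is why the click count recovers from 5 to 24 clips. Below is a minimal sketch of one way this could work; the exact strategy behind aisp's DuplicatePad may differ.

```python
def pad_by_duplication(clip, clip_len):
    """Extend a too-short clip to clip_len samples by repeating its own data.
    Illustrative only; aisp's DuplicatePad may differ in detail."""
    if len(clip) >= clip_len:
        return clip[:clip_len]
    padded = []
    while len(padded) < clip_len:
        padded.extend(clip)              # append another copy of the short clip
    return padded[:clip_len]             # trim to exactly clip_len samples

# Example: a 180 msec clip at 16 kHz (2880 samples) padded out to 250 msec (4000 samples).
print(len(pad_by_duplication([0.0] * 2880, 4000)))   # 4000
```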
Padding is not done when the remaining clip is less than half the requested length. Yet another technique to increase the number of samples is to use sliding windows instead of rolling windows when clipping the sounds. This is enabled with the -clipShift option. Any shift length can be used, but a good starting point is half the clip length. Using this option, we see the following:
% sound-info -sounds . -clipLen 250 -pad duplicate -clipShift 125
Loading aisp properties from file:/c:/dev/aisp/aisp.properties
Loading sounds from [.]
Sounds will be clipped every 125 msec into 250 msec clips (padding=DuplicatePad)
Total: 736 samples, 1 label(s), 00:03:4.000 hr:min:sec
Label: class, 736 samples, 3 value(s), 00:03:4.000 hr:min:sec
Value: ambient, 549 samples, 00:02:17.250 hr:min:sec
Value: click, 28 samples, 00:00:7.000 hr:min:sec
Value: voice, 159 samples, 00:00:39.750 hr:min:sec
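Sliding windows work like rolling windows except that the start of each clip advances by the shift length rather than by the full clip length, so consecutive clips overlap and a half-length shift roughly doubles the number of samples (compare 736 samples above with 395 previously). A minimal sketch in plain Python, ignoring padding for brevity:

```python
def clip_sliding(signal, clip_len, clip_shift):
    """Cut a sample array into clip_len-sample windows whose start advances by
    clip_shift samples; consecutive clips overlap when clip_shift < clip_len."""
    return [signal[start:start + clip_len]
            for start in range(0, len(signal) - clip_len + 1, clip_shift)]

# Example: an 800-sample recording cut into 200-sample clips.
recording = list(range(800))
print(len(clip_sliding(recording, 200, 200)))   # 4 non-overlapping (rolling) clips
print(len(clip_sliding(recording, 200, 100)))   # 7 overlapping clips with a half-length shift
```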
This has improved the numbers, but ideally there would be an equal number of samples for each label value. When that is not the case, the data can be balanced for both evaluation and training. For evaluation, the -kfoldBalance option is used (see the -balance-with option for the train tool); a sketch of this kind of balancing appears below.
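With -kfoldBalance max, the training data is balanced by resampling every label value up to the count of the most frequent one (549 samples per label value here, as the output below confirms). A minimal sketch of upsampling-to-max balancing using simple random duplication; the tool's actual resampling strategy may differ:

```python
import random
from collections import defaultdict

def balance_to_max(samples, labels, seed=0):
    """Upsample every label value to the count of the most frequent value by
    randomly duplicating samples. Illustrative only; the tool's -kfoldBalance max
    resampling may differ in detail."""
    by_label = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_label[label].append(sample)

    target = max(len(group) for group in by_label.values())
    rng = random.Random(seed)

    balanced = []
    for label, group in by_label.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        balanced.extend((sample, label) for sample in group + extra)
    return balanced

# Example: 549 ambient / 28 click / 159 voice -> 549 samples per label value.
labels = ['ambient'] * 549 + ['click'] * 28 + ['voice'] * 159
balanced = balance_to_max(list(range(len(labels))), labels)
counts = {}
for _, label in balanced:
    counts[label] = counts.get(label, 0) + 1
print(counts)   # {'ambient': 549, 'click': 549, 'voice': 549}
```

So, to evaluate our model: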
% evaluate -model lpknn -sounds metadata.csv -clipLen 250 -pad duplicate -clipShift 125 \
-label class -cm -folds 3 -kfoldBalance max
Loading aisp properties from file:/c:/dev/aisp/aisp.properties
Loading sounds from [metadata.csv]
Sounds will be clipped every 125 msec into 250 msec clips (padding=DuplicatePad)
Warning: Nashorn engine is planned to be removed from a future JDK release
Training and evaluating classifier (LpDistanceMergeKNNClassifier)
Sounds : Total: 736 samples, 1 label(s), 00:03:4.000 hr:min:sec
Label: class, 736 samples, 3 value(s), 00:03:4.000 hr:min:sec
Value: ambient, 549 samples, 00:02:17.250 hr:min:sec
Value: click, 28 samples, 00:00:7.000 hr:min:sec
Value: voice, 159 samples, 00:00:39.750 hr:min:sec
Evaluating 3 of 3 folds with balanced training data at 549 samples per label value.
Evaluation completed in 3198 msec. 1066 msec/fold (computed in parallel)
Evaluated label name: class
COUNT MATRIX:
Predicted ->[ ambient ][ click ][ voice ]
ambient ->[ * 547 * ][ 0 ][ 2 ]
click ->[ 6 ][ * 22 * ][ 0 ]
voice ->[ 9 ][ 1 ][ * 149 * ]
PERCENT MATRIX:
Predicted ->[ ambient ][ click ][ voice ]
ambient ->[ * 74.32 * ][ 0.00 ][ 0.27 ]
click ->[ 0.82 ][ * 2.99 * ][ 0.00 ]
voice ->[ 1.22 ][ 0.14 ][ * 20.24 * ]
Label | Count | F1 | Precision | Recall
ambient | 549 | 98.470 | 97.331 | 99.636
click | 28 | 86.275 | 95.652 | 78.571
voice | 159 | 96.129 | 98.675 | 93.711
Micro-averaged | 736 | 97.554 | 97.554 | 97.554
Macro-averaged | 736 | 93.532 | 97.088 | 90.622
Precision: 97.554 +/- 0.5792% (micro), 97.088 +/- 2.5388% (macro)
Recall : 97.554 +/- 0.5792% (micro), 90.622 +/- 3.2263% (macro)
F1 : 97.554 +/- 0.5792% (micro), 93.532 +/- 2.9625% (macro)
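The per-label metrics follow directly from the count matrix: precision is the fraction of correct predictions in each column, recall the fraction in each row, and F1 their harmonic mean; micro averaging pools all counts, while macro averaging takes the unweighted mean of the per-label metrics. A minimal sketch using the counts above (note that the evaluate tool averages its metrics across the K folds, so the macro figures and +/- spreads it reports can differ slightly from this single pooled-matrix computation):

```python
# Rows = actual label, columns = predicted label, copied from the COUNT MATRIX above.
labels = ["ambient", "click", "voice"]
counts = [[547,  0,   2],    # actual ambient
          [  6, 22,   0],    # actual click
          [  9,  1, 149]]    # actual voice

per_label = {}
for i, name in enumerate(labels):
    tp = counts[i][i]                              # correct predictions for this label
    predicted = sum(row[i] for row in counts)      # column sum: predicted as this label
    actual = sum(counts[i])                        # row sum: actually this label
    precision, recall = tp / predicted, tp / actual
    f1 = 2 * precision * recall / (precision + recall)
    per_label[name] = (precision, recall, f1)
    print(f"{name:8s} precision={precision:.3%} recall={recall:.3%} f1={f1:.3%}")

# Micro average pools all counts (equivalent to overall accuracy for single-label data).
correct = sum(counts[i][i] for i in range(len(labels)))
total = sum(map(sum, counts))
print(f"micro    {correct / total:.3%}")

# Macro average is the unweighted mean of the per-label metrics.
macro = [sum(metrics[j] for metrics in per_label.values()) / len(labels) for j in range(3)]
print(f"macro    precision={macro[0]:.3%} recall={macro[1]:.3%} f1={macro[2]:.3%}")
```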
The evaluate output above shows both the confusion matrices and the per-label precision, recall, and F1 metrics. Note that the confusion matrix does NOT reflect balanced data, since only the K-1 training folds are balanced on each iteration, not the held-out test fold. At this point, you can choose to use the model as-is or tune it using your own JavaScript model definition (see Models for how to define your own model).