Labeling
In order to build a model that can generate classification results, the model must be trained on labeled data. The toolkit uses a simple comma-separated value (CSV) file to assign labels, and optionally other metadata using name/value pair tags, to .wav files and/or segments of those files. The file defining the labeling is typically named metadata.csv and usually lives in the same directory as the sounds, although neither of these is a strict requirement. For example, given a directory of .wav files, the contents of the metadata.csv file might be as follows:
```
CarEngine.wav,source=engine;status=normal,
Compressor.wav,source=compressor;status=normal,
Machine.wav,source=machine;status=normal,
OneCylinderEngine.wav[100-1000],source=engine;status=abnormal,
```
Formally, the format is as follows:
- There is no header row (although earlier versions used one).
- Column 1 - the name of the .wav file. If relative, the file is located relative to the metadata.csv file. Optionally, a segment of the referenced file may be specified using start and end millisecond offsets into the .wav file, as shown above, using the [start-end] notation.
- Column 2 - a semicolon (;) separated list of label=value pairs specifying the name of each label and its value.
- Column 3 - a semicolon (;) separated list of tag=value pairs specifying the name of each tag and its value. Tags are sometimes used by client applications to hold application data such as the device source, start time, etc.
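To make the format concrete, here is a minimal sketch of how one such row could be parsed in Python. This is illustrative only: parse_metadata_row is a hypothetical helper, not part of the toolkit.

```python
import csv
import re

def parse_metadata_row(row):
    """Parse one metadata.csv row into (path, segment, labels, tags).

    segment is a (start_ms, end_ms) tuple, or None for the whole file;
    labels and tags are dicts built from the ';'-separated pair lists.
    """
    file_field = row[0]
    label_field = row[1] if len(row) > 1 else ""
    tag_field = row[2] if len(row) > 2 else ""

    # Split an optional [start-end] millisecond segment off the file name.
    m = re.match(r"^(.*?)(?:\[(\d+)-(\d+)\])?$", file_field)
    path = m.group(1)
    segment = (int(m.group(2)), int(m.group(3))) if m.group(2) else None

    def parse_pairs(field):
        pairs = (p.split("=", 1) for p in field.split(";") if p)
        return {name: value for name, value in pairs}

    return path, segment, parse_pairs(label_field), parse_pairs(tag_field)

with open("metadata.csv", newline="") as f:
    for row in csv.reader(f):
        if row:  # skip blank lines
            print(parse_metadata_row(row))
```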
The command line tools that read labeled sounds from the file system generally require a metadata.csv file to provide the labels.
Audacity is an open source UI tool for working with audio files. Among its many capabilities are playback, visualization, and labeling of audio segments. Our focus here is on using Audacity to define labels and convert them into the metadata.csv file format required by the CLI tools.
After opening an audio file (and viewing the spectrogram), we can apply label values as shown below:
In short:
- Open Audacity on a selected .wav file (myaudio.wav).
- Click in the spectrogram and drag to select/identify a segment of the sound.
- Then use the Edit->Label->Add option or its shortcut to create and label the selected segment.
- Use the Export->Labels option to save the labels as a text file to disk (e.g. myaudio.txt).
- Use the audacity2metadata CLI to convert the Audacity labels file to a metadata.csv file (a sketch of this mapping appears after this list):
audacity2metadata -label state -wav myaudio.wav -audacity myaudio.txt > metadata.csv
- Note: this requires Python 3. If your default python command is not Python 3, set the PYTHON env var to the name of your Python 3 executable (e.g. export PYTHON=python3.8).
- Use the sound-info CLI to test the metadata file:
sound-info -sounds metadata.csv
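For reference, the sketch below shows the kind of mapping audacity2metadata performs, assuming the standard Audacity label export format (tab-separated lines of start seconds, end seconds, and label text). It is a hypothetical illustration, not the toolkit's implementation.

```python
def audacity_labels_to_metadata(wav_name, audacity_txt, label_name="state"):
    """Print metadata.csv rows for each label in an Audacity label export.

    audacity_labels_to_metadata is a hypothetical stand-in for the
    toolkit's audacity2metadata tool, shown only to illustrate the mapping.
    """
    with open(audacity_txt) as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 3:
                continue  # skip anything that is not a label line
            start_ms = int(float(fields[0]) * 1000)  # Audacity times are in seconds
            end_ms = int(float(fields[1]) * 1000)
            # e.g. myaudio.wav[0-1500],state=idle,
            print(f"{wav_name}[{start_ms}-{end_ms}],{label_name}={fields[2]},")

audacity_labels_to_metadata("myaudio.wav", "myaudio.txt")
```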
With a metadata-formatted csv file in hand, you can now list the contents of your data set using the sound-info tool as follows:
```
% sound-info -sounds metadata.csv
Loading sounds from [metadata.csv]
Total: 103 samples, 1 label(s), 00:01:38.188 hr:min:sec
Label: class, 103 samples, 3 value(s), 00:01:38.188 hr:min:sec
Value: ambient, 61 samples, 00:01:11.960 hr:min:sec
Value: click, 23 samples, 00:00:5.076 hr:min:sec
Value: voice, 19 samples, 00:00:21.152 hr:min:sec
```
The above lists the raw .wav files and segments without any clipping, as might be used while evaluating or training a model. To see the effect of clipping, for example:
```
% sound-info -sounds metadata.csv -clipLen 1000
Loading sounds from [metadata.csv]
Sounds will be clipped every 1000 msec into 1000 msec clips (padding=NoPad)
Total: 56 samples, 1 label(s), 00:00:56.000 hr:min:sec
Label: class, 56 samples, 2 value(s), 00:00:56.000 hr:min:sec
Value: ambient, 42 samples, 00:00:42.000 hr:min:sec
Value: voice, 14 samples, 00:00:14.000 hr:min:sec
```
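Note that the click value disappears from the clipped totals. This is consistent with NoPad clipping keeping only full-length clips: every click segment is evidently shorter than 1000 msec, so none survive. A minimal sketch of that inferred behavior (num_clips is a hypothetical helper, not the toolkit's code):

```python
def num_clips(segment_ms, clip_len_ms=1000):
    # NoPad: only full-length clips are kept; any remainder is dropped.
    return segment_ms // clip_len_ms

print(num_clips(3500))  # a 3500 msec segment yields 3 one-second clips
print(num_clips(220))   # a segment shorter than the clip length yields none
```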
Finally, you may also balance your data during training. To see the effect of balancing, you can use the following (the -label option is required when balancing data):
```
% sound-info -sounds metadata.csv -clipLen 1000 -balance-with up -label class
Loading sounds from [metadata.csv]
Sounds will be clipped every 1000 msec into 1000 msec clips (padding=NoPad)
Total: 84 samples, 1 label(s), 00:01:24.000 hr:min:sec
Label: class, 84 samples, 2 value(s), 00:01:24.000 hr:min:sec
Value: ambient, 42 samples, 00:00:42.000 hr:min:sec
Value: voice, 42 samples, 00:00:42.000 hr:min:sec
```
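The sketch below shows one plausible reading of -balance-with up that is consistent with the totals above: clips of the smaller classes are duplicated until each label value has as many clips as the largest. upsample is a hypothetical helper, not the toolkit's implementation.

```python
import itertools

def upsample(clips_by_value):
    """Duplicate clips of the smaller classes until every label value
    has as many clips as the largest one (a guess at -balance-with up).
    """
    target = max(len(clips) for clips in clips_by_value.values())
    return {
        value: list(itertools.islice(itertools.cycle(clips), target))
        for value, clips in clips_by_value.items()
    }

# 42 'ambient' clips and 14 'voice' clips become 42 and 42, matching
# the balanced totals above (84 samples, 00:01:24).
balanced = upsample({"ambient": list(range(42)), "voice": list(range(14))})
print({value: len(clips) for value, clips in balanced.items()})
```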