From abb00ddaf3e070c4ce4f20c56f706097c3db915b Mon Sep 17 00:00:00 2001
From: Anthony Truskinger
Date: Fri, 9 Aug 2019 12:25:38 +1000
Subject: [PATCH] Added notes on chunk size choices [no_ci]

---
 docs/faq.md | 113 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 113 insertions(+)

diff --git a/docs/faq.md b/docs/faq.md
index 499300c4f..a1227f6ef 100644
--- a/docs/faq.md
+++ b/docs/faq.md
@@ -33,6 +33,119 @@ mind the target audience. You're in the right ballpark if:

More than likely if you're stuck we can help 😊.

## Why do we analyze data in 1-minute chunks?

There are a couple of reasons, but the main one is historical: when we first
started doing this, computers were far less capable than they are now. Computers
are fundamentally limited by the amount of data they can work with at one time;
in technical terms, the amount of data that fits into main memory (RAM). By
breaking the data into smaller chunks, we could _stream_ the data through the
analysis, which greatly improved our overall analysis speed.

We could have chosen any size from 30 seconds to 5 minutes, but one-minute
blocks also have nice temporal features (they compose well with data at
different resolutions) and are still detailed enough to answer questions in our
multi-day analyses.

Today it seems to be the de facto standard to analyze data in one-minute blocks.
We suggest that it is still a good default for most use cases:

- Computers no longer have the limitations they did when we started, but small
  blocks of data allow for parallel analysis that effectively utilizes all CPU
  cores
- While computers are getting better, we're also doing more complex analyses.
  Working in parallel, we can use a large amount of RAM and most of the
  computer's CPU(s) for the quickest analysis
- One-minute blocks still retain the nice temporally composable attributes
  detailed above
- And since one-minute blocks seem to be a de facto standard, they (by
  happenstance) provide common ground for comparing data

## What effect does chunk size have on data?

For acoustic event recognition, boundary effects are typically the only
artifact introduced by the choice of chunk size. That is, if an acoustic event
is clipped by either the start or end of a chunk, leaving only a partial
vocalization, a typical event recognizer may not detect it.

For acoustic indices, from a theoretical point of view, chunk size raises the
same kinds of issues as the choice of FFT frame length in speech processing.
Because an FFT assumes signal stationarity, one chooses a frame length over
which the spectral content of the signal of interest is approximately constant.
In the case of acoustic indices, one chooses an index calculation duration that
captures a sufficient amount of data for the acoustic feature you are
interested in. The longer the analysis duration, the more you blur or average
out features of interest. However, if you choose too short an interval, the
calculated index may be dominated by "noise"---that is, features that are not
of interest. We find this is particularly the case with the ACI index; one
minute seems to be an appropriate duration for the indices that are typically
calculated.

## Can I change the chunk size?

The [config files](https://github.com/QutEcoacoustics/audio-analysis/tree/master/src/AnalysisConfigFiles) for most of our analyses contain these common settings:

### Chunk-size:

```yaml
# SegmentDuration: units=seconds;
# Long duration recordings are cut into short segments for more
# efficient processing. Default segment length = 60 seconds.
SegmentDuration: 60
```

Here, `SegmentDuration` is our name for the chunk size. If, for example, you
wanted to process data in 5-minute chunks, you could change the configuration
to `SegmentDuration: 300`.
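To make the chunking idea above concrete, here is a minimal sketch in Python, with illustrative names (not the tool's actual implementation), of cutting a recording into `SegmentDuration`-sized chunks and analyzing them in parallel. The optional trailing `overlap` parameter is an assumption modeled on the `SegmentOverlap` setting.

```python
from concurrent.futures import ProcessPoolExecutor

def segment_bounds(total_seconds, segment_duration=60, overlap=0):
    """Yield (start, end) offsets, in seconds, for each chunk of a recording.

    `segment_duration` mirrors `SegmentDuration`; `overlap` mirrors the
    extra trailing buffer added by `SegmentOverlap`.
    """
    start = 0
    while start < total_seconds:
        yield (start, min(start + segment_duration + overlap, total_seconds))
        start += segment_duration

def analyze(bounds):
    # Placeholder for a real per-segment analysis (e.g. index calculation).
    start, end = bounds
    return f"analyzed {start}-{end}s"

if __name__ == "__main__":
    chunks = list(segment_bounds(150))  # a 2.5-minute recording
    # chunks == [(0, 60), (60, 120), (120, 150)]
    # Because each chunk is small and independent, the chunks can be
    # analyzed in parallel across all CPU cores:
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(analyze, chunks))
```

With `overlap=10`, the same 150-second recording yields `[(0, 70), (60, 130), (120, 150)]`: each chunk carries a trailing 10-second buffer, clamped at the end of the recording.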
### Chunk-overlap:

```yaml
# SegmentOverlap: units=seconds;
SegmentOverlap: 0
```

If you're doing event recognition and are concerned about boundary effects, you
could change the overlap to `SegmentOverlap: 10`, which would ensure every
`SegmentDuration`-sized chunk (typically one minute) is cut with an extra
trailing 10-second buffer.

Note: we rarely change this setting, and setting too much overlap may produce
duplicate events.

### Index Calculation Duration (for the indices analysis only):

For acoustic indices in particular, the calculation resolution depends on a
second setting that is limited by the chunk size (`SegmentDuration`):

```yaml
# IndexCalculationDuration: units=seconds (default=60 seconds)
# The Timespan (in seconds) over which summary and spectral indices are calculated
# This value MUST not exceed value of SegmentDuration.
# Default value = 60 seconds, however can be reduced down to 0.1 seconds for higher resolution.
# IndexCalculationDuration should divide SegmentDuration with MODULO zero
IndexCalculationDuration: 60.0
```

If you wanted indices calculated over a duration longer than one minute, you
could change both `SegmentDuration` and `IndexCalculationDuration` to higher
values:

```yaml
SegmentDuration: 300
IndexCalculationDuration: 300
```

However, we suggest that there are better methods for calculating
low-resolution indices. A method we often use is to calculate indices at a
60-second resolution and then aggregate the values into lower-resolution
blocks. The aggregation method can provide some interesting choices:

- We've seen the maximum, median, or minimum value for a block of indices
  chosen (and sometimes all three)
  - though be cautious when using a mean function: it can skew the value of
    logarithmic indices
- And we've seen a block of indices flattened into a larger feature vector and
  fed to a machine learning or clustering algorithm

## We collect metrics/statistics; what information is collected and how is it used?

**NOTE: this is an upcoming feature and has not been released yet**