Optimize compute_dataset_taxonomy_stats queries #6

ffont · 2017-03-28T08:31:53Z

In compute_dataset_taxonomy_stats (https://github.com/MTG/freesound-datasets/blob/master/datasets/tasks.py#L89) we run one big query to get the number of annotations and the number of sounds for all taxonomy categories, and then one extra small query per category to get the number of non-validated annotations. There are probably two ways to optimise this:

1. Get all the information for all categories in a single query. We tried this, but the resulting query took a very long time (on the order of hours) to run on the full-sized dataset (approx. 250k sounds and 500k annotations). We reverted to separate queries as a quick fix to make the function usable, but maybe the single query can be improved to run quickly. One sure way to improve it would be to add an is_validated field to the datasets.models.Annotation model, updated whenever new votes for an annotation are created. However, the first thing to try is to make the query fast without storing that intermediate value.
2. Get all the information about the number of sounds and the number of annotations in one big query (as now), and get the number of non-validated annotations for all categories in a second big query (so running 2 big queries instead of 1 + 1 * num categories).
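
As a rough illustration of the second option, here is a minimal sketch using Python's built-in sqlite3 module rather than the project's Django models. The table and column names (annotation, vote, category, sound_id) are hypothetical, and it assumes that "non-validated" means an annotation with no votes, which may not match the actual definition in tasks.py.

```python
import sqlite3

# Hypothetical, simplified schema; not the real freesound-datasets models.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE annotation (
        id INTEGER PRIMARY KEY,
        sound_id INTEGER,
        category TEXT
    );
    CREATE TABLE vote (
        id INTEGER PRIMARY KEY,
        annotation_id INTEGER REFERENCES annotation(id)
    );
    INSERT INTO annotation (id, sound_id, category) VALUES
        (1, 10, 'dog'), (2, 11, 'dog'),
        (3, 10, 'cat'), (4, 12, 'cat'),
        (5, 12, 'bird');
    INSERT INTO vote (annotation_id) VALUES (1), (1), (3);
""")

# Big query 1: annotations and distinct sounds for every category at once.
totals = conn.execute("""
    SELECT category, COUNT(*), COUNT(DISTINCT sound_id)
    FROM annotation
    GROUP BY category
""").fetchall()

# Big query 2: annotations without any vote, grouped by category.
# Assumption: "non-validated" == "has no votes".
non_validated = conn.execute("""
    SELECT a.category, COUNT(*)
    FROM annotation AS a
    WHERE NOT EXISTS (SELECT 1 FROM vote AS v WHERE v.annotation_id = a.id)
    GROUP BY a.category
""").fetchall()

# Merge both result sets in Python; categories missing from the second
# query have zero non-validated annotations.
stats = {
    category: {
        "num_annotations": num_annotations,
        "num_sounds": num_sounds,
        "num_non_validated": 0,
    }
    for category, num_annotations, num_sounds in totals
}
for category, count in non_validated:
    stats[category]["num_non_validated"] = count
```

The merge step runs in memory, so the number of database round trips stays at two regardless of how many categories there are. Whether the second query is fast enough at full dataset size would still need to be measured.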

ffont changed the title ~~Optimize compute_dataset_taxonomy_stats queries~~ Optimize compute_dataset_taxonomy_stats queries Mar 28, 2017

xavierfav added the implementation label Nov 2, 2017
