In compute_dataset_taxonomy_stats (https://github.com/MTG/freesound-datasets/blob/master/datasets/tasks.py#L89) we run one big query to get the number of annotations and the number of sounds for all taxonomy categories, and then one extra small query per category to get the number of non-validated annotations. There are probably two ways to optimise this:
1. Get all the information for all categories in a single query. We tried this, but the resulting query took a very long time (on the order of hours) on the full-sized dataset (approx. 250k sounds and 500k annotations). We reverted to separate queries as a quick fix to keep the function usable, but the single query can perhaps be improved so that it runs quickly. One change that would almost certainly help is adding an is_validated field to the datasets.models.Annotation model, updated whenever a new vote for an annotation is created. However, the first thing to try would be making the query fast without storing that intermediate value.
2. Get the number of sounds and the number of annotations in one big query (as now), and get the number of non-validated annotations for all categories in a second big query, so that we run 2 big queries instead of 1 + 1 * num categories.
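To make the second option concrete, here is a minimal sketch of the two-query approach using Python's built-in sqlite3 module rather than the project's actual Django models. The table and column names (annotation, vote, category, sound_id, annotation_id) are hypothetical, and "non-validated" is assumed here to mean "annotation with no votes"; the real definition in tasks.py may differ.

```python
import sqlite3


def taxonomy_stats(conn):
    """Compute per-category stats with two grouped queries.

    Assumes a hypothetical schema: annotation(id, sound_id, category)
    and vote(id, annotation_id). An annotation without any vote is
    treated as non-validated (an assumption for illustration).
    """
    stats = {}
    # Query 1: number of annotations and distinct sounds per category.
    rows = conn.execute(
        "SELECT category, COUNT(*), COUNT(DISTINCT sound_id) "
        "FROM annotation GROUP BY category"
    )
    for category, num_annotations, num_sounds in rows:
        stats[category] = {
            "num_annotations": num_annotations,
            "num_sounds": num_sounds,
            "num_non_validated": 0,  # filled in by query 2 when non-zero
        }
    # Query 2: annotations without votes, grouped per category.
    rows = conn.execute(
        "SELECT a.category, COUNT(*) FROM annotation a "
        "WHERE NOT EXISTS (SELECT 1 FROM vote v WHERE v.annotation_id = a.id) "
        "GROUP BY a.category"
    )
    for category, num_non_validated in rows:
        stats[category]["num_non_validated"] = num_non_validated
    return stats


# Tiny in-memory dataset to exercise the function.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE annotation (id INTEGER PRIMARY KEY, sound_id INTEGER, category TEXT);
    CREATE TABLE vote (id INTEGER PRIMARY KEY, annotation_id INTEGER);
    INSERT INTO annotation VALUES (1, 10, 'dog'), (2, 11, 'dog'), (3, 10, 'cat'), (4, 12, 'dog');
    INSERT INTO vote VALUES (1, 1), (2, 3);
    """
)
stats = taxonomy_stats(conn)
print(stats)
```

The number of queries stays at two regardless of how many categories exist. In Django, the same grouping would typically be expressed with values(...) followed by annotate(Count(...)), though whether that runs fast enough on the full dataset would need to be measured.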
ffont
changed the title
Optimize compute_dataset_taxonomy_stats queries
Optimize compute_dataset_taxonomy_stats queries
Mar 28, 2017