Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance hit when using dask to compute missing_wmo #1820

Open
1 of 2 tasks
huard opened this issue Jul 3, 2024 · 2 comments
Open
1 of 2 tasks

Performance hit when using dask to compute missing_wmo #1820

huard opened this issue Jul 3, 2024 · 2 comments
Assignees
Labels
bug Something isn't working help wanted Extra attention is needed priority Immediate priority
Milestone

Comments

@huard
Copy link
Collaborator

huard commented Jul 3, 2024

Setup Information

  • Xclim version: 0.47
  • Python version: 3.10

Description

I noticed a very significant performance hit when running missing_wmo with and without dask. The dask version seems to slow things down, and I had to load the array to get results in a reasonable amount of time.

Steps To Reproduce

No response

Additional context

No response

Contribution

  • I would be willing/able to open a Pull Request to address this bug.

Code of Conduct

  • I agree to follow this project's Code of Conduct
@huard huard added the bug Something isn't working label Jul 3, 2024
@aulemahal
Copy link
Collaborator

MWE and dask analysis:

import xclim as xc
from xclim.testing import open_dataset

# Open a dataset of a single chunk
ds = open_dataset('sdba/CanESM2_1950-2100.nc', chunks={'time': -1, 'location': -1})

pr_valid = xc.core.missing.missing_wmo(ds.pr, freq="YS")

The last line took me 115 s. But most importantly, counting the number of tasks with len(ds.pr.__dask_graph__().keys()), I see an increase from 6 to 95304. This is insane!

A probable solution would be to wrap as much as possible into single apply_ufunc or map_blocks call to aggregate the tasks. Maybe we can look into flox for help since we are applying a function along the time axis and grouping.

@aulemahal
Copy link
Collaborator

Adding a subissue: the src_timestep argument of the MissingWMO method is not inferred from the data. Rather it is taken from Indicator.src_freq, but that property is not set on all indicators. We already infer_freq further in the process and we do that in the missing_wmo shortcut function, so it makes sense that the infer_freq would happen in the __init__ instead, so that all uses of the class would go through it.

Noted by @RondeauG .

@Zeitsperre Zeitsperre self-assigned this Dec 10, 2024
@Zeitsperre Zeitsperre added the help wanted Extra attention is needed label Dec 10, 2024
@Zeitsperre Zeitsperre added this to the v0.55.0 milestone Dec 10, 2024
@Zeitsperre Zeitsperre added the priority Immediate priority label Dec 10, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working help wanted Extra attention is needed priority Immediate priority
Projects
None yet
Development

No branches or pull requests

3 participants