Performance: speeding up pickling of cftime arrays #253
Comments
Very interesting. I agree it seems there is significant room for improvement. I think we may be able to follow a similar line of thought as to what led to speedups in the creation of cftime objects (see discussion in pangeo-data/pangeo#764 (comment) and #158). Pickling in cftime calls the Lines 1249 to 1250 in 4f28eb6
Oops
A fairer comparison might be to look at the performance of pickling an array of
Unfortunately nothing obvious now sticks out to me to close this gap.
Indeed, I am not surprised that it takes more time than a While this might not be the best place, @spencerkclark, you might be able to give an opinion. This problem happens in xarray workflows when using
I'm somewhat surprised that encoding a cftime array is faster than pickling it (encoding requires repeated timedelta arithmetic, which is not needed for pickling). Have you done timing experiments to demonstrate this? Or is the issue that
Exactly, the difference is in scale. Let's say we have a 3D (spatial+temporal) array divided into 100 spatial chunks (
Ah...that makes perfect sense now, thanks. Indeed it does seem like the optimization might best take place before cftime is involved. If you can put together a simple example that demonstrates this performance bottleneck it might be interesting to get folks' thoughts in an xarray issue.
I used cftime 1.4.1 and 1.5.0 when exploring this.
My workflows involve large datasets and complex functions. I use xarray, backed by dask. In one of the more complex processing steps, I use xarray's `map_blocks` and a handful of other dask-lazy methods on a large dataset that uses the NoLeap calendar. The dataset has 950 chunks and a 55114-element time coordinate. It seems a lot of time is spent pickling the latter.
More precisely, this line of dask: https://github.com/dask/dask/blob/1c4a84225d1bd26e58d716d2844190cc23ebcfec/dask/base.py#L1028 calls `pickle.dumps` on the numpy array of dtype O that stores the `cftime.datetime` objects.
When profiling the graph creation (no computation triggered yet), I can see that this step is the one that takes the most time, slightly more than another function in xarray's CFTimeIndex creation.
MWE:
`timeit` calls in a notebook:
So even if it is normal that pickling an object array is slower, the cftime array is still 2 orders of magnitude slower than a basic array. I am not very knowledgeable about how `pickle` works, but I believe something could be done to speed this up.
Any ideas?
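
For anyone who wants a rough feel for where the time goes without installing cftime, here is a minimal stdlib-only sketch. It times `pickle.dumps` on three lists of the same length as the time coordinate above: plain integers, standard-library `datetime.datetime` objects, and instances of `PyDate`. `PyDate` is a hypothetical pure-Python class invented for this illustration, not cftime's actual class, and the plain lists only stand in for a numpy object array, so the absolute numbers will not match the cftime timings discussed in this thread.

```python
import pickle
import timeit
from datetime import datetime, timedelta


class PyDate:
    """Hypothetical pure-Python date-like class (an assumption, not cftime's API)."""

    def __init__(self, year, dayofyr):
        self.year = year
        self.dayofyr = dayofyr

    def __reduce__(self):
        # pickle invokes this once per object, so a Python-level call
        # is paid for every element of the container.
        return (self.__class__, (self.year, self.dayofyr))

    def __eq__(self, other):
        return (self.year, self.dayofyr) == (other.year, other.dayofyr)


N = 55114  # length of the time coordinate mentioned in the issue

ints = list(range(N))
start = datetime(2000, 1, 1)
std_dates = [start + timedelta(days=i) for i in range(N)]
# 365-day years, loosely mimicking a no-leap calendar
py_dates = [PyDate(2000 + i // 365, 1 + i % 365) for i in range(N)]


def time_dumps(obj, repeat=3):
    """Return the best wall-clock time (seconds) of a single pickle.dumps call."""
    return min(timeit.repeat(lambda: pickle.dumps(obj), number=1, repeat=repeat))


timings = {
    "ints": time_dumps(ints),
    "datetime.datetime": time_dumps(std_dates),
    "PyDate (stand-in)": time_dumps(py_dates),
}

for name, seconds in timings.items():
    print(f"{name:>20}: {seconds * 1e3:.2f} ms")
```

If the per-object `__reduce__` call dominates, the `PyDate` row should be the slowest of the three, which would be consistent with looking for savings either in how each object is reduced or in avoiding pickling the object array at all.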