slow open_dataset for large NetCDF files #460
Comments
Do you have a link to a file that I could test with? I open such files almost daily and 2 minutes seems high. Is it for the 1st call of the function or for the 2nd call? |
The times were similar on the first and second calls. These files are not "online", they're on Gadi at NCI (the Australian cluster). Is there a place where I can upload one file (about 5GB) for you to test? |
Can you download them to your computer and test with a local file? Perhaps the problem lies in the "http/downloads" packages rather than in YAXArrays? As for a file, I don't know, it might be hard on my side, behind my corporate firewall, to download your file from most hosting providers. Is the NCI URL "open"? |
I already work "locally" in the sense that I don't download the files and instead use a compute node that has direct access to the files. These files on NCI are not accessible without an account with them, hence why I was offering to upload one to somewhere "open". |
OK, I understand. Sometimes on a cluster, the filesystem (e.g. gpfs or nfs) can be slow. If you have a lot of I/O, it can be worthwhile to copy the file(s) to the node currently used for the calculations. For example, on our clusters, this is something like
In parallel, I am not sure if YAXArrays uses the |
Zarr.jl should use the consolidated metadata that is in .zmetadata if available, but this is not applicable here, because he is dealing with NetCDF data. You should be able to upload an example file here: |
@felixcremer I put one such file on my google drive that I can share, if that works? |
That works. |
Sent an invite to your email (from your GitHub profile) |
I managed to reproduce this locally on my laptop with your dataset. So this is not a file system issue, but rather a YAXArrays issue.
julia> @time RasterStack("ocean_month_19901231.nc", lazy=true)
┌ Warning: unsupported calendar `GREGORIAN`. Time units are ignored.
└ @ CommonDataModel ~/.julia/packages/CommonDataModel/G3moc/src/cfvariable.jl:203
┌ Warning: unsupported calendar `GREGORIAN`. Time units are ignored.
└ @ CommonDataModel ~/.julia/packages/CommonDataModel/G3moc/src/cfvariable.jl:203
16.379549 seconds (15.43 M allocations: 928.044 MiB, 2.22% gc time, 98.40% compilation time)
╭────────────────── |
The main problem here is that, in the current implementation, YAXArrays keeps opening and closing the file several times for every variable inside it, which becomes a bit costly. One way to speed this up would be to go back to a NetCDF backend that just maintains a handle to the open file like we did in the past, e.g. by defining this:
import YAXArrayBase as YAB
using NetCDF
YAB.get_var_dims(ds::NetCDF.NcFile,name) = map(i->i.name,ds[name].dim)
YAB.get_varnames(ds::NetCDF.NcFile) = collect(keys(ds.vars))
YAB.get_var_attrs(ds::NetCDF.NcFile, name) = copy(ds[name].atts)
YAB.get_global_attrs(ds::NetCDF.NcFile) = copy(ds.gatts)
YAB.allow_parallel_write(::Type{<:NetCDF.NcFile}) = false
YAB.allow_parallel_write(::NetCDF.NcFile) = false
YAB.allow_missings(::Type{<:NetCDF.NcFile}) = false
YAB.allow_missings(::NetCDF.NcFile) = false
Base.haskey(ds::NetCDF.NcFile,k) = haskey(ds.vars,k)
With these definitions in place,
using YAXArrays
nc = NetCDF.open(file)
@time open_dataset(nc);
@time open_dataset(nc);
opens the dataset very quickly. However, this means that a handle to the NetCDF file is kept open, which does not scale well if you want to lazily concatenate thousands of NetCDF files into a big multi-file dataset. That was the main reason for us to move to lazy file opening. A solution to all problems would be to keep the file open only while the YAXArray is created and all metadata is parsed, and to switch to the lazy representation afterwards. This means we would need to add some context concept to YAXArrayBase. I am happy to implement this, but the question remains whether it is really worth the effort when the medium-term plan is to move the file opening out of YAXArrays and instead rely on functionality implemented in Rasters.jl to open YAXArray datasets. |
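To make the "context" idea a bit more concrete, here is a minimal, self-contained sketch of the pattern. It does not use YAXArrayBase or NetCDF at all: `with_handle`, `Handle`, `LazyVar`, `FAKE_FILE` and the counter are hypothetical names invented for illustration, and the in-memory dictionary only stands in for a real file. It compares opening the "file" once per variable (a simplification of the behaviour described above) with opening it a single time while all metadata is parsed.

```julia
# Hypothetical sketch: none of these names exist in YAXArrayBase or NetCDF.jl.

# Stand-in for a file on disk: variable name => dimension names.
const FAKE_FILE = Dict("a" => ["x", "y"], "b" => ["x", "y", "time"])

const OPEN_COUNT = Ref(0)    # how often the "file" was opened
const CLOSE_COUNT = Ref(0)   # how often it was closed again

struct Handle
    vars::Dict{String,Vector{String}}
end

# Open the file, run `f` on the handle, and always close it afterwards.
function with_handle(f, path::String)
    OPEN_COUNT[] += 1
    h = Handle(FAKE_FILE)    # a real backend would open `path` here
    try
        return f(h)
    finally
        CLOSE_COUNT[] += 1   # a real backend would close the handle here
    end
end

# Lazy record for one variable: keeps the path and metadata, but no open handle.
struct LazyVar
    path::String
    name::String
    dims::Vector{String}
end

# One open/close cycle per variable.
function open_per_variable(path, names)
    return [with_handle(h -> LazyVar(path, n, h.vars[n]), path) for n in names]
end

# A single open/close cycle while the metadata of all variables is parsed.
function open_once(path)
    return with_handle(path) do h
        [LazyVar(path, n, dims) for (n, dims) in h.vars]
    end
end
```

With this sketch, `open_per_variable("example.nc", ["a", "b"])` increments `OPEN_COUNT` once per variable, while `open_once("example.nc")` increments it once in total and still returns only lazy records, so no handle stays open afterwards.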
Thanks! And congrats on figuring out the issue! I just wanted to say that in my case I worked around the issue by "preprocessing" the data in Python (essentially, I selected the variables I needed and saved them in separate files), so no pressure from me! |
this should be fixed by #470 |
Working with some climate model data, I have large (5GB) NetCDF files for each year of simulation that contain many variables (about 50) at monthly intervals. Just "opening" one of these files with open_dataset takes on the order of 2 minutes, with a lot of allocations and memory use. In comparison, xarray's open_dataset takes about 2 seconds for the same file. What am I doing wrong?