Read performance with/without missing #264
Comments
So the question is: where do these additional allocations come from? Could one not write a

function ignoremissing(dsvar)
    # there's probably a more elegant way to write a shape-preserving raw data read?
    # this is just because one doesn't know whether to write .var[:, :] or .var[:, :, :], ... depending on dims
    raw = reshape(dsvar.var[:], (Base.OneTo(size(dsvar, i)) for i in 1:ndims(dsvar))...)
end
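On the "more elegant way" comment: a minimal sketch of two shape-preserving reads, using a plain in-memory array as a stand-in for the on-disk variable (that stand-in is an assumption for illustration; whether `dsvar.var` accepts the splatted colons the same way is not confirmed in this thread).

```julia
# Stand-in for the variable: a plain 3-D Float32 array (assumption, not a NetCDF variable).
A = reshape(Float32.(1:24), 2, 3, 4)

# Option 1: flat read followed by a reshape with the size tuple directly.
flat = A[:]
raw1 = reshape(flat, size(A))

# Option 2: build one Colon per dimension, so ndims need not be known in advance.
colons = ntuple(_ -> Colon(), ndims(A))
raw2 = A[colons...]
```

Both variants give an array with the original dimensions; the first one mirrors what `ignoremissing` above does, just without constructing the `Base.OneTo` ranges by hand.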
function nomissing(dsvar)
    raw = ignoremissing(dsvar) # read and allocate the array once
    # warn if there's a missing value
    missing_value = dsvar.attrib["_FillValue"]
    missing_value in raw && @warn "Missing value in data"
    return raw
end

Benchmarking this, it is as fast as the raw data read and only allocates the array once.

julia> @btime A = nomissing($ds["data"]);
  490.540 ms (58 allocations: 95.37 MiB)
Just realised that [...]

function value_in(val, collection)
    return !isnothing(findfirst(x -> x === val, collection))
end

(UPDATE: this, however, is currently some 50% slower, but at least it is not allocating, which I find the higher priority when working with datasets.)

julia> @btime A = nomissing2($ds["data"]);
  881.641 ms (59 allocations: 95.37 MiB)
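One place where an identity-based check like `value_in` behaves differently from `in` is a NaN fill value, since `in` compares with `==` and `NaN32 == NaN32` is false. The sketch below illustrates that with a small in-memory array; `value_in_any` is a hypothetical variant added here for comparison and is not from the thread, and its relative speed has not been measured.

```julia
# Identity-based membership test, as posted above.
value_in(val, collection) = !isnothing(findfirst(x -> x === val, collection))

# Hypothetical alternative: `any` with the same predicate, which also stops at the first hit.
value_in_any(val, collection) = any(x -> x === val, collection)

raw = Float32[1.0, NaN32, 3.0]

found_with_in  = NaN32 in raw              # uses ==, so a NaN fill value is never found
found_with_egal = value_in(NaN32, raw)     # uses ===, so the NaN32 entry is found
```

So if `_FillValue` happens to be NaN, the `missing_value in raw` line in `nomissing` would never trigger the warning, whereas the `===` version would.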
If I understand you well, your use case would be to load as efficiently as possible an array of floats (with a [...]

Within the current API, one can use:

ncv2 = cfvariable(ds,"data",maskingvalue=NaN32)
@btime A = $ncv2[:]; # or Array(ncv2) to preserve the shape
# output 341.959 ms (28 allocations: 190.74 MiB)

With a small specialization (JuliaGeo/CommonDataModel.jl@ba34d89) for the case where the raw data type == transformed data type, I can get this down to:

@btime A = $ncv2[:];
# output 316.530 ms (26 allocations: 95.37 MiB)
@btime A = Array($ncv2);
# output 320.732 ms (35 allocations: 95.37 MiB)

Which is the same amount of memory as in your use case. Would that work for you?

If for some reason the element type in the NetCDF variable changes to Int32 but with a scale factor of say [...]

The keywords of [...]

Concerning "2. ...only bit slower but requires more than twice the memory": yes, there is one array for the raw data and one array for the scaled data following the CF convention. I agree that in this particular case, the second array is not needed.
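To make the "one array versus two arrays" point concrete, here is a rough sketch in plain Julia. It is not how NCDatasets or CommonDataModel implement masking internally; the fill value `-9999f0` and the tiny array are made up for illustration.

```julia
fillvalue = -9999f0                      # hypothetical _FillValue
raw = Float32[1, -9999, 3, 4]            # stand-in for the raw data read from disk

# Two-array approach: a second array is allocated to hold the masked values.
masked = Union{Missing,Float32}[x == fillvalue ? missing : x for x in raw]

# One-array approach: when raw and target element types match (Float32 here),
# the fill value can be overwritten with NaN32 in the array that was already read.
replace!(raw, fillvalue => NaN32)
```

The in-place variant only works because NaN32 fits in the raw element type, which seems to be the case the specialization above targets; with an Int32 variable and a scale factor, a second array of a different type would still be needed.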
(Motivated from #227 (comment))
Creating a fake dataset with some compression like
This file is now 24.8MB on disk so ~4x compression factor. Now benchmark the read + decompression
1. .var
So almost 200MB/s and it only allocates that 100MB that the uncompressed array requires.
2. Matrix{Union{Missing, Float32}}
only bit slower but requires more than twice the memory
3. nomissing(::CFVariable)
Takes absolutely forever, don't do this. See #227 (comment). Maybe add a warning or remove the nomissing(::CFVariable) method?
4. nomissing(::Array)
Bit slower again and 3x the allocations.
5. Array(::CFVariable)
Same as (2) but faster?
6. Array{T}(::CFVariable), but providing target type T
Don't do this, also takes forever, probably same as (3).