Chunking option for TileDBwriter #74
Hi @eggio, thanks for raising this - and great use-case! As you can tell, I hadn't considered this use-case at all when putting the writer together. Thinking about this does expose a difficulty, and thus a possible solution as well. Even if you could stream a NetCDF file chunk-wise to the open TileDB array, you'd still need to know where, index-wise, within the array each chunk should end up, so that the TileDB array is a faithful representation of the original NetCDF file.

So, the workaround: assuming you have the disk space, you could split your large NetCDF file into multiple small NetCDF files along your chunking dimension using a tool such as Iris or xarray (or even the NetCDF command-line tools). You could then append these successively to a TileDB array using `writer.append`.

And the possible enhancement: currently …
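For concreteness, here is a minimal sketch of that workaround using xarray. The file name `big_file.nc`, the variable layout, and the choice of `time` as the chunking dimension are assumptions for illustration; the append call in the comment mirrors the one quoted later in this thread.

```python
import xarray as xr

# Hypothetical source file and chunking dimension - adjust to your data.
ds = xr.open_dataset("big_file.nc")

append_files = []
for i in range(ds.sizes["time"]):
    # isel with a list keeps "time" as a length-1 dimension rather than
    # squeezing it down to a scalar coordinate.
    sub = ds.isel(time=[i])
    path = f"chunk_{i:04d}.nc"
    sub.to_netcdf(path)
    append_files.append(path)

# The sub-files can then be appended successively to a TileDB array, e.g.:
# writer.append(append_files, unlimited_dims, data_array_name)
```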
Thanks for the fast answer with the workaround for the chunking for tiledb_netcdf. I split the large file with CDO, but when I try to append the resulting files I get:

"ValueError: not enough values to unpack (expected 1, got 0)"

raised in "\nctotdb\writers\tiledb.py", line 731, in _dim_points, at step, = np.unique(np.diff(points)).

I have no idea what the origin of this error is - could you help me? My append_files list is just a list with the filenames of the NetCDF files, and the files are directly in my working directory.
Hi @eggio - it's a little tricky without being able to see your sub NetCDF files, but I wonder if CDO has split them such that they're scalar in the dimension you need to append them along. If we imagine a stylised view of your original NetCDF file and the sub-files that result, you might have ended up with something like the following:
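As a hedged sketch of the kind of sub-file being described, inspected with xarray (the file name, variable name, and shapes here are all assumptions for illustration):

```python
import xarray as xr

# Original file: "time" is a proper dimension, e.g.
#   air_temperature  (time: 100, lat: 180, lon: 360)

# A sub-file produced by the split might instead look like this:
sub = xr.open_dataset("chunk_0000.nc")   # hypothetical sub-file
print(sub["air_temperature"].dims)       # ('lat', 'lon')  - no "time" dimension
print(sub["time"].shape)                 # ()  - "time" is now a scalar coordinate
```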
Scalar coordinates don't behave the same as dimensioned coordinates (they don't have a shape, among other things). That might be sufficient to cause the error you've encountered. In that case, you'll need to promote the scalar coordinate to a length-1 dimension coordinate. Continuing our stylised view of NetCDF files, this would have the following impact on the shape of all the sub-files (although I'm only showing one here for brevity!):
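One way to do that promotion with xarray (again, the names are hypothetical) is `expand_dims`, which turns an existing scalar coordinate into a length-1 dimension coordinate:

```python
import xarray as xr

sub = xr.open_dataset("chunk_0000.nc")   # hypothetical sub-file with a scalar "time"

# expand_dims promotes the existing scalar coordinate to a length-1 dimension.
fixed = sub.expand_dims("time")
print(fixed["air_temperature"].dims)     # ('time', 'lat', 'lon')
print(fixed["time"].shape)               # (1,)
fixed.to_netcdf("chunk_0000_fixed.nc")
```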
However, tiledb_netcdf will still need some help to append these files, as you can't calculate a step between points from a single point! In that case you'll need to specify a baseline offset when you run the append, so that the append operation knows what to look at to calculate the offset between successive single points. This looks like:

writer.append(append_files, unlimited_dims, data_array_name, baselines=append_files[0])

As an aside, it looks like you're using a previous version of tiledb_netcdf. The latest release includes some significant updates to the append process - it won't necessarily solve the issue you're having, but it should make it more likely that the append will function! If you can update, I recommend it 🙂
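A quick NumPy illustration of why a single point can't yield a step, which is exactly what the traceback above runs into:

```python
import numpy as np

points = np.array([42.0])   # a length-1 coordinate, as in a single-step sub-file
diffs = np.diff(points)     # -> array([], dtype=float64): no pairs to difference
print(np.unique(diffs))     # -> array([], dtype=float64)

# The line `step, = np.unique(np.diff(points))` therefore tries to unpack zero
# values into one variable, raising
# "ValueError: not enough values to unpack (expected 1, got 0)".
```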
Adding the option to write data from a netCDF file to a TileDB array chunk-wise, to avoid running into a MemoryError. Also give the option to choose along which dimension to chunk.
Example: I am trying to convert a >13 GB netCDF file to a single sparse TileDB array but cannot manage to do this with the TileDBwriter, as I am limited to <8 GB of RAM.