
Chunking option for TileDBwriter #74

Open
eggio opened this issue Sep 28, 2021 · 3 comments
eggio commented Sep 28, 2021

Add the option to write data from a netCDF file to a TileDB array chunk-wise, to avoid running into a MemoryError. Also give the option to choose which dimension to chunk along.

Example: I am trying to convert a >13GB netCDF file to a single sparse TileDB array, but cannot manage to do this with the TileDBwriter because I am limited to <8GB RAM.

DPeterK (Contributor) commented Sep 30, 2021

Hi @eggio, thanks for raising this - and great use-case! As you can tell, I hadn't considered this use-case at all when putting tiledb-netcdf together... because I never encountered a NetCDF file sufficiently large that it wouldn't fit into memory. The challenge from the TileDB side is that you have to write all of the contents of an array in one go. In the way that tiledb-netcdf is designed, this minimum unit is a single NetCDF file, which as you have found causes issues when you have a large volume of data to write. I believe this requirement from the TileDB side is being investigated by the TileDB devs, so it may become possible in the future to stream chunks from a NetCDF file to an open TileDB array object.

Thinking about this does expose a difficulty and thus a possible solution as well, however. Even if you could stream a NetCDF file chunk-wise to the open TileDB array, you'd still need to know where index-wise within the array each chunk should end up, so that the TileDB array is a faithful representation of the original NetCDF file. Of course tiledb-netcdf already has that functionality in the append pipeline, which suggests a route to a workaround and possible enhancement.

So, the workaround: assuming you have the disk space, you could split your large NetCDF file into multiple small NetCDF files along your chunking dimension using a tool such as Iris or xarray (or even the NetCDF commandline tools). You could then append these successively to a TileDB array using tiledb-netcdf's current array append pipeline, and that would achieve the result you're after at the cost of (temporarily) having more NetCDF files to handle.
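The splitting step of this workaround can be sketched with xarray (one of the tools suggested above); the dataset, variable name "tas", and dimension name "time" are all illustrative stand-ins for the real >13GB file:

```python
import numpy as np
import xarray as xr

# Small in-memory stand-in for the large NetCDF file:
# 3 timesteps of a 4x5 grid (names and sizes are illustrative).
ds = xr.Dataset(
    {"tas": (("time", "y", "x"), np.arange(60.0).reshape(3, 4, 5))},
    coords={"time": [0, 1, 2]},
)

# One sub-dataset per timestep along the chunking dimension.
# Using slice(i, i + 1) rather than isel(time=i) keeps "time" as a
# length-1 dimension instead of collapsing it to a scalar coordinate.
subsets = [ds.isel(time=slice(i, i + 1)) for i in range(ds.sizes["time"])]

for i, sub in enumerate(subsets):
    # In practice each piece would be written out for appending, e.g.:
    # sub.to_netcdf(f"my_subfile_{i}.nc")
    print(i, dict(sub.sizes))
```

Each sub-file can then be fed to tiledb-netcdf's existing append pipeline in order.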

And the possible enhancement: currently tiledb-netcdf assumes you want to write the entire data variable of a NetCDF file to the TileDB array. This happens in the write_array helper function. This could be adjusted so that (optionally) the indices passed to the writer helper function could be applied to subset the NetCDF data variable as well as specifying the extents within the TileDB array to which the NetCDF data variable should be written. I'll raise this as a separate issue so that the idea is not lost in this discussion!
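To make the proposed enhancement concrete, here is a hypothetical sketch of the idea: the same indices subset the NetCDF variable and place the result within the TileDB array. `write_array` below is a stand-in with an invented signature, not tiledb-netcdf's actual helper, and plain numpy arrays stand in for both the NetCDF variable and the open TileDB array:

```python
import numpy as np

def write_array(tdb_array, nc_var, indices):
    """Hypothetical chunked write: nc_var[indices] lands at tdb_array[indices]."""
    tdb_array[indices] = np.asarray(nc_var[indices])

# Toy usage: copy one chunk along the first dimension.
source = np.arange(12.0).reshape(3, 4)   # "NetCDF data variable"
target = np.zeros((3, 4))                # "open TileDB array"
write_array(target, source, np.s_[1:2, :])
```

The key point is that the indices serve double duty, so a caller could loop over chunks without ever loading the whole variable.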

eggio (Author) commented Oct 4, 2021

Thanks for the fast answer with the workaround for the chunking in tiledb_netcdf.
I tried your suggestion and generated sub-netCDF files from my initial file using CDO.
However, I get an error with "writer.append(append_files, unlimited_dims, data_array_name)". This is the error:

"ValueError: not enough values to unpack (expected 1, got 0)"

In "\nctotdb\writers\tiledb.py", line 731, in _dim_points: step, = np.unique(np.diff(points))

I have no idea what the origin of this error is; could you help me?

My append_files list is just a list of the filenames of the netCDF files, and the files are directly in my working directory.
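For reference, the failing line from the traceback can be reproduced in isolation. This is a sketch: `points` stands in for the dimension points tiledb_netcdf extracts from a file, and with only one point there is no difference to compute, so there is no step value to unpack:

```python
import numpy as np

# A single dimension point, as a file with one entry along the
# append dimension would produce (illustrative values).
points = np.array([0])

# np.diff of a length-1 array is empty, so np.unique of it is too.
diffs = np.unique(np.diff(points))
print(diffs)
# step, = diffs  # would raise: not enough values to unpack (expected 1, got 0)
```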

DPeterK (Contributor) commented Oct 8, 2021

Hi @eggio - it's a little tricky without being able to see your sub NetCDF files, but I wonder if CDO has split them such that they're scalar in the dimension you need to append them along. If we imagine a stylised view of your original NetCDF file and the sub-files that result, you might have ended up with something like the following:

my_file.nc: t - 3, y - 96, x - 172

my_subfile_1.nc: y - 96, x - 172
                 scalar t: 0
my_subfile_2.nc: y - 96, x - 172
                 scalar t: 1
my_subfile_3.nc: y - 96, x - 172
                 scalar t: 2

Scalar coordinates don't behave the same as dimensioned coordinates (they don't have a shape, among other things). That might be sufficient to cause the error you've encountered. In that case, you'll need to promote the scalar coordinate to a length-1 dimension coordinate. Continuing our stylised view of NetCDF files, this would have the following impact on the shape of all the sub-files (although I'm only showing one here for brevity!):

my_subfile_1.nc: t - 1, y - 96, x - 172
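The promotion step can be sketched with xarray's `expand_dims`; the dataset below is an illustrative stand-in for one of the CDO-produced sub-files, with "t" as a scalar coordinate:

```python
import numpy as np
import xarray as xr

# A sub-file as the splitting tool may have produced it:
# "t" is a scalar coordinate, not a dimension (names illustrative).
sub = xr.Dataset(
    {"tas": (("y", "x"), np.zeros((96, 172)))},
    coords={"t": 0},
)

# Promote the scalar "t" to a length-1 dimension coordinate.
fixed = sub.expand_dims("t")
# fixed.to_netcdf("my_subfile_1.nc")  # rewrite the file with the fixed layout
```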

However, tiledb_netcdf will still need some help to append these files, as you can't calculate a step between points from a single point! In that case you'll need to specify a baseline offset when you run the append, so that the append operation knows what to look at to calculate the offset between successive single points. This looks like:

writer.append(append_files, unlimited_dims, data_array_name, baselines=append_files[0])

As an aside, it looks like you're using a previous version of tiledb_netcdf. The latest release includes some significant updates to the append process. It won't necessarily solve the issue you're having, but it should make the append more likely to work! If you can update, I recommend it 🙂
