
Chunking option for TileDBwriter #74

Open
eggio opened this issue Sep 28, 2021 · 3 comments
eggio commented Sep 28, 2021

Add the option to write data from a netCDF file to a TileDB array chunk-wise, to avoid running into a MemoryError. Also give the option to choose which dimension to chunk along.

Example: I am trying to convert a >13GB netCDF file to a single sparse TileDB array, but cannot manage to do this with the TileDBwriter because I am limited to <8GB RAM.

DPeterK (Contributor) commented Sep 30, 2021

Hi @eggio, thanks for raising this - and great use-case! As you can tell, I hadn't considered this use-case at all when putting tiledb-netcdf together... because I never encountered a NetCDF file sufficiently large that it wouldn't fit into memory. The challenge from the TileDB side is that you have to write all of the contents of an array in one go. In the way that tiledb-netcdf is designed, this minimum unit is a single NetCDF file, which as you have found causes issues when you have a large volume of data to write. I believe this requirement from the TileDB side is being investigated by the TileDB devs, so it may become possible in the future to stream chunks from a NetCDF file to an open TileDB array object.

Thinking about this does expose a difficulty and thus a possible solution as well, however. Even if you could stream a NetCDF file chunk-wise to the open TileDB array, you'd still need to know where index-wise within the array each chunk should end up, so that the TileDB array is a faithful representation of the original NetCDF file. Of course tiledb-netcdf already has that functionality in the append pipeline, which suggests a route to a workaround and possible enhancement.

So, the workaround: assuming you have the disk space, you could split your large NetCDF file into multiple small NetCDF files along your chunking dimension using a tool such as Iris or xarray (or even the NetCDF commandline tools). You could then append these successively to a TileDB array using tiledb-netcdf's current array append pipeline, and that would achieve the result you're after at the cost of (temporarily) having more NetCDF files to handle.
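The splitting step of this workaround can be sketched with xarray (one of the tools suggested above); the dataset, variable name "tas", and dimension name "time" are all illustrative stand-ins for the real >13GB file:

```python
import numpy as np
import xarray as xr

# Small in-memory stand-in for the large NetCDF file:
# 3 timesteps of a 4x5 grid (names and sizes are illustrative).
ds = xr.Dataset(
    {"tas": (("time", "y", "x"), np.arange(60.0).reshape(3, 4, 5))},
    coords={"time": [0, 1, 2]},
)

# One sub-dataset per timestep along the chunking dimension.
# Using slice(i, i + 1) rather than isel(time=i) keeps "time" as a
# length-1 dimension instead of collapsing it to a scalar coordinate.
subsets = [ds.isel(time=slice(i, i + 1)) for i in range(ds.sizes["time"])]

for i, sub in enumerate(subsets):
    # In practice each piece would be written out for appending, e.g.:
    # sub.to_netcdf(f"my_subfile_{i}.nc")
    print(i, dict(sub.sizes))
```

Each sub-file can then be fed to tiledb-netcdf's existing append pipeline in order.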

And the possible enhancement: currently tiledb-netcdf assumes you want to write the entire data variable of a NetCDF file to the TileDB array. This happens in the write_array helper function. This could be adjusted so that (optionally) the indices passed to the writer helper function could be applied to subset the NetCDF data variable as well as specifying the extents within the TileDB array to which the NetCDF data variable should be written. I'll raise this as a separate issue so that the idea is not lost in this discussion!
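To make the proposed enhancement concrete, here is a hypothetical sketch of the idea: the same indices subset the NetCDF variable and place the result within the TileDB array. `write_array` below is a stand-in with an invented signature, not tiledb-netcdf's actual helper, and plain numpy arrays stand in for both the NetCDF variable and the open TileDB array:

```python
import numpy as np

def write_array(tdb_array, nc_var, indices):
    """Hypothetical chunked write: nc_var[indices] lands at tdb_array[indices]."""
    tdb_array[indices] = np.asarray(nc_var[indices])

# Toy usage: copy one chunk along the first dimension.
source = np.arange(12.0).reshape(3, 4)   # "NetCDF data variable"
target = np.zeros((3, 4))                # "open TileDB array"
write_array(target, source, np.s_[1:2, :])
```

The key point is that the indices serve double duty, so a caller could loop over chunks without ever loading the whole variable.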

eggio (Author) commented Oct 4, 2021

Thanks for the fast answer with the workaround for the chunking in tiledb_netcdf.
I tried your suggestion and generated sub-netCDF files from my initial file using CDO.
However, I get an error with "writer.append(append_files, unlimited_dims, data_array_name)". This is the error:

"ValueError: not enough values to unpack (expected 1, got 0)"

In "\nctotdb\writers\tiledb.py", line 731, in _dim_points: step, = np.unique(np.diff(points))

I have no idea what the origin of this error is; could you help me?

My append_files list is just a list of the filenames of the netCDF files, and the files are directly in my working directory.
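For reference, the failing line from the traceback can be reproduced in isolation. This is a sketch: `points` stands in for the dimension points tiledb_netcdf extracts from a file, and with only one point there is no difference to compute, so there is no step value to unpack:

```python
import numpy as np

# A single dimension point, as a file with one entry along the
# append dimension would produce (illustrative values).
points = np.array([0])

# np.diff of a length-1 array is empty, so np.unique of it is too.
diffs = np.unique(np.diff(points))
print(diffs)
# step, = diffs  # would raise: not enough values to unpack (expected 1, got 0)
```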

DPeterK (Contributor) commented Oct 8, 2021

Hi @eggio - it's a little tricky without being able to see your sub NetCDF files, but I wonder if CDO has split them such that they're scalar in the dimension you need to append them along. If we imagine a stylised view of your original NetCDF file and the sub-files that result, you might have ended up with something like the following:

my_file.nc: t - 3, y - 96, x - 172

my_subfile_1.nc: y - 96, x - 172
                 scalar t: 0
my_subfile_2.nc: y - 96, x - 172
                 scalar t: 1
my_subfile_3.nc: y - 96, x - 172
                 scalar t: 2

Scalar coordinates don't behave the same as dimensioned coordinates (they don't have a shape, among other things). That might be sufficient to cause the error you've encountered. In that case, you'll need to promote the scalar coordinate to a length-1 dimension coordinate. Continuing our stylised view of NetCDF files, this would have the following impact on the shape of all the sub-files (although I'm only showing one here for brevity!):

my_subfile_1.nc: t - 1, y - 96, x - 172
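The promotion step can be sketched with xarray's `expand_dims`; the dataset below is an illustrative stand-in for one of the CDO-produced sub-files, with "t" as a scalar coordinate:

```python
import numpy as np
import xarray as xr

# A sub-file as the splitting tool may have produced it:
# "t" is a scalar coordinate, not a dimension (names illustrative).
sub = xr.Dataset(
    {"tas": (("y", "x"), np.zeros((96, 172)))},
    coords={"t": 0},
)

# Promote the scalar "t" to a length-1 dimension coordinate.
fixed = sub.expand_dims("t")
# fixed.to_netcdf("my_subfile_1.nc")  # rewrite the file with the fixed layout
```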

However, tiledb_netcdf will still need some help to append these files, as you can't calculate a step between points from a single point! In that case you'll need to specify a baseline offset when you run the append, so that the append operation knows what to look at to calculate the offset between successive single points. This looks like:

writer.append(append_files, unlimited_dims, data_array_name, baselines=append_files[0])

As an aside, it looks like you're using a previous version of tiledb_netcdf. The latest release includes some significant updates to the append process. It won't necessarily solve the issue you're having, but it should make the append more likely to work! If you can update, I recommend it 🙂
