[ENH] Write a subset of a NetCDF file to TileDB #75

Open
DPeterK opened this issue Sep 30, 2021 · 1 comment

Comments

@DPeterK
Contributor

DPeterK commented Sep 30, 2021

As per #74, you may wish to write a subset of a NetCDF file to a TileDB array rather than the whole file. Possible use-cases:

  • the NetCDF file is too large to fit into system memory as a whole
  • you only wish to store a subset of the NetCDF file's data variable in a TileDB array.

One way this could be achieved is to optionally index the NetCDF data variable with the same write indices used to write the data to the TileDB array in write_array. For example:

def write_array(..., subset_data_var=False):
    ...
    with tiledb.open(array_filename, 'w', ctx=ctx) as A:
        if subset_data_var:
            # Read only the requested subset of the NetCDF data variable
            # and write it to the matching region of the TileDB array.
            A[write_indices] = data_var[write_indices]
        else:
            # Existing behaviour: write the whole data variable.
            A[write_indices] = data_var[...]

I expect there will be 🐉 with keeping track of the indices, particularly ensuring that the NetCDF coordinate variables are indexed in line with the indices used for the NetCDF data variable.
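A minimal sketch of keeping the coordinates aligned, assuming the coordinate variables are named after their dimensions (the file name, variable name and indices below are placeholders, not part of the current write_array API):

import netCDF4
import tiledb

# Hypothetical subset: first 100 time steps, all other dimensions in full.
write_indices = (slice(0, 100), slice(None), slice(None))

ds = netCDF4.Dataset('input.nc')
data_var = ds.variables['tas']

# Slice each coordinate variable with the index for its own dimension,
# so the coordinates describe exactly the same region as the data subset.
coords = {
    dim: ds.variables[dim][idx]
    for dim, idx in zip(data_var.dimensions, write_indices)
}

with tiledb.open(array_filename, 'w') as A:
    A[write_indices] = data_var[write_indices]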

@eggio

eggio commented Oct 4, 2021

Thanks for your suggestion.
Indeed, I already tried it in roughly this way. I used xarray to read the NetCDF file, then I get the data values for the variable 'tas' with something like:

# Take the first 100 time steps of 'tas' and flatten to a 1D array.
data = xr_ds['tas'][0:100].to_numpy()
data = data.flatten()

In my case I have the dimensions time, x, y; here I get the first 100 entries for time (0 to 100) for all x and y (climate data).
For the dimension values I use np.tile() and np.flatten() to generate a huge 2D array with all the coordinates the data values should be written to. Finally I write it to TileDB using this line:

with tiledb.open(tdb_name, mode='w') as write_array:
    write_array[tuple(dim_values)] = {'tas': data}

In my case, when I use most of my memory, the shapes of dim_values and data are (18272925, 3) and (18272925,).
The writing process alone takes roughly 50 s, and this is only about 1/500 of all the data I want to write to TileDB.
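As a sketch, the dim_values coordinate arrays described above could be built like this, assuming the 'tas' dimensions are ordered (time, x, y); time_vals, x_vals and y_vals are placeholder names for the actual dimension arrays, and np.meshgrid gives the same C-ordered coordinates as the np.tile/np.flatten construction:

import numpy as np
import tiledb

# One (time, x, y) coordinate per flattened data value, in the same C order
# that data.flatten() produces: time varies slowest, y fastest.
tt, xx, yy = np.meshgrid(time_vals[0:100], x_vals, y_vals, indexing='ij')
dim_values = (tt.ravel(), xx.ravel(), yy.ravel())

with tiledb.open(tdb_name, mode='w') as write_array:
    write_array[dim_values] = {'tas': data}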

I also saw in your code for TileDBWriter that you are using the same line for the writing. Do you think there is a chance this can be sped up or done in a faster manner? Otherwise, adding the chunking to tiledb_netcdf wouldn't speed up the writing that much for me, I think.
