[ENH] Write a subset of a NetCDF file to TileDB #75

Open
DPeterK opened this issue Sep 30, 2021 · 1 comment

Comments

@DPeterK
Contributor

DPeterK commented Sep 30, 2021

As per #74, you may wish to write a subset of a NetCDF file to a TileDB array rather than the whole file. Possible use-cases:

  • the NetCDF file is too large to fit into system memory as a whole
  • you only wish to store a subset of the NetCDF file's data variable in a TileDB array.

One way this could be achieved is to optionally index the NetCDF data variable with the same write indices used to write the data to the TileDB array in write_array. For example:

def write_array(..., subset_data_var=False):
    ...
    with tiledb.open(array_filename, 'w', ctx=ctx) as A:
        if subset_data_var:
            # Read only the requested subset of the NetCDF data variable
            # and write it to the matching region of the TileDB array.
            A[write_indices] = data_var[write_indices]
        else:
            # Existing behaviour: write the whole data variable.
            A[write_indices] = data_var[...]

I expect there will be 🐉 with keeping track of the indices, particularly ensuring that the NetCDF coordinate variables are indexed in line with the indices used for the NetCDF data variable.
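A minimal sketch of keeping the coordinates aligned, assuming the coordinate variables are named after their dimensions (the file name, variable name and indices below are placeholders, not part of the current write_array API):

import netCDF4
import tiledb

# Hypothetical subset: first 100 time steps, all other dimensions in full.
write_indices = (slice(0, 100), slice(None), slice(None))

ds = netCDF4.Dataset('input.nc')
data_var = ds.variables['tas']

# Slice each coordinate variable with the index for its own dimension,
# so the coordinates describe exactly the same region as the data subset.
coords = {
    dim: ds.variables[dim][idx]
    for dim, idx in zip(data_var.dimensions, write_indices)
}

with tiledb.open(array_filename, 'w') as A:
    A[write_indices] = data_var[write_indices]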

@eggio

eggio commented Oct 4, 2021

Thanks for your suggestion.
Indeed, I already tried it in roughly this way. I used xarray to read the NetCDF file, then I get the data values for the variable 'tas' with something like:

# Take the first 100 time steps of 'tas' and flatten to a 1D array.
data = xr_ds['tas'][0:100].to_numpy()
data = data.flatten()

In my case I have the dimensions time, x, y; here I get the first 100 entries for time (0 to 100) for all x and y (climate data).
For the dimension values I use np.tile() and np.flatten() to generate a huge 2D array with all the coordinates the data values should be written to. Finally I write it to TileDB using this line:

with tiledb.open(tdb_name, mode='w') as write_array:
    write_array[tuple(dim_values)] = {'tas': data}

In my case, when I use most of my memory, the shapes of dim_values and data are (18272925, 3) and (18272925,).
The writing process alone takes roughly 50 s, and this is only about 1/500 of all the data I want to write to TileDB.
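As a sketch, the dim_values coordinate arrays described above could be built like this, assuming the 'tas' dimensions are ordered (time, x, y); time_vals, x_vals and y_vals are placeholder names for the actual dimension arrays, and np.meshgrid gives the same C-ordered coordinates as the np.tile/np.flatten construction:

import numpy as np
import tiledb

# One (time, x, y) coordinate per flattened data value, in the same C order
# that data.flatten() produces: time varies slowest, y fastest.
tt, xx, yy = np.meshgrid(time_vals[0:100], x_vals, y_vals, indexing='ij')
dim_values = (tt.ravel(), xx.ravel(), yy.ravel())

with tiledb.open(tdb_name, mode='w') as write_array:
    write_array[dim_values] = {'tas': data}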

I also saw in your code for TileDBWriter that you are using the same line for the writing. Do you think there is a chance this can be sped up or done in a faster manner? Otherwise, adding the chunking to tiledb_netcdf wouldn't speed up the writing that much for me, I think.
