classify_variables() fails to detect 'tasmax' data var? #33

Open
CloudNiner opened this issue May 1, 2020 · 3 comments

@CloudNiner

Hi -- I found your library and am looking to use it to expedite writing some NetCDF files to TileDB arrays for testing.

I'm using the LOCA climate dataset as my test and started with this single file: s3://nasanex/LOCA/GFDL-ESM2G/16th/rcp85/r1i1p1/tasmax/tasmax_day_GFDL-ESM2G_rcp85_r1i1p1_20500101-20501231.LOCA_2016-04-02.16th.nc

When I load that file with this code:

input_netcdf_file = "path/to/file/on/disk/loca.nc"
data_model = NCDataModel(input_netcdf_file)
data_model.classify_variables()
data_model.get_metadata()

the program crashes with the error:

Traceback (most recent call last):
  File "netcdf_to_tiledb.py", line 70, in <module>
    main()
  File "netcdf_to_tiledb.py", line 45, in main
    data_model.get_metadata()
  File "/Users/cloudniner/src/tiledb-benchmarks/netcdf/tiledb_netcdf/nctotdb/data_model.py", line 217, in get_metadata
    raise ValueError(f'Expected to find at least one data var, but found {n_data_vars}.')
ValueError: Expected to find at least one data var, but found 0.

For this file, I would expect the data var to be tasmax, which looks like this:

>>> hasattr(nc.variables.get("tasmax"), 'coordinates')
False
>>> nc.variables.get("tasmax").dimensions
('time', 'lat', 'lon')

tasmax doesn't have a coordinates attr as read by the netCDF4 lib, so it fails the check here: https://github.com/informatics-lab/tiledb_netcdf/blob/master/nctotdb/data_model.py#L64
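For what it's worth, a fallback heuristic for CF-style files like this one might be to treat any variable that is not itself a dimension coordinate or a bounds variable, and whose dimensions all have same-named coordinate variables, as a data variable. A rough sketch, not the library's actual logic (`looks_like_data_var` is a hypothetical helper):

```python
def looks_like_data_var(name, variables, dimensions):
    """Heuristic sketch: treat a variable as a data variable if it is not
    itself a dimension coordinate, not a bounds variable, and every one of
    its dimensions has a same-named (coordinate) variable in the file."""
    var = variables[name]
    if name in dimensions:                   # dimension coordinate, e.g. 'time'
        return False
    if name.endswith(("_bnds", "_bounds")):  # bounds variable, e.g. 'time_bnds'
        return False
    return all(dim in variables for dim in var.dimensions)
```

This would still misclassify a multi-dimensional aux coordinate whose dimensions all carry coordinate variables, so it's only a heuristic, not a complete answer.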

I attempted to hack my way past this, but that leads to another downstream error:

    data_model = NCDataModel(args.input_netcdf_file)
    data_model.classify_variables()
    # HACK: lib fails to detect our data var, so we manually set it before calling get_metadata
    # data_model.data_var_names = ['tasmax']
    data_model.get_metadata()

    tdb_writer = TDBWriter(data_model, args.output_tiledb_array)
    """
    This currently crashes with error:
    Converting ./data/tasmax_day_GFDL-ESM2G_rcp85_r1i1p1_20500101-20501231.LOCA_2016-04-02.16th.nc to ./data/tasmax_day_GFDL-ESM2G_rcp85_r1i1p1_20500101-20501231.LOCA_2016-04-02.16th.tiledb
        Traceback (most recent call last):
        File "netcdf_to_tiledb.py", line 51, in <module>
            main()
        File "netcdf_to_tiledb.py", line 47, in main
            tdb_writer.create_domains()
        File "/Users/cloudniner/src/tiledb-benchmarks/netcdf/tiledb_netcdf/nctotdb/writers.py", line 320, in create_domains
            self.create_domain_arrays(domain_coords, group_dirname, coords=True)
        File "/Users/cloudniner/src/tiledb-benchmarks/netcdf/tiledb_netcdf/nctotdb/writers.py", line 230, in create_domain_arrays
            tiledb.Array.create(array_filename, schema)
        File "tiledb/libtiledb.pyx", line 3150, in tiledb.libtiledb.Array.create
        File "tiledb/libtiledb.pyx", line 413, in tiledb.libtiledb._raise_ctx_err
        File "tiledb/libtiledb.pyx", line 398, in tiledb.libtiledb._raise_tiledb_error
        tiledb.libtiledb.TileDBError: [TileDB::IO] Error: Cannot create directory '/Users/cloudniner/src/tiledb-benchmarks/netcdf/data/tasmax_day_GFDL-ESM2G_rcp85_r1i1p1_20500101-20501231.LOCA_2016-04-02.16th.tiledb/tasmax_day_GFDL-ESM2G_rcp85_r1i1p1_20500101-20501231.LOCA_2016-04-02.16th/data/tasmax_day_GFDL-ESM2G_rcp85_r1i1p1_20500101-20501231.LOCA_2016-04-02.16th.tiledb/tasmax_day_GFDL-ESM2G_rcp85_r1i1p1_20500101-20501231.LOCA_2016-04-02.16th/domain_0/time'; No such file or directory
    """
    tdb_writer.create_domains()

I've installed all the necessary libraries with conda. I'm currently using tiledb 1.7.7, tiledb-py 0.5.9, and iris 2.4.0 on macOS 10.15.4.

@DPeterK
Contributor

DPeterK commented Jun 2, 2020

Hi @CloudNiner - thanks for getting in touch! Apologies for a very slow reply, shifting work pressures and then a global pandemic have rather limited the time I've had to spend working on this library...

Looks like you've successfully found the root cause of the issue you hit. Unfortunately, due to the vagaries of the NetCDF spec, it's very hard to write a fully accurate test for "is this variable a data variable?". The one I settled on is "this variable has a coordinates attribute", since I used "this variable has a dimensions attribute" to identify the coordinate variables in the NetCDF file. But, again, these are my conventions only, as there is no such convention in the NetCDF spec.

I'm not sure what I can do to change the classification code to help with this. I can't use "multiple values in the dimensions attribute", because that would mis-classify multi-dimensional coordinate variables as data variables. I also can't insist that all your NetCDF files have a coordinates attribute on their data variables, because (a) I can't back that up from the file format spec, and (b) it would mean extra effort for you (although the ncatted command-line utility would probably let you do this if you did want to follow that route), which seems unfair.

I'm surprised that manually setting the data variables then calling get_metadata() didn't solve this issue. It might be worth checking that data_model.domains and data_model.domain_varname_mapping are set and look sensible after calling get_metadata() - I think I'd expect the following (it's been a while since I looked at the code):

>>> print(data_model.domains)
('time', 'lat', 'lon')
>>> print(data_model.domain_varname_mapping)
{('time', 'lat', 'lon'): 'tasmax'}

Note the second one might actually have domain_0 in it somewhere instead (I'm getting that from the directory that TileDB was trying to create). In that case you're not using the most up-to-date version of the library, which dispenses with the unhelpful iteration-based domain names and replaces them with a string of the dimensions covered by the domain. There's a small chance that updating the library will solve this for you.

If you don't get this then there's still something awry in the metadata classification, which will be the next thing to run the magnifying glass over...

@DPeterK DPeterK mentioned this issue Jun 2, 2020
@DPeterK
Contributor

DPeterK commented Jun 5, 2020

I just had another thought about this - when you ran the classification on this file, did you get a note that tasmax was unclassified at the end of the classification process? If not, it means tasmax was classified, but incorrectly - probably as an aux coordinate. That duplicated, incorrect classification may be the actual cause of the later error when creating the writer class.

You could try this (assuming it has classified incorrectly as an aux coordinate):

input_netcdf_file = "path/to/file/on/disk/loca.nc"
data_var_name = 'tasmax'

data_model = NCDataModel(input_netcdf_file)
data_model.classify_variables()

data_model.data_var_names = [data_var_name]
data_model.aux_coord_names.pop(data_model.aux_coord_names.index(data_var_name))

data_model.get_metadata()
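If tasmax ends up filed in a different bucket on some files, a more defensive version of the same reclassification could check each list before removing. A sketch, where `reclassify_as_data_var` is a hypothetical helper and the attribute names are the ones that appear in this thread:

```python
def reclassify_as_data_var(data_model, name):
    """Sketch: move `name` out of whichever classification list it landed
    in, then register it as a data variable. The attribute names below are
    taken from this thread; any that don't exist are simply skipped."""
    for attr in ("aux_coord_names", "dim_coord_names"):
        names = getattr(data_model, attr, None)
        if names is not None and name in names:
            names.remove(name)
    if name not in data_model.data_var_names:
        data_model.data_var_names.append(name)
```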

@CloudNiner
Author

Thanks for the detailed responses and no worries about the delay!

First I updated to the latest master, ae98ce2

It looks like tasmax is indeed misclassified as an aux variable. So, I constructed the data model and attempted to write with the following code:

from tiledb_netcdf.nctotdb import TileDBWriter
from tiledb_netcdf.nctotdb import NCDataModel

data_model = NCDataModel('./data/tasmax_day_GFDL-ESM2G_rcp85_r1i1p1_20500101-20501231.LOCA_2016-04-02.16th.nc')
with data_model.open_netcdf():
    data_model.classify_variables()
    data_model.data_var_names = ['tasmax']
    data_model.aux_coord_names = []
    data_model.get_metadata()

tiledb_save_path = './data/tasmax.tiledb'
tiledb_name = 'tasmax'
writer = TileDBWriter(data_model, array_filepath=tiledb_save_path, array_name=tiledb_name)
writer.create_domains()

Before the writer is created, the data model looks like this:

>>> data_model.domains
[('time', 'lat', 'lon')]
>>> data_model.domain_varname_mapping
{('time', 'lat', 'lon'): ['tasmax']}
>>> data_model.dim_coord_names
['lon', 'lat', 'time', 'lon', 'lat', 'time']
>>> data_model.data_var_names
['tasmax']
>>> data_model.aux_coord_names
[]
>>> data_model.variable_names
['lon', 'lon_bnds', 'lat', 'lat_bnds', 'time', 'time_bnds', 'tasmax']
>>> data_model.varname_domain_mapping
{'tasmax': ('time', 'lat', 'lon')}

Unfortunately the writer crashes with this error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/andrew/src/tiledb-benchmarks/netcdf/tiledb_netcdf/nctotdb/writers.py", line 613, in create_domains
    self.populate_multiattr_array(data_array_name, domain_var_names, domain_name)
  File "/Users/andrew/src/tiledb-benchmarks/netcdf/tiledb_netcdf/nctotdb/writers.py", line 547, in populate_multiattr_array
    A.meta[key] = value
  File "tiledb/libmetadata.pyx", line 428, in tiledb.libtiledb.Metadata.__setitem__
  File "tiledb/libmetadata.pyx", line 215, in tiledb.libtiledb.put_metadata
  File "tiledb/libmetadata.pyx", line 82, in tiledb.libtiledb.pack_metadata_val
ValueError: Unsupported item type '<class 'tuple'>'

I tossed some debug output in there and it looks like that crash happens when I attempt to write the following key + value:

_FillValue = 1.0000000150474662e+30

Weird, because that's a number, not a tuple... in fact, it's a numpy.float32:

>>> type(rootgrp.variables['tasmax'].getncattr('_FillValue'))
<class 'numpy.float32'>
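One type-preserving workaround might be to unwrap numpy scalars into native Python values before assigning them as array metadata. A sketch, assuming the attribute value may be a numpy scalar or a small array (`to_native` is a hypothetical helper, not part of the library):

```python
import numpy as np

def to_native(value):
    """Sketch: unwrap numpy scalars (and small attribute arrays) into
    native Python values before storing them as array metadata."""
    if isinstance(value, np.generic):
        return value.item()    # e.g. numpy.float32 -> float
    if isinstance(value, np.ndarray):
        return value.tolist()  # small attribute arrays -> list
    return value               # already a native type; pass through
```

Something like `A.meta[key] = to_native(value)` would then keep `_FillValue` as a float rather than flattening everything to strings.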

This appears to be an issue with the underlying TileDB-Py library -- I'd expect many NetCDF files to have attributes that are read as numpy types. If I update https://github.com/informatics-lab/tiledb_netcdf/blob/master/nctotdb/writers.py#L543 to A.meta[key] = str(value), which force-casts all metadata values to strings, I get past that error but then hit another one:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/andrew/src/tiledb-benchmarks/netcdf/tiledb_netcdf/nctotdb/writers.py", line 613, in create_domains
    self.populate_multiattr_array(data_array_name, domain_var_names, domain_name)
  File "/Users/andrew/src/tiledb-benchmarks/netcdf/tiledb_netcdf/nctotdb/writers.py", line 554, in populate_multiattr_array
    grid_mapping_string = self._get_grid_mapping(data_vars[0])
  File "/Users/andrew/src/tiledb-benchmarks/netcdf/tiledb_netcdf/nctotdb/writers.py", line 258, in _get_grid_mapping
    grid_mapping_name = data_var.getncattr("grid_mapping")
  File "netCDF4/_netCDF4.pyx", line 4166, in netCDF4._netCDF4.Variable.getncattr
  File "netCDF4/_netCDF4.pyx", line 1407, in netCDF4._netCDF4._get_att 
  File "netCDF4/_netCDF4.pyx", line 1887, in netCDF4._netCDF4._ensure_nc_success
AttributeError: NetCDF: Attribute not found

Looks like it's looking for the grid_mapping ncattr on the tasmax var, which is not present on this particular data var in this NetCDF file: data_model.grid_mapping = [].

So, again it looks like we're back to the trappings of the wonderful self-describing NetCDF file format, in that a thing you're looking for doesn't exist in this particular file. If it's not necessary that the data var have a grid_mapping attr, perhaps the exception can just be caught and writing can continue, but I'm not familiar enough with the TileDB format to make that call quite yet. Thoughts on that one?
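Guarding the optional attribute could look something like this -- a sketch assuming a netCDF4-style variable exposing ncattrs()/getncattr(); `get_grid_mapping_name` is a hypothetical helper, not the library's actual _get_grid_mapping:

```python
def get_grid_mapping_name(data_var):
    """Sketch: grid_mapping is optional under the CF conventions, so only
    read it when the variable actually carries it; otherwise return None
    and let the caller skip the grid-mapping metadata."""
    if "grid_mapping" in data_var.ncattrs():
        return data_var.getncattr("grid_mapping")
    return None
```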

If it seems to you that the metadata writing issue above is a problem with the underlying library, I'll open a separate issue over there, since I don't see any mention of a similar fix in their changelog between 0.5.9 and 0.6.2.

In any case, I saw you updated the README with some notes on how to manually reclassify and use the TileDBWriter, which was super helpful after revisiting this some time later and getting back up to speed. Nice docs!
