Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cf-coding #7654

Closed
wants to merge 22 commits into from
Closed

cf-coding #7654

wants to merge 22 commits into from

Conversation

kmuehlbauer
Copy link
Contributor

@kmuehlbauer kmuehlbauer commented Mar 21, 2023

  • xref Saving and loading an array of strings changes datatype to object #7652
  • Tests added
  • User visible changes (including notable bug fixes) are documented in whats-new.rst
  • New functions/methods are listed in api.rst
  • includes
    • implement coders
    • preserve boolean-dtype within encoding, adapt test
    • determine cf packed data type
    • transform numpy object-dtype strings (vlen) to numpy unicode strings

This should also fix:

@kmuehlbauer
Copy link
Contributor Author

I'll add to whats-new.rst if an updated version is merged.

@kmuehlbauer kmuehlbauer changed the title preserve boolean-dtype within encoding preserve dtypes (bool, vlen string) within encoding Mar 21, 2023
@kmuehlbauer
Copy link
Contributor Author

The one failed run might be spurious. Maybe a re-run of this would work.

@basnijholt
Copy link

basnijholt commented Mar 22, 2023

Thanks a lot for the quick PR!

I can confirm that this fixes

But not int64 -> int32, and <U1 -> O

@kmuehlbauer
Copy link
Contributor Author

@basnijholt Thanks for testing. I can't reproduce this. Here everything works as expected. But I've a slightly different environment (full netcdf4-stack, latest versions of everything).

The tests which check for dtype equality (vlen-case) do not raise in this PR for any backend and array container, so I'm assuming this should work as expected (at least for bool->int8, and <U1 -> O.

As stated over at #7652 I can't reproduce the int64->int32 conversion in any environments I tested so far. I'll have another look at your environment.

Copy link
Contributor Author

@kmuehlbauer kmuehlbauer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added some explanations and concerns. I'd very much appreciate comments and guidance here.

Pinging @shoyer and @max-sixty as authors of relevant code.

xarray/conventions.py Outdated Show resolved Hide resolved
xarray/tests/test_backends.py Show resolved Hide resolved
xarray/conventions.py Outdated Show resolved Hide resolved
xarray/tests/test_backends.py Outdated Show resolved Hide resolved
@kmuehlbauer
Copy link
Contributor Author

But not int64 -> int32, and <U1 -> O

@basnijholt I've explained the issues whereabouts over at your issue #7652 (comment).

@kmuehlbauer
Copy link
Contributor Author

XRef: #2040 (comment)

Citing @shoyer from above comment:

The main reason why we don't do any special handling for object arrays currently in xarray is that our conventions coding/decoding system has no way of marking variable length string arrays. We should probably handle this by making a custom dtype like h5py that marks variables length strings using dtype metadata: http://docs.h5py.org/en/latest/special.html#variable-length-strings

Another interesting issue by @shoyer:
#2059

I'm really uncertain if using .astype here is the right way to go. Any comments appreciated.

@kmuehlbauer
Copy link
Contributor Author

And yet another related issue: #1647

Currently both netcdf/h5netcdf are able to set a _FillValue for VLEN strings, but for the numpy-side "NaN" can only be handled with dtype=object. Maybe it's time to consolidate string handling in xarray. But that should be taken care of in a separate issue / feature branch.

@dcherian
Copy link
Contributor

Any comments appreciated.

Let's discuss at next week's meeting. @kmuehlbauer can you make it? 9.30am MT Wed 29 Mar 2023

@kmuehlbauer
Copy link
Contributor Author

make

That's 15 UTC, or 17 CEST (my local time). Should work for me. I'll try to collect all available information on that topic and the current status.

@basnijholt
Copy link

Just leaving a note here. I would expect that the datatype that was saved, is the datatype that is loaded. So preferably if I save a string array of e.g., type <U5, I expect it would still be <U5 when loaded, not suddenly HDF5 VLEN types.

Thanks again @kmuehlbauer for digging into this problem and all your work! 😄

@kmuehlbauer kmuehlbauer force-pushed the preserve-bool-dtype branch 2 times, most recently from 64ef0b9 to b5dad00 Compare March 31, 2023 11:03
@kmuehlbauer kmuehlbauer changed the title preserve dtypes (bool, vlen string) within encoding cf-coding Mar 31, 2023
@kmuehlbauer
Copy link
Contributor Author

@dcherian @basnijholt

After the dev-meeting I've taken a step back and first implemented the coders as mentioned in @shoyer's ToDo.

I've fixed the one bool->int issue and it now derives the dtype for ScaleOffset coding from scale_factor add_offset.

I've improved some test with regard to the scale/offset issue.

I'll concentrate on the string fillvalue issues in a follow up PR.

xarray/tests/test_backends.py Outdated Show resolved Hide resolved
xarray/coding/variables.py Outdated Show resolved Hide resolved
xarray/coding/variables.py Outdated Show resolved Hide resolved
xarray/coding/variables.py Outdated Show resolved Hide resolved
@kmuehlbauer
Copy link
Contributor Author

kmuehlbauer commented Apr 1, 2023

@dcherian @Illviljan Thanks for the first round of review. I've rebased everything on latest main. Now the code moving from conventions.py to coding.variable.py is correct. I've also removed the functions which have been converted to VariableCoders and adapted the tests.

To sum up this PR, it does:

@kmuehlbauer
Copy link
Contributor Author

@Illviljan I'm not able to figure out the typing if I want to use Data-types as functions to convert python numbers to array scalars. If you have any suggestion how to fix this, please let me know.

def _choose_float_dtype(dtype: np.dtype, has_offset: bool) -> type[np.floating[Any]]:
"""Return a float dtype that can losslessly represent `dtype` values."""
# Keep float32 as-is. Upcast half-precision to single-precision,
def _choose_float_dtype(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mankoff do you have time to take a look here please?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @dcherian . I'm not sure what to look for. I will link to my open-but-stale PR that tried to start addressing this/similar issues: #6812

Perhaps also relevant are my last few comments on #2304 (see #2304 (comment) ). The problem for me is that (1) the CF standards are ambiguously defined and (2) xarray needs to address the many use-cases where the CF standards are not followed (usually this means different data types).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Does this PR fix your original issue?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @mankoff, I'll have a look at your extensive notes over there and try to come up with aomething.

@kmuehlbauer
Copy link
Contributor Author

Still hunting for corner cases and issues inside encode_cf_variable/decode_cf_variable.

It looks like I already see some light again. Not sure, if this is the last iteration, but the testsuite is still running green with added and enhanced tests, which is not that bad.

Unfortunately #2304 is still an issue for now. I'll clarify that later with an added test.

@kmuehlbauer
Copy link
Contributor Author

@dcherian Just a heads-up: I find this PR getting more and more involved at different parts of the machinery and hard to follow for reviewers. I'll split this up and start with the more or less undisputed changes.

@kmuehlbauer
Copy link
Contributor Author

As explained I've created two PR (#7719 and #7720) for the "easy" changes from this PR. Would be great, if those could go in fast. Thanks!

@kmuehlbauer kmuehlbauer marked this pull request as draft May 8, 2023 18:08
@kmuehlbauer
Copy link
Contributor Author

I've converted to draft for now, as I'm still evaluating solutions for the CF encoding/decoding.

@kmuehlbauer
Copy link
Contributor Author

I'll close this one. Most things have been addressed in other PR's.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants