Improved CF decoding #6812

Open · wants to merge 5 commits into base: main · Changes from 3 commits
4 changes: 2 additions & 2 deletions xarray/coding/variables.py
@@ -237,7 +237,7 @@ def _choose_float_dtype(dtype, has_offset):
     # Sensitivity analysis can be tricky, so we just use a float64
     # if there's any offset at all - better unoptimised than wrong!
     if not has_offset:
-        return np.float32
+        return np.float64
Contributor:
I think the code matches the comments. It would be clearer if written as:

    if has_offset:
        return np.float64
    else:
        return np.float32

Without your edits, if there is an offset the condition does not trigger and we return np.float64 later.

Contributor Author:
Thanks for reviewing this pull request. FYI my original comment (later edited) said:

    Also, before this is merged, I'd like to suggest a larger change, and possibly discuss architecture here a bit (if appropriate). Specifically, I'd like to change the _choose_float_dtype function, and the two calls to it, to pass in the dtypes of scale_factor and add_offset, in addition to the data dtype. This function should then return the highest-precision dtype of the three.

Currently, _choose_float_dtype returns float32 if the data is float16 or float32, even if the scale_factor dtype is float64.

Based on your comment, I think my original intuition - that this function needs a large rewrite - is correct. I'll look into this and submit additional commits to this PR.

Contributor:
Thanks for looking into this!

# For all other types and circumstances, we just use float64.
# (safe because eg. complex numbers are not supported in NetCDF)
return np.float64
@@ -269,7 +269,7 @@ def decode(self, variable, name=None):
         if "scale_factor" in attrs or "add_offset" in attrs:
             scale_factor = pop_to(attrs, encoding, "scale_factor", name=name)
             add_offset = pop_to(attrs, encoding, "add_offset", name=name)
-            dtype = _choose_float_dtype(data.dtype, "add_offset" in attrs)
+            dtype = _choose_float_dtype(data.dtype, "add_offset" in encoding)
Contributor:

I suspect this fixed one issue, but the original issue still remains because we still aren't looking at the dtype of scale_factor and add_offset as recommended by the conventions.

Contributor Author:

Note - I think the conventions referred to above are: https://cfconventions.org/Data/cf-conventions/cf-conventions-1.7/build/ch08.html, which states:

    If the scale_factor and add_offset attributes are of the same data type as the associated variable, the unpacked data is assumed to be of the same data type as the packed data. However, if the scale_factor and add_offset attributes are of a different data type from the variable (containing the packed data) then the unpacked data should match the type of these attributes, which must both be of type float or both be of type double. An additional restriction in this case is that the variable containing the packed data must be of type byte, short or int. It is not advised to unpack an int into a float as there is a potential precision loss.
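As a rough illustration of the rule quoted above, here is a hedged sketch of how the CF unpacking dtype could be chosen. The helper `cf_unpacked_dtype` is hypothetical, not part of xarray; it only encodes the quoted convention text.

```python
import numpy as np

def cf_unpacked_dtype(packed_dtype, attr_dtype):
    # Hypothetical sketch of the CF packing rule quoted above.
    packed = np.dtype(packed_dtype)
    attr = np.dtype(attr_dtype)
    if attr == packed:
        # Same dtype: the unpacked data keeps the packed dtype.
        return packed
    # Different dtype: the attributes must be float or double, and the
    # packed variable must be byte, short or int.
    if attr not in (np.dtype("f4"), np.dtype("f8")):
        raise ValueError("scale_factor/add_offset must be float or double")
    if packed not in (np.dtype("i1"), np.dtype("i2"), np.dtype("i4")):
        raise ValueError("packed data must be byte, short or int")
    return attr
```

For example, int16 data with a float64 scale_factor would unpack to float64 under this rule, whereas float32 data with a float32 scale_factor would stay float32.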

if np.ndim(scale_factor) > 0:
scale_factor = np.asarray(scale_factor).item()
if np.ndim(add_offset) > 0:
15 changes: 14 additions & 1 deletion xarray/tests/test_coding.py
@@ -96,7 +96,7 @@ def test_coder_roundtrip() -> None:
assert_identical(original, roundtripped)


-@pytest.mark.parametrize("dtype", "u1 u2 i1 i2 f2 f4".split())
+@pytest.mark.parametrize("dtype", "f2 f4".split())
def test_scaling_converts_to_float32(dtype) -> None:
original = xr.Variable(
("x",), np.arange(10, dtype=dtype), encoding=dict(scale_factor=10)
@@ -109,6 +109,19 @@ def test_scaling_converts_to_float32(dtype) -> None:
assert roundtripped.dtype == np.float32


+@pytest.mark.parametrize("dtype", "u1 u2 i1 i2".split())
+def test_scaling_converts_to_float64(dtype) -> None:
+    original = xr.Variable(
+        ("x",), np.arange(10, dtype=dtype), encoding=dict(scale_factor=10)
+    )
+    coder = variables.CFScaleOffsetCoder()
+    encoded = coder.encode(original)
+    assert encoded.dtype == np.float64
+    roundtripped = coder.decode(encoded)
+    assert_identical(original, roundtripped)
+    assert roundtripped.dtype == np.float64


@pytest.mark.parametrize("scale_factor", (10, [10]))
Contributor Author:
I'm not sure that scale_factor or add_offset can be an array type per the CF spec, so I changed this test.

Contributor:
These kinds of things tend to happen though. Since we have tested for it, we should just keep it around.
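For context, the scalar coercion exercised by this parametrized test behaves like the following snippet (a minimal sketch, mirroring the `np.ndim`/`.item()` pattern in the diff above):

```python
import numpy as np

# Some files store scale_factor/add_offset as a one-element array rather
# than a scalar; the decoder coerces it back to a Python scalar.
scale_factor = [10]
if np.ndim(scale_factor) > 0:
    scale_factor = np.asarray(scale_factor).item()
# scale_factor is now the scalar 10
```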

@pytest.mark.parametrize("add_offset", (0.1, [0.1]))
def test_scaling_offset_as_list(scale_factor, add_offset) -> None: