
Relax nanosecond datetime restriction in CF time coding #9618

Open
wants to merge 28 commits into base: main

Conversation


@kmuehlbauer kmuehlbauer commented Oct 13, 2024

This is another attempt to resolve #7493. This goes a step further than #9580.

The idea of this PR is to automatically infer the resolutions needed for decoding/encoding and to keep only the constraints pandas imposes ("s" as the lowest resolution, "ns" as the highest). There is still the idea of a default resolution, but it should only take precedence if it doesn't clash with the automatic inference. This can be discussed, though. Update: I've implemented a first attempt at a default resolution on decode, which overrides the inferred resolution only towards higher resolution (e.g. 's' -> 'ns').
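
For illustration, a minimal sketch of that precedence (not code from this PR; the helper name is made up): the configured default may only refine the inferred resolution, never coarsen it.

# Illustrative only -- `resolve_resolution` is a hypothetical helper, not part of xarray.
_UNITS = ("s", "ms", "us", "ns")  # coarsest to finest; the range pandas supports

def resolve_resolution(inferred: str, default: str = "ns") -> str:
    """Return the finer of the inferred and default resolutions."""
    if inferred not in _UNITS or default not in _UNITS:
        raise ValueError(f"resolution must be one of {_UNITS}")
    return max(inferred, default, key=_UNITS.index)

print(resolve_resolution("s", default="ns"))  # "ns": the default refines the inferred unit
print(resolve_resolution("ms", default="s"))  # "ms": the default never coarsens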

For sanity checking, and also for my own good, I've created a documentation page on time-coding in the internal dev section. Any suggestions (especially grammar) or ideas for enhancements are much appreciated.

There still might be room for consolidation of functions/methods (mostly in coding/times.py), but I have to leave it alone for some days. I went down that rabbit hole and need to relax, too 😬.

Looking forward to getting your insights here, @spencerkclark, @ChrisBarker-NOAA, @pydata/xarray.

@kmuehlbauer
Contributor Author

Nice, mypy 1.12 is out and breaks our typing, 😭.

@TomNicholas
Member

Nice, mypy 1.12 is out and breaks our typing, 😭

Can we pin it in the CI temporarily?

@TomNicholas TomNicholas mentioned this pull request Oct 14, 2024
@kmuehlbauer
Contributor Author

Can we pin it in the CI temporarily?

Yes, 1.11.2 was the last version.

@kmuehlbauer kmuehlbauer marked this pull request as ready for review October 14, 2024 18:05
@kmuehlbauer
Contributor Author

This is now ready for a first round of review. I think it is already in quite a usable state.

But no rush, this should be thoroughly tested.

@spencerkclark
Member

Sounds good @kmuehlbauer! I’ll try and take an initial look this weekend.

pre-commit-ci bot and others added 4 commits October 18, 2024 05:34
… more carefully, for now using pd.Series to convert `OMm` type datetimes/timedeltas (will result in ns precision)
@kmuehlbauer
Contributor Author

@spencerkclark Thanks for taking the time to review. I've tried to be explicit with comments and maybe also my doc ramblings will help. I've also tried to fixup the current docs on time handling.

There are still some issues when creating datetime/timedelta from scratch, which currently ends up flowing through pd.Series if not converted by other means beforehand. So there are still some glitches that have to be ironed out.

Anyway, thanks and good luck!

Member

@spencerkclark spencerkclark left a comment

Thanks @kmuehlbauer—I think this is coming together well. I'll provide a more detailed review eventually. In my mind there are still some high-level things to consider:

  • To keep breaking changes from happening abruptly, we may want to have a period where the default decoding time_resolution remains "ns" before changing to something else. From my initial scan of how you modified the tests—maybe with foresight—I think it would be straightforward to switch back to defaulting to "ns" precision, which is nice. However, we still would want a way to test non-nanosecond precision behavior. Is this what you have been thinking about or did you have something else in mind? I'm happy to hear other points of view. At least in the case of casting in the Variable constructor we have been warning about this for a while.
  • How to best handle choosing a precision when decoding times encoded as floats is still an open question. I left a couple thoughts below, but I think it may need more discussion.
  • We should think more about how we handle casting when constructing Variable objects with datetime.datetime or datetime.timedelta objects—our current approach may not be up to the task for objects that cannot be represented with nanosecond precision, though this may be considered an upstream issue.

Something else to keep on our radar is the discussion in pandas-dev/pandas#58989. It looks like pandas is reconsidering how to handle precision inference in various contexts—I think what they are leaning towards is pretty similar to what you have currently for decoding, which is good.

Comment on lines +567 to +568
# todo: decide how to work with this
# as initialized by string pd.Timestamp is defined only from year -9999-01-01 to 9999-12-31
Member

I guess you were leaning towards letting pandas infer the precision from the string representation of the dates, but that is not an option because of this? I think what you have is reasonable, since cftime objects have microsecond precision.

Member

@spencerkclark spencerkclark Oct 21, 2024

Oh I didn't realize this actually fails too:

>>> pd.Timestamp(10000, 1, 1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "timestamps.pyx", line 1850, in pandas._libs.tslibs.timestamps.Timestamp.__new__
ValueError: year 10000 is out of range

Maybe we should convert through datetime.datetime now that we don't necessarily need to worry about overflow.

Contributor Author

This probably moves through the same code path ("string" -> (years, months, days, etc.) -> Timestamp()).

This is a little annoying, but given the special cases involved I won't blame pandas in any way. The move to non-nanosecond resolution was a major undertaking, and issues will be ironed out as they appear, I think.

Yes, we can try it as you suggest.

With the relaxed resolution, the representable range extends to several hundreds of thousands of centuries with microsecond representation. ``NaT`` will be at ``np.iinfo("int64").min`` for all of the different representations.

.. warning::
When initialized with a datetime string this is only defined from ``-9999-01-01`` to ``9999-12-31``.
Member

Interesting, so pandas currently does not support date string indexing or creating date ranges with strings with years greater than 10000. I didn't realize this. It's by no means a blocker (it's an upstream issue), but it would be nice to have at some point.

I guess that means we have to decode times with reference dates with years greater than or equal to 10000 with cftime for the time being, which locks us into microsecond precision in that scenario without doing something more clever in cftime_to_nptime.

Contributor Author

Yes, so it seems. I tried, but no neat solution came to mind.


2. The ``standard``/``gregorian`` and ``proleptic_gregorian`` calendars are equivalent for any dates and reference times >= ``1582-10-15``. First the reference time is checked and any timezone information is stripped off; in a second step, the minimum and maximum ``values`` are checked to see whether they can be represented in the current reference time resolution. At the same time, integer overflow would be caught. For the ``standard``/``gregorian`` calendar the dates are checked to be >= ``1582-10-15``. If anything fails, the decoding is done with ``cftime``.

3. As the time unit (here ``seconds``) and the resolution of the reference time ``1992-10-8 15:15:42.5 -6:00`` (here ``milliseconds``) might differ, the values have to be aligned to the higher resolution (retrieving a new unit). This is done by multiplying the ``values`` by the ratio of nanoseconds per time unit to nanoseconds per reference time unit. To not break consistency for ``NaT``, a mask is kept and re-introduced after the multiplication.
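
For illustration, a minimal sketch (not the PR's code) of the alignment described in step 3, using a ``seconds`` time unit and a millisecond-resolution reference time:

import numpy as np

nat = np.iinfo("int64").min
values = np.array([0, 1, 2, nat], dtype="int64")            # last entry encodes NaT
ns_per_unit = {"s": 10**9, "ms": 10**6, "us": 10**3, "ns": 1}

time_unit, ref_unit = "s", "ms"
ratio = ns_per_unit[time_unit] // ns_per_unit[ref_unit]     # 1000: seconds -> milliseconds

nat_mask = values == nat
scaled = np.where(nat_mask, 0, values) * ratio              # avoid overflowing the NaT sentinel
scaled = np.where(nat_mask, nat, scaled)                    # re-introduce NaT after scaling

reference = np.datetime64("1992-10-08T15:15:42.500", "ms")  # timezone already stripped off
print(reference + scaled.astype("timedelta64[ms]"))         # millisecond-resolution dates plus NaT
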
Member

I still worry about times encoded as floats. The following example currently truncates precision without warning:

>>> values = np.array([0, 0.25, 0.5, 0.75, 1.0], dtype="float64")
>>> units = "seconds since 2000-01-01"
>>> calendar = "standard"
>>> xr.coding.times.decode_cf_datetime(values, units, calendar)
array(['2000-01-01T00:00:00', '2000-01-01T00:00:00',
       '2000-01-01T00:00:00', '2000-01-01T00:00:00',
       '2000-01-01T00:00:01'], dtype='datetime64[s]')

I don't know what the best solution is. Always using nanosecond precision for floats would be the safest / simplest option, but that is a bit heavy handed (and is probably not what most people would want—times encoded with floats often don't need that level of precision, and it would limit the range of dates that were representable). We could make the user-configurable minimum precision used for decoding floats separate from that used for integers, but that would introduce another parameter in set_options.

Contributor Author

Yes, special-casing floats, and possibly another parameter as suggested, is a good step forward. I'll think about that more.

Member

For floats, I think we'll need to pick some default time resolution. Microseconds is probably a sane choice, certainly better than nanoseconds?

Possibly it also makes sense to issue a warning in these cases (unless explicitly silenced via some form of Xarray decoding option) so that users know what is going on.
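
For illustration, a rough sketch (not the PR's implementation) of decoding float offsets at a chosen default resolution and warning if precision would still be lost; it assumes ``seconds since <reference>`` units:

import warnings
import numpy as np

def decode_float_seconds(values, reference="2000-01-01", resolution="us"):
    per_second = {"s": 1, "ms": 10**3, "us": 10**6, "ns": 10**9}[resolution]
    scaled = np.asarray(values, dtype="float64") * per_second
    rounded = np.round(scaled)
    if not np.array_equal(scaled, rounded):
        warnings.warn(f"precision lost decoding float times at '{resolution}' resolution")
    offsets = rounded.astype("int64").astype(f"timedelta64[{resolution}]")
    return np.datetime64(reference, resolution) + offsets

print(decode_float_seconds([0, 0.25, 0.5, 0.75, 1.0]))
# microsecond resolution preserves the fractional seconds that were truncated above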

@@ -137,7 +132,7 @@ Conversion between non-standard calendar and to/from pandas DatetimeIndexes is
facilitated with the :py:meth:`xarray.Dataset.convert_calendar` method (also available as
:py:meth:`xarray.DataArray.convert_calendar`). Here, like elsewhere in xarray, the ``use_cftime``
argument controls which datetime backend is used in the output. The default (``None``) is to
use `pandas` when possible, i.e. when the calendar is standard and dates are within 1678 and 2262.
use `pandas` when possible, i.e. when the calendar is standard and dates starting with 1582-10-15.
Member

We should note that with a proleptic Gregorian calendar there is no date restriction.

Contributor Author

Yes, this is only a limitation for standard/gregorian. This needs to change.

@@ -32,6 +32,7 @@
"use_numbagg",
"use_opt_einsum",
"use_flox",
"time_resolution",
Member

I suppose we should have some way of noting that this only applies to NumPy dates (not cftime), and is only relevant to decoding. Would a more precise name be something like "minimum_numpy_time_decoding_precision", since setting this doesn't guarantee the precision of the decoded times?

It's kind of verbose though; do you have any thoughts on a clearer name? I guess one edge case where this is not necessarily the minimum precision is that you could end up with microsecond precision if we fall back to decoding times with cftime (due to dates being out of range) and convert back to NumPy.

Contributor Author

Agreed. As this switch/parameter will not be used very often in the code, we can come up with a more descriptive name. I'd go with your suggestion if we do not come up with anything better.

    ],
)
def test_infer_timedelta_units(deltas, expected) -> None:
    # todo: why testing the same thing four times?
Member

Huh good catch—it looks like this was unintentionally introduced in #9417. The other three originally were:

        (pd.to_timedelta(["1h", "1 day 1 hour"]), "hours"),
        (pd.to_timedelta(["1m", "2m", np.nan]), "minutes"),
        (pd.to_timedelta(["1m3s", "1m4s"]), "seconds"),

Maybe we should restore these in a small separate PR.

Contributor Author

Ah, copy&pasta error, I see. Yes, should be fixed in main, separately.

@@ -298,11 +246,12 @@ def convert_non_numpy_type(data):
data = utils.to_0d_object_array(data)

if isinstance(data, pd.Timestamp):
Member

This scalar handling logic is quite old—I think we might be able to defer to _possibly_convert_objects, which has the virtue that it will let pandas decide how to handle these and be consistent with how lists of these objects will be converted.

Contributor Author

That would be great if this could all flow through _possibly_convert_objects.

Comment on lines +251 to +254
if isinstance(data, datetime):
data = np.datetime64(data)
if isinstance(data, timedelta):
data = np.timedelta64(getattr(data, "value", data), "ns")
data = np.timedelta64(data)
Member

I think pandas still converts scalar datetime.datetime and datetime.timedelta objects to nanosecond precision values instead of microsecond precision values. Should we let pandas do that through _possibly_convert_objects below, or short circuit that (as you have it implemented currently) here?

I maybe lean towards the former, since it maintains the existing behavior, and is consistent with how things are handled in the case of lists or arrays of datetime.datetime or datetime.timedelta objects, though I agree in a vacuum the latter makes more sense.

Member

But unfortunately our approach using pd.Series to handle casting breaks down in the case of objects that cannot be represented with nanosecond precision, e.g.:

>>> pd.Series(datetime.datetime(1, 1, 1))
0    0001-01-01 00:00:00
dtype: object

In other words we get an object-dtype Series. I wonder if there is an open pandas issue for that.
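
For illustration (behavior as of pandas 2.x and current numpy; worth double-checking), pandas falls back to object dtype for a datetime outside the nanosecond range, while numpy keeps it at microsecond precision:

import datetime
import numpy as np
import pandas as pd

d = datetime.datetime(1, 1, 1)
print(pd.Series([d]).dtype)    # object -- no datetime64 dtype inferred
print(np.datetime64(d).dtype)  # datetime64[us]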

Contributor Author

I've tried to drop pd.Series here, but failed. So I'm open to any solution.

Member

I think pandas still converts scalar datetime.datetime and datetime.timedelta objects to nanosecond precision values instead of microsecond precision values

This behavior seems quite bad from pandas. Hopefully there's willingness to fix it upstream?

Member

Yeah, I agree, we should inquire about it if there is not an open issue already.

Comment on lines +102 to +117
try:
    pd.to_datetime([np.datetime64(1901901901901, "s")])
except Exception as err:
    print("Raises:", err)
try:
    pd.to_datetime(1901901901901, unit="s")
except Exception as err:
    print("Raises:", err)
try:
    pd.to_datetime([1901901901901], unit="s")
except Exception as err:
    print("Raises:", err)
try:
    pd.to_datetime(np.array([1901901901901]), unit="s")
except Exception as err:
    print("Raises:", err)
Member

I guess these errors are not that surprising given that pd.to_datetime assumes you want to convert to nanosecond precision if provided with numeric values. We would need some way of specifying we wanted a different precision, i.e. pandas-dev/pandas#49060 (here the unit refers to how to interpret the numerical value).
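
As a side illustration (not from the PR), the same value converts fine when the unit is given to numpy directly, since no intermediate nanosecond cast is involved; pandas >= 2.0 should then keep the second resolution:

import numpy as np
import pandas as pd

arr = np.array([1901901901901], dtype="datetime64[s]")  # far outside the nanosecond range
print(arr)
print(pd.DatetimeIndex(arr).dtype)                      # expected: datetime64[s] with pandas >= 2.0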

Contributor Author

Yes, this was more about documenting this for myself, because I've tripped over it quite a few times ;-)

Comment on lines +68 to +69
.. warning::
Input data with resolution higher than ``'ns'`` (e.g. ``'ps'``, ``'fs'``, ``'as'``) is truncated (not rounded) at the ``'ns'`` level. This is currently broken for ``'ps'`` input, where it is interpreted as ``'ns'``.
Member

Many thanks for investigating this thoroughly—I agree there is some strange behavior here. It could be worth reporting to pandas if there are not issues already.

Contributor Author

I'll create an issue on the pandas tracker when I'm back next week.

@kmuehlbauer
Contributor Author

Many thanks @spencerkclark!

* To keep breaking changes from happening abruptly, we may want to have a period where the default decoding `time_resolution` remains `"ns"` before changing to something else.  From my initial scan of how you modified the tests—maybe with foresight—I think it would be straightforward to switch back to defaulting to `"ns"` precision, which is nice.  However, we still would want a way to test non-nanosecond precision behavior.  Is this what you have been thinking about or did you have something else in mind?  I'm happy to hear other points of view.  At least in the case of casting in the Variable constructor we have been warning about this for a while.

Yes, I agree, the best way forward is to keep the current behaviour but at the same time allow users to opt in to relaxed processing. I think this is a more or less straightforward change in the code and in the tests.

* How to best handle choosing a precision when decoding times encoded as floats is still an open question.  I left a couple thoughts below, but I think it may need more discussion.

Yes, we would need to special-case floats. It seems I hadn't thought through all the issues.

* We should think more about how we handle casting when constructing Variable objects with `datetime.datetime` or `datetime.timedelta` objects—our current approach may not be up to the task for objects that cannot be represented with nanosecond precision, though this may be considered an upstream issue.

This is where I didn't find a neat solution, so we can either keep searching together or just accept that upstream has some issues.

Something else to keep on our radar is the discussion in pandas-dev/pandas#58989. It looks like pandas is reconsidering how to handle precision inference in various contexts—I think what they are leaning towards is pretty similar to what you have currently for decoding, which is good.

Great, I wasn't aware of that discussion. Good to see it is going in a similar direction.

Due to traveling I will not have time to work further on this until next week, and I might not reply in a timely manner either.

Member

@shoyer shoyer left a comment

Generally, this looks like a big step in the right direction. Thanks for tackling this, Kai!


with numpy and smooth integration with pandas.
Xarray uses the numpy dtypes ``datetime64[unit]`` and ``timedelta64[unit]``
(where unit is any of "s", "ms", "us" and "ns") to represent datetime
data, which offer vectorized (if sometimes buggy) operations with numpy and
Member

Maybe delete (if sometimes buggy) now?

The only real bug that I'm familiar with is related to what NumPy considers as the built-in Python scalar corresponding to ns precision, which is some form of int instead of datetime.datetime. This is mitigated significantly by allowing for other datetime64 units.
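
For reference, a quick demonstration of that scalar quirk:

import numpy as np

print(np.datetime64("2000-01-01", "ns").item())  # 946684800000000000 -- a plain int
print(np.datetime64("2000-01-01", "us").item())  # datetime.datetime(2000, 1, 1, 0, 0)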

doc/user-guide/time-series.rst (resolved, outdated)
Comment on lines +444 to +445
    if unit not in ["s", "ms", "us", "ns"]:
        unit = "s"
Member

Should this raise an error instead?

Contributor Author

I think no. pd.to_timedelta().to_numpy() returns timedelta64[ns]. The following check and astype just make sure that the result is in the lowest possible resolution but within the range of TimedeltaIndex (["s", "ms", "us", "ns"]).

Comment on lines +465 to +466
    if unit not in ["s", "ms", "us", "ns"]:
        as_unit = "s"
Member

same concern as above -- should this raise an error instead?

Contributor Author

Similar to above, but the other way round (decoding). pd_to_timedelta returns ns resolution. This makes sure the data is returned in the lowest possible resolution the intermediate TimedeltaIndex can handle. So for decoding "days" we finally get s resolution, which seems more appropriate than ns.
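
For illustration, a rough sketch of that unit mapping (names and logic are illustrative, not the PR's code): CF time units outside pandas' supported set fall back to "s", and the nanosecond result of pd.to_timedelta is cast accordingly.

import numpy as np
import pandas as pd

def decode_cf_timedeltas(values, cf_unit: str) -> np.ndarray:
    as_unit = cf_unit if cf_unit in ("s", "ms", "us", "ns") else "s"
    deltas = pd.to_timedelta(values, unit=cf_unit).to_numpy()  # always ns resolution here
    return deltas.astype(f"timedelta64[{as_unit}]")

print(decode_cf_timedeltas([1, 2], "days"))  # decoded at second resolution
print(decode_cf_timedeltas([1, 2], "ms"))    # stays at millisecond resolution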


Side note, and a completely unnecessary optimization, but a set is faster than a list in a literal like that, even for short lists. It's essentially the same if it's the first item and faster if it's the last. So I've been using sets in these kinds of constructs :-)

In [9]: %timeit "s" in ["s", "ms", "us", "ns"]
26.5 ns ± 2.79 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [10]: %timeit "s" in {"s", "ms", "us", "ns"}
27.4 ns ± 1 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [11]: %timeit "ns" in ["s", "ms", "us", "ns"]
60 ns ± 1.37 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [12]: %timeit "ns" in {"s", "ms", "us", "ns"}
28.2 ns ± 1.31 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

Sorry for the distraction :-) -- great work on this PR -- thanks!

Comment on lines +270 to +271
time_resolution : {"s", "ms", "us", "ns"}, default: "s"
    Time resolution used for CF encoding/decoding.
Member

If possible, we should prefer local control for setting a default time precision. This would be logical to do with encoding, e.g., encoding['dtype'] = "datetime64[s]", or perhaps via a keyword argument added to xarray.decode_cf().

Once we do this, do we need this global option?

Contributor Author

In the first place I thought the global option would be a neat solution, but it also has its drawbacks (like with multiprocessing). I'm working towards your suggestion.

Comment on lines +101 to +102
    if units.index(default) > units.index(dt.unit):
        dt = _timestamp_as_unit(dt, default)
Member

This is a little surprising. From the docstring of this function, I would expect it to always convert the units -- or perhaps raise an error if that isn't possible.

Contributor Author

OK, maybe the naming of the function/docstring is not really appropriate. This should only convert to higher resolutions if needed, but not the other way round, as that might mean loss of resolution. I'll look into this, thanks!
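
For illustration, a small sketch of the intended behavior (upcast only, never downcast), using the public pandas API rather than the PR's helper:

import pandas as pd

_UNITS = ("s", "ms", "us", "ns")

def ensure_at_least(ts: pd.Timestamp, unit: str) -> pd.Timestamp:
    """Convert `ts` to `unit` only if `unit` is finer; never coarsen (no loss of resolution)."""
    if _UNITS.index(unit) > _UNITS.index(ts.unit):
        return ts.as_unit(unit)
    return ts

ts = pd.Timestamp("2000-01-01").as_unit("s")
print(ensure_at_least(ts, "ns").unit)                # 'ns' -- upcast
print(ensure_at_least(ts.as_unit("us"), "ms").unit)  # 'us' -- not downcast to 'ms'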


xarray/core/variable.py (resolved, outdated)
@kmuehlbauer
Contributor Author

Thanks Stephan for the review. Looking into that next week.
