Relax nanosecond datetime restriction in CF time coding #9618
base: main
Conversation
Nice, mypy 1.12 is out and breaks our typing, 😭.
Can we pin it in the CI temporarily?
Yes, 1.11.2 was the last version. |
Force-pushed from ca5050d to f7396cf
This is now ready for a first round of review. I think this is already in quite a usable state. But no rush, this should be thoroughly tested.
Sounds good @kmuehlbauer! I'll try and take an initial look this weekend.
…ore/variable.py to use any-precision datetime/timedelta with automatic inferring of resolution
…ocessing, raise now early
…t resolution, fix code and tests to allow this
Force-pushed from 1249d83 to 48be73a
for more information, see https://pre-commit.ci
… more carefully, for now using pd.Series to convert `OMm` type datetimes/timedeltas (will result in ns precision)
@spencerkclark Thanks for taking the time to review. I've tried to be explicit with comments and maybe also my doc ramblings will help. I've also tried to fix up the current docs on time handling. There are still some issues when creating datetime/timedelta from scratch, which currently end up flowing through pd.Series if not converted by other means beforehand. So there are still some glitches which have to be ironed out. Anyway, thanks and good luck!
Thanks @kmuehlbauer—I think this is coming together well. I'll provide a more detailed review eventually. In my mind there are still some high-level things to consider:
- To keep breaking changes from happening abruptly, we may want to have a period where the default decoding `time_resolution` remains `"ns"` before changing to something else. From my initial scan of how you modified the tests—maybe with foresight—I think it would be straightforward to switch back to defaulting to `"ns"` precision, which is nice. However, we still would want a way to test non-nanosecond precision behavior. Is this what you have been thinking about or did you have something else in mind? I'm happy to hear other points of view. At least in the case of casting in the Variable constructor we have been warning about this for a while.
- How to best handle choosing a precision when decoding times encoded as floats is still an open question. I left a couple thoughts below, but I think it may need more discussion.
- We should think more about how we handle casting when constructing Variable objects with `datetime.datetime` or `datetime.timedelta` objects—our current approach may not be up to the task for objects that cannot be represented with nanosecond precision, though this may be considered an upstream issue.
Something else to keep on our radar is the discussion in pandas-dev/pandas#58989. It looks like pandas is reconsidering how to handle precision inference in various contexts—I think what they are leaning towards is pretty similar to what you have currently for decoding, which is good.
# todo: decide how to work with this
# as initialized by string pd.Timestamp is defined only from year -9999-01-01 to 9999-12-31
I guess you were leaning towards letting pandas infer the precision from the string representation of the dates, but that is not an option because of this? I think what you have is reasonable, since cftime objects have microsecond precision.
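For reference, a quick illustration (assuming pandas >= 2.x behavior; this is not code from the PR) of what letting pandas infer the precision from a string looks like:

```python
import pandas as pd

# pandas infers the Timestamp unit from the precision of the string itself;
# cftime dates would need at most microseconds.
print(pd.Timestamp("2000-01-01").unit)                  # "s"
print(pd.Timestamp("2000-01-01 00:00:00.000001").unit)  # "us"
```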
Oh I didn't realize this actually fails too:
>>> pd.Timestamp(10000, 1, 1)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "timestamps.pyx", line 1850, in pandas._libs.tslibs.timestamps.Timestamp.__new__
ValueError: year 10000 is out of range
Maybe we should convert through `datetime.datetime` now that we don't necessarily need to worry about overflow.
This probably moves through the same code path ("string" -> (years, months, days, etc.) -> Timestamp()).
This is a little bit annoying, but given the special cases involved I won't blame pandas in any way. The move to non-nanosecond resolution was a major undertaking and issues will be ironed out as they appear, I think.
Yes, we can try it as you suggest.
Relaxing the resolution extends the range to several hundred thousand centuries with microsecond representation. ``NaT`` will be at ``np.iinfo("int64").min`` for all of the different representations.

.. warning::
    When initialized with a datetime string this is only defined from ``-9999-01-01`` to ``9999-12-31``.
Interesting, so pandas currently does not support date string indexing or creating date ranges from strings with years greater than 10000. I didn't realize this. It's by no means a blocker (it's an upstream issue), but it would be nice to have at some point.
I guess that means we have to decode times with reference dates with years greater than or equal to 10000 with cftime for the time being, which locks us into microsecond precision in that scenario without doing something more clever in `cftime_to_nptime`.
Yes, so it seems. I tried, but no neat solution came to mind.
2. ``standard``/``gregorian`` calendar and ``proleptic_gregorian`` are equivalent for any dates and reference times >= ``1582-10-15``. First the reference time is checked and any timezone information stripped off; in a second step, the minimum and maximum ``values`` are checked to see whether they can be represented in the current reference time resolution. At the same time integer overflow would be caught. For the ``standard``/``gregorian`` calendar the dates are checked to be >= ``1582-10-15``. If anything fails, the decoding is done with ``cftime``.

3. As the time unit (here ``seconds``) and the resolution of the reference time ``1992-10-8 15:15:42.5 -6:00`` (here ``milliseconds``) might be different, this has to be aligned to the higher resolution (retrieving a new unit). This is done by multiplying the ``values`` by the ratio of nanoseconds per time unit and nanoseconds per reference time unit. To not break consistency for ``NaT``, a mask is kept and re-introduced after the multiplication.
I still worry about times encoded as floats. The following example currently truncates precision without warning:
>>> values = np.array([0, 0.25, 0.5, 0.75, 1.0], dtype="float64")
>>> units = "seconds since 2000-01-01"
>>> calendar = "standard"
>>> xr.coding.times.decode_cf_datetime(values, units, calendar)
array(['2000-01-01T00:00:00', '2000-01-01T00:00:00',
'2000-01-01T00:00:00', '2000-01-01T00:00:00',
'2000-01-01T00:00:01'], dtype='datetime64[s]')
I don't know what the best solution is. Always using nanosecond precision for floats would be the safest / simplest option, but that is a bit heavy handed (and is probably not what most people would want—times encoded with floats often don't need that level of precision, and it would limit the range of dates that were representable). We could make the user-configurable minimum precision used for decoding floats separate from that used for integers, but that would introduce another parameter in `set_options`.
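To make the trade-off concrete, here is a rough, hypothetical sketch (not code from this PR; the helper name is made up) of how a float-specific resolution could be inferred by refining the unit until the scaled values are integral:

```python
import numpy as np

def infer_float_decode_unit(values, inferred_unit="s"):
    # Hypothetical helper: refine the unit until the float values are
    # (close to) integers, or we bottom out at nanoseconds.
    units = ["s", "ms", "us", "ns"]
    idx = units.index(inferred_unit)
    scaled = np.asarray(values, dtype="float64")
    while idx < len(units) - 1 and not np.allclose(scaled, np.round(scaled)):
        scaled = scaled * 1000.0
        idx += 1
    return units[idx]

print(infer_float_decode_unit([0, 0.25, 0.5, 0.75, 1.0]))  # "ms"
```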
Yes, special-casing floats, and also another parameter as suggested, is a good step forward. I'll think about that more.
For floats, I think we'll need to pick some default time resolution. Microseconds is probably a sane choice, certainly better than nanoseconds?
Possibly it also makes sense to issue a warning in these cases (unless explicitly silenced via some form of Xarray decoding option) so that users know what is going on.
@@ -137,7 +132,7 @@ Conversion between non-standard calendar and to/from pandas DatetimeIndexes is
 facilitated with the :py:meth:`xarray.Dataset.convert_calendar` method (also available as
 :py:meth:`xarray.DataArray.convert_calendar`). Here, like elsewhere in xarray, the ``use_cftime``
 argument controls which datetime backend is used in the output. The default (``None``) is to
-use `pandas` when possible, i.e. when the calendar is standard and dates are within 1678 and 2262.
+use `pandas` when possible, i.e. when the calendar is standard and dates starting with 1582-10-15.
We should note that with a proleptic Gregorian calendar there is no date restriction.
Yes, this is only a limitation for `standard`/`gregorian`. This needs to change.
@@ -32,6 +32,7 @@
     "use_numbagg",
     "use_opt_einsum",
     "use_flox",
+    "time_resolution",
I suppose we should have some way of noting that this only applies to NumPy dates (not cftime), and is only relevant to decoding. Would a more precise name be something like `"minimum_numpy_time_decoding_precision"`, since setting this doesn't guarantee the precision of the decoded times?
It's kind of verbose though; do you have any thoughts on a clearer name? I guess one edge case where this is not necessarily the minimum precision is that you could end up with microsecond precision if we fall back to decoding times with cftime (due to dates being out of range) and convert back to NumPy.
Agreed, as this switch/parameter will not be used very often in the code, we can come up with a more descriptive name. I'd go with your suggestion if we do not come up with anything better.
    ],
)
def test_infer_timedelta_units(deltas, expected) -> None:
    # todo: why testing the same thing four times?
Huh good catch—it looks like this was unintentionally introduced in #9417. The other three originally were:
(pd.to_timedelta(["1h", "1 day 1 hour"]), "hours"),
(pd.to_timedelta(["1m", "2m", np.nan]), "minutes"),
(pd.to_timedelta(["1m3s", "1m4s"]), "seconds"),
Maybe we should restore these in a small separate PR.
Ah, a copy & paste error, I see. Yes, this should be fixed in main, separately.
@@ -298,11 +246,12 @@ def convert_non_numpy_type(data):
    data = utils.to_0d_object_array(data)

    if isinstance(data, pd.Timestamp):
This scalar handling logic is quite old—I think we might be able to defer to `_possibly_convert_objects`, which has the virtue that it will let pandas decide how to handle these and be consistent with how lists of these objects will be converted.
That would be great if this could all flow through `_possibly_convert_objects`.
 if isinstance(data, datetime):
     data = np.datetime64(data)
 if isinstance(data, timedelta):
-    data = np.timedelta64(getattr(data, "value", data), "ns")
+    data = np.timedelta64(data)
I think pandas still converts scalar `datetime.datetime` and `datetime.timedelta` objects to nanosecond precision values instead of microsecond precision values. Should we let pandas do that through `_possibly_convert_objects` below, or short circuit that (as you have it implemented currently) here?
I maybe lean towards the former, since it maintains the existing behavior, and is consistent with how things are handled in the case of lists or arrays of `datetime.datetime` or `datetime.timedelta` objects, though I agree in a vacuum the latter makes more sense.
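For context, a small illustration of the two behaviors being weighed here (the numpy lines below are standard behavior; the pandas claim is the one stated in the comment above, not demonstrated here):

```python
import datetime
import numpy as np

# The "short circuit" path infers microsecond precision directly from the
# Python objects, whereas routing scalars through pandas currently still
# yields nanosecond precision (per the discussion above).
print(np.datetime64(datetime.datetime(2000, 1, 1)).dtype)  # datetime64[us]
print(np.timedelta64(datetime.timedelta(days=1)).dtype)    # timedelta64[us]
```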
But unfortunately our approach using `pd.Series` to handle casting breaks down in the case of objects that cannot be represented with nanosecond precision, e.g.:
>>> pd.Series(datetime.datetime(1, 1, 1))
0 0001-01-01 00:00:00
dtype: object
In other words we get an object-dtype Series. I wonder if there is an open pandas issue for that.
I've tried to drop pd.Series here, but failed. So I'm open to any solution.
> I think pandas still converts scalar datetime.datetime and datetime.timedelta objects to nanosecond precision values instead of microsecond precision values

This behavior seems quite bad from pandas. Hopefully there's willingness to fix it upstream?
Yeah, I agree, we should inquire about it if there is not an open issue already.
try:
    pd.to_datetime([np.datetime64(1901901901901, "s")])
except Exception as err:
    print("Raises:", err)
try:
    pd.to_datetime(1901901901901, unit="s")
except Exception as err:
    print("Raises:", err)
try:
    pd.to_datetime([1901901901901], unit="s")
except Exception as err:
    print("Raises:", err)
try:
    pd.to_datetime(np.array([1901901901901]), unit="s")
except Exception as err:
    print("Raises:", err)
I guess these errors are not that surprising given that `pd.to_datetime` assumes you want to convert to nanosecond precision if provided with numeric values. We would need some way of specifying we wanted a different precision, i.e. pandas-dev/pandas#49060 (here the `unit` refers to how to interpret the numerical value).
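As a point of comparison (not code from this PR), the same out-of-`ns`-range value can be handled by going through numpy directly, which keeps the requested unit instead of assuming nanoseconds:

```python
import numpy as np

# 1901901901901 seconds since the epoch overflows datetime64[ns],
# but is perfectly representable at second resolution.
value = 1901901901901
print(np.array([value], dtype="datetime64[s]"))
print(np.datetime64(value, "s"))
```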
Yes, this was more about documenting this for myself, because I've tripped over it quite a few times ;-)
.. warning::
    Input data with resolution higher than ``'ns'`` (e.g. ``'ps'``, ``'fs'``, ``'as'``) is truncated (not rounded) at the ``'ns'`` level. This is currently broken for ``'ps'`` input, where it is interpreted as ``'ns'``.
Many thanks for investigating this thoroughly—I agree there is some strange behavior here. It could be worth reporting to pandas if there are not issues already.
I'll create an issue on the pandas tracker when I'm back next week.
Many thanks @spencerkclark!
Yes, I agree, the best way forward is to keep the current behaviour but at the same time allow users to opt in to relaxed processing. I think this is a more or less straightforward change in the code and the tests.
Yes, we would need to special-case floats. It seems I had not thought through all the issues.
This is where I didn't find a neat solution, so we can either keep searching or just accept that upstream has some issues.
Great, I wasn't aware of that discussion. Good to see it goes in a similar direction. Due to traveling I will not have time to work further on this until next week, and might not reply in a timely manner.
Generally, this looks like a big step in the right direction. Thanks for tackling this, Kai!
with numpy and smooth integration with pandas.
Xarray uses the numpy dtypes ``datetime64[unit]`` and ``timedelta64[unit]``
(where unit is one of "s", "ms", "us" and "ns") to represent datetime
data, which offer vectorized (if sometimes buggy) operations with numpy and
Maybe delete `(if sometimes buggy)` now?
The only real bug that I'm familiar with is related to what NumPy considers as the built-in Python scalar corresponding to `ns` precision, which is some form of `int` instead of `datetime.datetime`. This is mitigated significantly by allowing for other datetime64 units.
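The quirk referred to here, for the record (standard numpy behavior, independent of this PR):

```python
import numpy as np

# Converting a nanosecond-precision datetime64 scalar to a built-in Python
# object yields an integer (the nanosecond count since the epoch), because
# datetime.datetime cannot represent nanoseconds; coarser units round-trip
# to datetime.datetime as expected.
print(type(np.datetime64("2000-01-01", "ns").item()))  # <class 'int'>
print(type(np.datetime64("2000-01-01", "us").item()))  # <class 'datetime.datetime'>
```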
if unit not in ["s", "ms", "us", "ns"]:
    unit = "s"
Should this raise an error instead?
I think no. `pd.to_timedelta().to_numpy()` returns `timedelta64[ns]`. The following check and astype just make sure that the result is in the lowest possible resolution but within the range of `TimedeltaIndex` (`["s", "ms", "us", "ns"]`).
if unit not in ["s", "ms", "us", "ns"]:
    as_unit = "s"
Same concern as above -- should this raise an error instead?
Similar to above, but the other way round (decoding). `pd_to_timedelta` returns `ns` resolution. This makes sure the data is returned in the lowest possible resolution the intermediate `TimedeltaIndex` can handle. So for decoding "days" we finally get `s` resolution, which seems more appropriate than `ns`.
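A minimal sketch of the decode behavior described here (illustrative only; the helper name is made up and the real code in the diff may differ in detail):

```python
import pandas as pd

def decode_timedeltas(values, unit):
    # Hypothetical helper: pd.to_timedelta yields nanosecond resolution, so
    # cast back to the coarsest resolution a TimedeltaIndex supports; a unit
    # like "D" (days) is not a valid resolution, hence the fallback to "s".
    deltas = pd.to_timedelta(values, unit=unit)  # timedelta64[ns]
    as_unit = unit if unit in {"s", "ms", "us", "ns"} else "s"
    return deltas.as_unit(as_unit).to_numpy()

print(decode_timedeltas([1, 2, 3], "D").dtype)  # timedelta64[s]
```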
Side note, and a completely unnecessary optimization, but a set is faster than a list in a literal like that even for short lists. It's essentially the same if it's the first item and faster if it's the last. So I've been using sets in these kinds of constructs :-)
In [9]: %timeit "s" in ["s", "ms", "us", "ns"]
26.5 ns ± 2.79 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
In [10]: %timeit "s" in {"s", "ms", "us", "ns"}
27.4 ns ± 1 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
In [11]: %timeit "ns" in ["s", "ms", "us", "ns"]
60 ns ± 1.37 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
In [12]: %timeit "ns" in {"s", "ms", "us", "ns"}
28.2 ns ± 1.31 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
Sorry for the distraction :-) -- great work on this PR -- thanks!
time_resolution : {"s", "ms", "us", "ns"}, default: "s"
    Time resolution used for CF encoding/decoding.
If possible, we should prefer local control for setting a default time precision. This would be logical to do with encoding, e.g., `encoding['dtype'] = "datetime64[s]"`, or perhaps via a keyword argument added to `xarray.decode_cf()`.
Once we do this, do we need this global option?
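To illustrate what such local control might look like (purely a sketch of the suggestion above; neither the encoding-dtype behavior nor a `decode_cf` keyword of this shape is settled by this PR):

```python
import numpy as np
import xarray as xr

# Per-variable control on the encode side via the existing encoding dict
# (treating "datetime64[s]" here as the proposed behavior, not a guaranteed
# current feature):
ds = xr.Dataset({"time": ("time", np.array(["2000-01-01"], dtype="datetime64[s]"))})
ds["time"].encoding["dtype"] = "datetime64[s]"

# A hypothetical keyword on the decode side might look like:
# xr.decode_cf(ds, time_resolution="s")
```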
In the first place I thought the global option would be a neat solution, but it also has its drawbacks (like with multiprocessing). I'm working towards your suggestion.
if units.index(default) > units.index(dt.unit):
    dt = _timestamp_as_unit(dt, default)
This is a little surprising. From the docstring of this function, I would expect it to always convert the units -- or perhaps raise an error if that isn't possible.
OK, maybe the naming of the function/docstring is not really appropriate. This should only convert to higher resolutions if needed, but not the other way round, as that might mean loss of resolution. I'll look into this, thanks!
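A minimal sketch of the intended "only upcast, never downcast" logic (the wrapper name here is made up for illustration; `pd.Timestamp.as_unit` and `.unit` are real pandas 2.x API, and the diff's `_timestamp_as_unit` may differ):

```python
import pandas as pd

def _maybe_as_finer_unit(ts: pd.Timestamp, default: str) -> pd.Timestamp:
    # Only convert to a finer resolution than the Timestamp already has;
    # converting the other way round could silently lose resolution.
    units = ["s", "ms", "us", "ns"]
    if units.index(default) > units.index(ts.unit):
        ts = ts.as_unit(default)
    return ts

ts = pd.Timestamp("2000-01-01")             # unit "s", inferred from the string
print(_maybe_as_finer_unit(ts, "ms").unit)  # "ms"
print(_maybe_as_finer_unit(ts, "s").unit)   # "s" (unchanged)
```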
Thanks Stephan for the review. Looking into that next week.
This is another attempt to resolve #7493. This goes a step further than #9580.
The idea of this PR is to automatically infer the needed resolutions for decoding/encoding and only keep the constraints pandas imposes ("s" - lowest resolution, "ns" - highest resolution). There is still the idea of a default resolution, but this should only take precedence if it doesn't clash with the automatic inference. This can be discussed, though. Update: I've implemented a first try to have a default resolution on decode, which will override the currently inferred resolution only towards a higher resolution (e.g. `'s'` -> `'ns'`).
For sanity checking, and also for my own good, I've created a documentation page on time-coding in the internal dev section. Any suggestions (especially grammar) or ideas for enhancements are much appreciated.
There still might be room for consolidation of functions/methods (mostly in coding/times.py), but I have to leave it alone for some days. I went down that rabbit hole and need to relax, too 😬.
Looking forward to getting your insights here, @spencerkclark, @ChrisBarker-NOAA, @pydata/xarray.
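A rough sketch of the user-visible effect this PR is aiming for (the exact resolution chosen and any configuration knobs are still under discussion in the review above):

```python
import numpy as np
import xarray as xr

# Decoding CF-encoded times: before this PR the result is always
# datetime64[ns]; with relaxed decoding, the inferred (coarser) resolution,
# e.g. seconds for "days since ...", could be kept instead.
ds = xr.Dataset(
    {"time": ("time", np.array([0, 1, 2], dtype="int64"),
              {"units": "days since 2000-01-01"})}
)
decoded = xr.decode_cf(ds)
print(decoded["time"].dtype)  # datetime64[ns] on main, a coarser unit with this PR
```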