[Data] Inconsistent behavior with Ray Data and timestamps #49297
Comments
@jcotant1 the crux for both issues is around how we handle nanoseconds
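To illustrate why nanosecond handling matters, here is a minimal standard-library sketch. It is not Ray Data's actual code path, and the sample value and the helper name `ns_to_datetime` are made up for illustration. Python's `datetime` resolves microseconds at best, so any conversion that passes through it drops the last three digits of a nanosecond timestamp.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def ns_to_datetime(epoch_ns: int) -> datetime:
    """Convert integer nanoseconds since the Unix epoch to a datetime.

    Hypothetical helper: datetime stores microseconds at best, so the
    sub-microsecond remainder is discarded by the floor division.
    """
    return EPOCH + timedelta(microseconds=epoch_ns // 1_000)


# Hypothetical sample value: 2024-01-01T00:00:00.123456789Z
original_ns = 1_704_067_200_123_456_789
dt = ns_to_datetime(original_ns)

# Convert back to nanoseconds; the trailing sub-microsecond digits are gone.
roundtrip_ns = ((dt - EPOCH) // timedelta(microseconds=1)) * 1_000
lost_ns = original_ns - roundtrip_ns
print(dt.isoformat(), "lost", lost_ns, "ns")
```

Any layer that silently falls back to a microsecond-resolution type would show the same kind of loss.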
Based on the warnings logged from ray data,
Test case for this.
For the 2nd issue, I believe it's because Ray Data relies on PyArrow's Pandas integration. PyArrow doesn't seem to respect the conversion table here if the timestamps are inside a StructArray. Repro:
I believe that's the exact issue. I'm unfortunately not sure if pandas will be able to iterate on a fix quickly in this case. Perhaps the ray team has a better relationship with the pandas devs than I (a relative stranger) may have? A pandas-centric fix also means that users are forced to use a specific version of pandas in their environments.
Would you be able to convert the timestamp to int64? This should handle nanosecs precision.
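As a rough sketch of the int64 suggestion, the rows below carry the timestamp as integer nanoseconds since the Unix epoch and only split it into a `datetime` plus a nanosecond remainder at the edges. It uses plain Python dicts rather than Ray Data or pandas, and the column name `event_ts_ns` and all helper names are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Tuple

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
ONE_MICROSECOND = timedelta(microseconds=1)


def split_epoch_ns(epoch_ns: int) -> Tuple[datetime, int]:
    """Split integer nanoseconds into (datetime, leftover nanoseconds)."""
    micros, nanos_remainder = divmod(epoch_ns, 1_000)
    return EPOCH + timedelta(microseconds=micros), nanos_remainder


def join_epoch_ns(dt: datetime, nanos_remainder: int) -> int:
    """Inverse of split_epoch_ns: rebuild the integer nanosecond value."""
    return ((dt - EPOCH) // ONE_MICROSECOND) * 1_000 + nanos_remainder


def shift_one_hour(row: dict) -> dict:
    """Example transform: decode, do datetime arithmetic, re-encode."""
    dt, rem = split_epoch_ns(row["event_ts_ns"])
    return {**row, "event_ts_ns": join_epoch_ns(dt + timedelta(hours=1), rem)}


# Hypothetical rows; the timestamp column is a plain integer throughout.
rows = [{"id": 0, "event_ts_ns": 1_704_067_200_123_456_789}]
shifted = [shift_one_hour(r) for r in rows]
print(shifted[0]["event_ts_ns"] - rows[0]["event_ts_ns"])  # nanoseconds in one hour
```

The integer column keeps every nanosecond digit, but each transform that needs real timestamp semantics has to decode and re-encode it explicitly.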
Unfortunately, this means that whoever uses the data is now responsible for remembering to convert it back to a timestamp (in each transform they This is how we currently "patch" the behavior, but it requires an extra layer of API that is rather expensive to maintain instead of just using ray data as a product directly.
I actually ran into the same bug; however, I only became aware of it when ray data raised an error during This happens in cases where the types of the timestamp columns of two blocks are inferred differently:
## Why are these changes needed? Handle pandas timestamp with nanosecs precision ## Related issue number "Closes ray-project#49297" --------- Signed-off-by: Srinath Krishnamachari <[email protected]> Signed-off-by: Roshan Kathawate <[email protected]>
What happened + What you expected to happen
*This example has been provided as a gist instead of in-line code to make it easier to reproduce and trace.
Two odd behaviors are identified:
1. One with the .map() API.
2. One with the .map_batches() (or .take_batches()) API.
The first issue is isolated to just ray.data, while the second is demonstrated to fail with the pandas API as well.
This currently blocks me from using ray data as I cannot extract timestamps consistently across my data sources.
Versions / Dependencies
Name: numpy
Version: 1.24.4
Name: pandas
Version: 2.0.3
Name: pyarrow
Version: 17.0.0
Name: ray
Version: 2.10.0
Name: host
Version: Darwin [hostname-redacted] 23.5.0 Darwin Kernel Version 23.5.0: Wed May 1 20:13:18 PDT 2024; root:xnu-10063.121.3~5/RELEASE_ARM64_T6030 arm64
Reproduction script
Gist: https://gist.github.com/NellyWhads/fdfb261a027be7e7bc87bec91d9e9035
Issue Severity
High: It blocks me from completing my task.