Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle pandas timestamp with nanosecs precision #49370

Open
wants to merge 12 commits into
base: master
Choose a base branch
from

Conversation

srinathk10
Copy link
Contributor

Why are these changes needed?

Handle pandas timestamp with nanosecs precision

Related issue number

"Closes #49297"

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@srinathk10 srinathk10 requested a review from a team as a code owner December 19, 2024 21:54
Copy link
Contributor

@omatthew98 omatthew98 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few comments about the tests, I also wonder if we could somehow fold all these tests into a single parameterized test that is agnostic of the type (e.g. use the expected type in the parameters somewhere to know what type to expect. Would help with the standardization across the tests.

# Convert Python datetime to pandas Timestamp with nanosecond precision
value = pd.Timestamp(value)
value = pa.array([value], type=pa.timestamp("ns"))[0]

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there any ways that this conversion can fail or precision can be lost? I imagine probably not but might be good to note anything here if there are.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey again, piping in from the original issue.

If this is already a datetime object, hasn't it already lost nanosecond precision? Is it too late to perform this coercion?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Default behavior is type coercion as we discovered in the bug from the test case. This is explicitly handling the type conversion for nanoseconds.

"df, expected_df",
[
(
create_timestamp_dataframe(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless I am missing something this looks identical to the result of create_timestamp_dataframe? Should we just call that function once? Should we only pass in one dataframe?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I see now that there is a 1ns difference, can we include a comment to specify that? Or could just have one DF passed in and do the addition of 1ns in the test itself?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will check if I can simplify this without limiting the correctness check.

processed_df = result.to_pandas()
assert processed_df.shape == df.shape, "DataFrame shapes do not match"
pd.testing.assert_frame_equal(processed_df, expected_df)
assert (processed_df["timestamp"] - df["timestamp"]).iloc[0] == pd.Timedelta(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these last three asserts redundant with the above equality check?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah this is checking for all 3 rows.

result = ray_data.map(process_timestamp_data)
processed_df = result.to_pandas()
# Ensure numpy.datetime64 is correctly converted to pandas Timestamp
assert isinstance(processed_df["timestamp"].iloc[0], pd.Timestamp)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we just check if the series for the column has the correct type?

assert isinstance(processed_df["timestamp"].iloc[0], pd.Timestamp)

# Check that the timestamp has been incremented by 1ns
assert (processed_df["timestamp"] - df["timestamp"]).min() == pd.Timedelta(1, "ns")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we standardize how we are checking the diff of 1ns? Seems like we are doing this, the equality check with the expected check and also individually checking the differences. Not super opinionated on which is the clearest but let's just use one (unless I am missing some subtlety here)
.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one is datetime vs np.datetime64. Let me consolidate this code better.

srinathk10 and others added 6 commits December 23, 2024 04:41
Signed-off-by: Srinath Krishnamachari <[email protected]>
Signed-off-by: srinathk10 <[email protected]>
Signed-off-by: srinathk10 <[email protected]>
Signed-off-by: Srinath Krishnamachari <[email protected]>
Signed-off-by: srinathk10 <[email protected]>
@srinathk10 srinathk10 added the go add ONLY when ready to merge, run all tests label Dec 23, 2024
Signed-off-by: Srinath Krishnamachari <[email protected]>
Signed-off-by: Srinath Krishnamachari <[email protected]>
Signed-off-by: Srinath Krishnamachari <[email protected]>
Signed-off-by: Srinath Krishnamachari <[email protected]>
Signed-off-by: Srinath Krishnamachari <[email protected]>
Copy link
Contributor

@omatthew98 omatthew98 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for cleaning up the tests! We could probably consolidate the two slightly more but approving regardless to unblock.

)
],
)
def test_map_numpy_datetime(df, expected_df, ray_start_regular_shared):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: It seems like test_map_numpy_datetime and test_map_timestamp_nanosecs have the same body? Can probably move this to be one test with two different sets of parameters.

if isinstance(value, (pd.Timestamp, np.datetime64)):
# If it's a pandas Timestamp or numpy datetime64, convert to pyarrow
# Timestamp
value = pa.array([value], type=pa.timestamp("ns"))[0]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if the target block type is pandas, will converting it to pyarrow Timestamp be compatible?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably put this in the arrow_block.py subclass

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also, can you comment that the purpose of this conversion is to avoid losing precision?

)
],
)
def test_map_timestamp_nanosecs(df, expected_df, ray_start_regular_shared):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • can you document the purpose of each test?
  • maybe put them in test_pandas.py, as the issue is specific to Pandas.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
go add ONLY when ready to merge, run all tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Data] Inconsistent behavior with Ray Data and timestamps
4 participants