SNOW-1445416, SNOW-1445419: Implement DataFrame/Series.attrs #2386

Merged
sfc-gh-joshi merged 20 commits into main from joshi-SNOW-1445416-df-attrs on Oct 22, 2024

Conversation

@sfc-gh-joshi sfc-gh-joshi commented Oct 1, 2024

  1. Which Jira issue is this PR addressing? Make sure that there is an accompanying issue to your PR.

    Fixes SNOW-1445416 and SNOW-1445419

  2. Fill out the following pre-review checklist:

    • I am adding a new automated test(s) to verify correctness of my new code
      • If this test skips Local Testing mode, I'm requesting review from @snowflakedb/local-testing
    • I am adding new logging messages
    • I am adding a new telemetry message
    • I am adding new credentials
    • I am adding a new dependency
    • If this is a new feature/behavior, I'm adding the Local Testing parity changes.
    • I acknowledge that I have ensured my changes to be thread-safe
  3. Please describe how your code solves the related issue.

Implements DataFrame/Series.attrs by adding a new query compiler variable _attrs that is read out by frontend objects. A new annotation on the query compiler, _propagate_attrs_on_methods, will either copy _attrs from self to the return value, or reset _attrs on the return value.
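The copy-or-reset behavior described above can be sketched roughly as follows. This is a minimal illustration and not the code from this PR: `ToyQueryCompiler`, `RESET_ATTRS_METHODS`, and the helper names are hypothetical, and the use of `copy.deepcopy` is an assumption.

```python
import copy
import functools

# Hypothetical: names of methods whose results should start with empty attrs.
RESET_ATTRS_METHODS = {"describe"}


def _propagate_attrs(method):
    """Copy _attrs from self onto a query-compiler result."""
    @functools.wraps(method)
    def wrap(self, *args, **kwargs):
        result = method(self, *args, **kwargs)
        if isinstance(result, ToyQueryCompiler) and len(self._attrs):
            # Assumption: a deep copy, so later edits do not leak between objects.
            result._attrs = copy.deepcopy(self._attrs)
        return result
    return wrap


def _reset_attrs(method):
    """Clear _attrs on a query-compiler result."""
    @functools.wraps(method)
    def wrap(self, *args, **kwargs):
        result = method(self, *args, **kwargs)
        if isinstance(result, ToyQueryCompiler) and len(result._attrs):
            result._attrs = {}
        return result
    return wrap


def propagate_attrs_on_methods(cls):
    """Class decorator: wrap every public method with one of the two wrappers."""
    for name, attr in list(vars(cls).items()):
        if name.startswith("_") or not callable(attr):
            continue
        wrapper = _reset_attrs if name in RESET_ATTRS_METHODS else _propagate_attrs
        setattr(cls, name, wrapper(attr))
    return cls


@propagate_attrs_on_methods
class ToyQueryCompiler:
    """Hypothetical stand-in for the real query compiler."""

    def __init__(self, data):
        self.data = list(data)
        self._attrs = {}

    def head(self, n):
        return ToyQueryCompiler(self.data[:n])

    def describe(self):
        return ToyQueryCompiler([len(self.data)])


qc = ToyQueryCompiler([1, 2, 3])
qc._attrs = {"source": "sensor-a"}
head_attrs = qc.head(2)._attrs          # propagated copy
describe_attrs = qc.describe()._attrs   # reset to empty
```

Because the wrapping happens once at class-definition time, individual methods such as `head` need no attrs-specific code of their own.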

I initially intended to implement this solely at the frontend layer with the override system (similar to how telemetry is added to all methods), but this created difficulties when preserving attrs across in-place operations like df.columns = [...], and could create ambiguity if the frame had a column named "_attrs". Implementing propagation at the query compiler level is simpler.

This PR also adds a new test_attrs=True parameter to eval_snowpark_pandas_result. eval_snowpark_pandas_result will set a dummy value of attrs on its inputs, and ensure that if the result is a DF/Series, the attrs field on the result matches that of pandas. Since pandas isn't always consistent about whether it propagates attrs or resets it (for some methods, the behavior depends on the input, and for some methods, it is inconsistent between Series/DF), setting test_attrs=False skips this check. When I encountered such inconsistent methods, I elected to have Snowpark pandas always propagate attrs, since it seems unlikely that users would rely on the attrs of a result being empty if they did not explicitly set it.
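A rough sketch of how such a check could work, using toy objects rather than Snowpark pandas or pandas. `ToyFrame`, `ResettingFrame`, `DUMMY_ATTRS`, and `eval_result` are hypothetical names for illustration; the real `eval_snowpark_pandas_result` has a different signature and does more.

```python
class ToyFrame:
    """Hypothetical stand-in for a DataFrame/Series with an attrs dict."""

    def __init__(self, data, attrs=None):
        self.data = list(data)
        self.attrs = dict(attrs or {})

    def head(self, n):
        # This toy implementation propagates attrs to the result.
        return ToyFrame(self.data[:n], self.attrs)


class ResettingFrame(ToyFrame):
    """Hypothetical reference implementation that resets attrs instead."""

    def head(self, n):
        return ToyFrame(self.data[:n])


DUMMY_ATTRS = {"key": "attrs propagation check"}


def eval_result(candidate, reference, operation, test_attrs=True):
    """Run operation on both inputs and compare data and, optionally, attrs."""
    if test_attrs:
        # Set the same dummy attrs on both inputs before running the operation.
        candidate.attrs = dict(DUMMY_ATTRS)
        reference.attrs = dict(DUMMY_ATTRS)
    candidate_result = operation(candidate)
    reference_result = operation(reference)
    assert candidate_result.data == reference_result.data
    # Only frame-like results carry attrs, so only those are compared.
    if test_attrs and isinstance(reference_result, ToyFrame):
        assert candidate_result.attrs == reference_result.attrs, "attrs mismatch"
    return candidate_result


def take_two(frame):
    return frame.head(2)


# Both sides propagate attrs, so the default check passes.
matching = eval_result(ToyFrame([1, 2, 3]), ToyFrame([1, 2, 3]), take_two)

# The reference resets attrs here, so the comparison is skipped explicitly.
skipped = eval_result(
    ToyFrame([1, 2, 3]), ResettingFrame([1, 2, 3]), take_two, test_attrs=False
)
```

With `test_attrs=True`, the second call would fail with an `AssertionError`, which is the situation the opt-out is meant for.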

@sfc-gh-joshi sfc-gh-joshi force-pushed the joshi-SNOW-1445416-df-attrs branch from 7aff78e to 674d689 Compare October 3, 2024 00:19
@sfc-gh-joshi sfc-gh-joshi force-pushed the joshi-SNOW-1445416-df-attrs branch from fb1b0d8 to e1ac41f Compare October 10, 2024 22:26
@github-actions github-actions bot locked and limited conversation to collaborators Oct 15, 2024
@sfc-gh-joshi sfc-gh-joshi reopened this Oct 15, 2024
@sfc-gh-joshi sfc-gh-joshi force-pushed the joshi-SNOW-1445416-df-attrs branch from 13aec05 to 4b18f6a Compare October 15, 2024 21:43
@sfc-gh-joshi sfc-gh-joshi marked this pull request as ready for review October 16, 2024 22:11
@sfc-gh-joshi sfc-gh-joshi requested a review from a team as a code owner October 16, 2024 22:11
@sfc-gh-helmeleegy sfc-gh-helmeleegy left a comment

LGTM. Great work adding all the tests!

@sfc-gh-joshi sfc-gh-joshi force-pushed the joshi-SNOW-1445416-df-attrs branch from 860cacb to 9a3a9b8 Compare October 21, 2024 20:39
@functools.wraps(method)
def wrap(self, *args, **kwargs):  # type: ignore
    result = method(self, *args, **kwargs)
    if isinstance(result, SnowflakeQueryCompiler) and len(self._attrs):

What is this check, len(self._attrs), for?

Contributor Author

It checks that self._attrs is not empty. If it is empty, we skip setting result._attrs, since intermediate calls to query compiler operations may have already set result._attrs = self._attrs. This lets us avoid creating a new empty dict for result._attrs when it's already empty (not a very important optimization, but I think it's good practice anyway).
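A small sketch of the effect of that guard, with hypothetical names (`ToyCompiler`, `copy_attrs_if_set`) and `copy.deepcopy` as an assumed copy strategy:

```python
import copy


class ToyCompiler:
    """Hypothetical stand-in holding only the _attrs dict."""

    def __init__(self):
        self._attrs = {}


def copy_attrs_if_set(source, result):
    # Guard: skip the assignment entirely when the source has no attrs.
    if len(source._attrs):
        result._attrs = copy.deepcopy(source._attrs)
    return result


source, result = ToyCompiler(), ToyCompiler()
existing = result._attrs

copy_attrs_if_set(source, result)
untouched = result._attrs is existing  # the original empty dict is kept

source._attrs = {"units": "cm"}
copy_attrs_if_set(source, result)
copied = result._attrs  # now an independent copy of the source attrs
```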

@sfc-gh-joshi sfc-gh-joshi force-pushed the joshi-SNOW-1445416-df-attrs branch from a5f3248 to 0654156 Compare October 22, 2024 21:59
@sfc-gh-joshi sfc-gh-joshi enabled auto-merge (squash) October 22, 2024 21:59
@sfc-gh-joshi sfc-gh-joshi merged commit 0b56f4b into main Oct 22, 2024
38 of 40 checks passed
@sfc-gh-joshi sfc-gh-joshi deleted the joshi-SNOW-1445416-df-attrs branch October 22, 2024 23:06