Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add .attrs to highlevel objects #2757

Merged
merged 13 commits into from
Nov 8, 2023
Merged

Conversation

agoose77
Copy link
Collaborator

@agoose77 agoose77 commented Oct 13, 2023

This will partially fix #1391 by adding .attrs, a dictionary of metadata that is associated with the top-level array, with the following semantics:

  • Operations with multiple arrays take the set-union of properties, first wins
  • Keys beginning with a special prefix (@) will be removed at serialisation time.

Whilst touching some of this code, I did some work on type hints and refactoring.

@agoose77
Copy link
Collaborator Author

Design question (1) — should array.attrs always return a dict, or can it return None?

Return None

(invert the below for "always return a dict")

  • Convenient token for "no metadata", allowing optimisations (including non-Awkward cases)
  • Harder to set metadata in a non-functional style (unless metadata already set)

@jpivarski
Copy link
Member

The _parameters are internally None or dict for performance, but the external parameters property always returns a dict for uniformity. In fact, the property creates the dict so that it can be mutably updated from that point onward.

I think this is a good compromise, and it would follow the principle of least surprise if attrs does the same thing. In fact, I think this is what behaviors does, too.

You were only asking about how it looks from the outside, so I'm voting "always return a dict." But since most users won't be using attrs (that's the situation we're starting from before this first implementation), so using None internally minimizes the performance impact of adding a new feature across all ak.Arrays.

If we really need a has_attrs: bool (more likely a has_attr(name)), then we can add it.


Oh, unlike parameters, I think we should not conflate key-not-found with value-is-None. We should allow users to consider None a meaningful value for an attribute, distinct from not having that attribute. attrs is the users' space, much more so than parameters.

@agoose77
Copy link
Collaborator Author

The _parameters are internally None or dict for performance, but the external parameters property always returns a dict for uniformity. In fact, the property creates the dict so that it can be mutably updated from that point onward.
...
I think this is a good compromise, and it would follow the principle of least surprise if attrs does the same thing. In fact, I think this is what behaviors does, too.

Actually, Array.behavior returns None | dict, making it possible to predict whether the array will use the global behavior lookup.

attrs doesn't have that benefit; there's only a single namespace to look at, so this motivation for exposing None disappears. I think it's therefore OK to hide the detail from users.


Oh, unlike parameters, I think we should not conflate key-not-found with value-is-None. We should allow users to consider None a meaningful value for an attribute, distinct from not having that attribute. attrs is the users' space, much more so than parameters.

Agreed.

@agoose77 agoose77 force-pushed the agoose77/feat-array-attrs branch 10 times, most recently from ffd0597 to 4a7361e Compare October 19, 2023 22:08
@agoose77 agoose77 marked this pull request as ready for review October 19, 2023 22:11
@agoose77 agoose77 temporarily deployed to docs October 19, 2023 22:18 — with GitHub Actions Inactive
@agoose77 agoose77 temporarily deployed to docs October 19, 2023 22:48 — with GitHub Actions Inactive
@agoose77 agoose77 added the pr-on-hold This PR is inactive due to a pending decision or other constraint label Oct 20, 2023
@agoose77
Copy link
Collaborator Author

This PR indicates that it would be nice to have something like

ctx = HighLevelCtx()

layouts = ctx.finalise([
    ak.to_layout(ctx.maybe_highlevel(a)) for a in arrays
])

backend = ctx.backend
behavior = ctx.behavior

Rather than the existing stateless functional API that revisits inputs multiple times. This function would ensure all layouts have the proper "final" backend.

As such, it's on-hold for PR #2763 and another PR.

@agoose77 agoose77 marked this pull request as draft October 23, 2023 17:34
@agoose77 agoose77 force-pushed the agoose77/feat-array-attrs branch from bf62d43 to c346295 Compare October 30, 2023 11:53
@pre-commit-ci pre-commit-ci bot temporarily deployed to docs October 30, 2023 12:11 Inactive
@agoose77 agoose77 force-pushed the agoose77/feat-array-attrs branch 5 times, most recently from b8f6058 to 78231ff Compare October 30, 2023 23:13
@agoose77
Copy link
Collaborator Author

@jpivarski this is a big PR.

I've split it into commits that should be a bit more manageable.

I still need to do a pass to confirm that the various policy settings used by ak.to_layout are sensible, but I'd benefit from the main review to confirm that the (internal) API changes are a step in the right direction.

The purpose of the HighLevelContext addition to this PR was make the general idiom of pulling behaviors off of high-level object(s) and applying them to the results is less error prone. The use of ctx.wrap_layout leads to an Exception if the context hasn't been explicitly finalised, making it harder to pass the wrong behavior argument around.

There's a test failure pertaining to the choice to permit records, that I'll revisit.

@agoose77 agoose77 requested a review from jpivarski October 30, 2023 23:14
@agoose77
Copy link
Collaborator Author

Do we have a formal policy regarding which functions should prohibit/support record objects? e.g. ak.is_none on a record.

Copy link
Member

@jpivarski jpivarski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a big PR with a lot of differences, although many of them come from adding the single extra attrs argument to lots of functions, which in many cases turns them from single-line to multi-line due to black.

The context object is a big difference. I see that this context object is private, for our own use, and I'm not certain that it's necessary for keeping track of two pieces of data (behavior and attrs), but it's fine.

(In your comment, I thought you were referring to a public context object, so that people can say,

with ak.Context(behavior, attrs):
    ak.something...

although that doesn't make a lot of sense: one would want the behavior and attrs to be permanently glued to specific objects and not others. But I guess you were talking about a private context.)

I'm in favor of the direction that this PR is heading.

src/awkward/_attrs.py Outdated Show resolved Hide resolved
@jpivarski jpivarski removed the pr-on-hold This PR is inactive due to a pending decision or other constraint label Nov 7, 2023
Copy link
Member

@jpivarski jpivarski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a huge work, and tests/test_2757_attrs_metadata.py is very extensive, parameterizing over so many functions to make sure that attrs are correctly passed through all of them.

My only request is about Numba: see below. If you have questions about that, I can help during our meeting tomorrow.

src/awkward/_connect/numba/arrayview.py Outdated Show resolved Hide resolved
@agoose77 agoose77 merged commit 8a2fa20 into main Nov 8, 2023
35 checks passed
@agoose77 agoose77 deleted the agoose77/feat-array-attrs branch November 8, 2023 16:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

xarray-style "attrs", global and per-record field
2 participants