Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: add policy for length erasure #2765

Closed
wants to merge 3 commits into from

Conversation

agoose77
Copy link
Collaborator

@agoose77 agoose77 commented Oct 22, 2023

Fix #2764 by adding support for recursive erasure of length information.

Instead of forget_length: bool, we have forget_length: Literal["keep", "drop_outer", "drop_recursive"], which enables us to decide how deep the lengths are forgotten. For the Form → Content case, we always want recursively to drop lengths.

Warning
This PR introduces a deprecation in the forget_length argument.

@agoose77 agoose77 requested a review from jpivarski October 22, 2023 15:50
@codecov
Copy link

codecov bot commented Oct 22, 2023

Codecov Report

Merging #2765 (a5d0892) into main (fa4b2bc) will decrease coverage by 0.01%.
The diff coverage is 94.11%.

Additional details and impacted files
Files Coverage Δ
src/awkward/_nplikes/typetracer.py 76.47% <ø> (ø)
src/awkward/contents/bitmaskedarray.py 69.25% <100.00%> (+0.09%) ⬆️
src/awkward/contents/bytemaskedarray.py 89.07% <100.00%> (ø)
src/awkward/contents/emptyarray.py 75.12% <100.00%> (ø)
src/awkward/contents/indexedarray.py 78.99% <100.00%> (ø)
src/awkward/contents/indexedoptionarray.py 88.48% <100.00%> (ø)
src/awkward/contents/listarray.py 88.38% <100.00%> (ø)
src/awkward/contents/listoffsetarray.py 82.77% <100.00%> (ø)
src/awkward/contents/numpyarray.py 91.22% <100.00%> (ø)
src/awkward/contents/recordarray.py 84.55% <100.00%> (ø)
... and 8 more

Copy link
Member

@jpivarski jpivarski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an L2 public function, so we can, in principle, be strict with users. However, it doesn't seem to be necessary here. Changing

forget_length: bool

into

length_policy: Literal["keep", "drop_outer", "drop_recursive"]

requires calling functions to change, so so there's a deprecation cycle. But it could have been changed to

forget_length: bool | Literal["false", "true", "recursive"]

such that False corresponds to "false", True corresponds to "true" (dropping just the outer length, which is the old behavior), and "recursive" is the only new behavior: dropping lengths recursively.

That way, there would be no need to make downstream libraries change. Since we're strictly extending the old behavior (new set of options is a superset of the old set of options) and there's a natural way to keep using the old words while introducing a new word, this is less of an imposition. Only downstream libraries that want the new behavior need to do anything; others don't even need to know about the change. (In addition, I have a more immediate understanding of what "forget length" means than "length policy".)

As another alternative, perhaps

forget_length: bool | Literal["recursive"]

to be tighter? It depends on whether it's important for all options to have identical type. (Well, they do, even now: it's a union type.)

Also, why string literals and not enums? Do any enums naturally map onto false and true?

@agoose77
Copy link
Collaborator Author

This is an L2 public function, so we can, in principle, be strict with users. However, it doesn't seem to be necessary here. Changing

forget_length: bool

into

length_policy: Literal["keep", "drop_outer", "drop_recursive"]

requires calling functions to change, so so there's a deprecation cycle. But it could have been changed to

forget_length: bool | Literal["false", "true", "recursive"]

such that False corresponds to "false", True corresponds to "true" (dropping just the outer length, which is the old behavior), and "recursive" is the only new behavior: dropping lengths recursively.

My personal preference is fairly strongly against mixing types like this, especially at L2 where we can move further away from interactive-friendly APIs towards stricter typing.

That way, there would be no need to make downstream libraries change. Since we're strictly extending the old behavior (new set of options is a superset of the old set of options) and there's a natural way to keep using the old words while introducing a new word, this is less of an imposition. Only downstream libraries that want the new behavior need to do anything; others don't even need to know about the change.

That is a benefit, though, of gradually extending the argument.

(In addition, I have a more immediate understanding of what "forget length" means than "length policy".)

Naming things is hard :( I like policy because it implies to me that there's more than two options.

As another alternative, perhaps

forget_length: bool | Literal["recursive"]

to be tighter? It depends on whether it's important for all options to have identical type. (Well, they do, even now: it's a union type.)

Also, why string literals and not enums? Do any enums naturally map onto false and true?

Enum's don't play as nicely with typing as I would like. I don't think you can type a function such that it accepts a literal string or an enum value; even StrEnums are considered different types. The benefit for using enums is the safety; a misspelled identifier throws a NameError. I'm finding with editor annotations that the use of type hints helps eliminate this kind of mistake, and it's my hope that we'll further concrete this once mypy is in the CI. Furthermore, defining classes for a single functions' options feels clunky vs inlining the literal types.

@jpivarski
Copy link
Member

My one "on the other hand" is that this is an L2 function and we can be strict. But on the first hand, it just seems like this will appear to downstream dependencies as deck-chair shuffling, forcing them to change their code and introduce a version dependence without an obvious benefit.

A union type is a single type. As I understand from your argument, it is the editors/IDEs that will most strongly pass on the benefit of this. (That's why a type hint is preferred over an enum.) Do the editors have a problem with union types? Does it not tab-complete or show you a menu of possible completions if the type is a union type?

@agoose77
Copy link
Collaborator Author

A union type is a single type. As I understand from your argument, it is the editors/IDEs that will most strongly pass on the benefit of this. (That's why a type hint is preferred over an enum.) Do the editors have a problem with union types? Does it not tab-complete or show you a menu of possible completions if the type is a union type?

No, LSP and other tools support unions just fine — we shouldn't discard a union for that reason.

My rationale is essentially that this should always have supported more than one option I think; dropping lengths is a deeper operation than the top level. But I'm working on a few things at the moment, and this is not the hill to die on! So, if you've not been convinced in favour of deprecation in this PR, then I'm happy to kick the bucket down the road to a future "deprecation cycle" and/or not do it :)

@jpivarski
Copy link
Member

I'll need to look into it more deeply, but I don't see how this is what is needed to fix #2764. The outer length was the only length dropped from type-tracers for a reason. (Partitioning only happens at top level.) I'll let you know if I'm wrong, but I think it can be fixed by a smaller change.

So we'll put this aside for now. I'll make this PR a draft. If #2764 can be solved without it, we can close it.

@jpivarski jpivarski marked this pull request as draft October 23, 2023 14:42
@agoose77
Copy link
Collaborator Author

The outer length was the only length dropped from type-tracers for a reason. (Partitioning only happens at top level.) I'll let you know if I'm wrong, but I think it can be fixed by a smaller change.

Whilst it's true that we envisage dropping lengths for top-level partitions, I think we now use this function more generally than that. For example, taking an existing array and converting it to typetracer will produce an array whose buffers nearly all have known lengths. There may be cases where this is intended, but others where it is not.

So, whilst we can certainly fix #2764 by doing something special for the typetracer factory method, I think handling all of the other known cases we want to support means adding more policies for length erasure.

@agoose77
Copy link
Collaborator Author

After discussing this with @jpivarski in our weekly Zoom meeting, we've concluded that lengths should never be partially forgotten — it should be all or nothing, as most of the time we don't know lengths because we don't know buffers, i.e. interior nodes cannot have known lengths.

The only case that we'd lose by choosing recursive forgetfulness is when a concrete layout is partitioned into smaller parts. That's not something we do for dask-awkward, nor is it likely that useful.

@agoose77 agoose77 closed this Oct 26, 2023
@jpivarski jpivarski deleted the agoose77/fix-drop-length-policy branch February 1, 2024 19:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Typetracer can't slice with (int, str, int)
2 participants