fix: add policy for length erasure #2765

agoose77 · 2023-10-22T15:50:57Z

Fix #2764 by adding support for recursive erasure of length information.

Instead of forget_length: bool, we have forget_length: Literal["keep", "drop_outer", "drop_recursive"], which enables us to decide how deep the lengths are forgotten. For the Form → Content case, we always want recursively to drop lengths.

Warning
This PR introduces a deprecation in the forget_length argument.

codecov · 2023-10-22T15:57:25Z

Codecov Report

Merging #2765 (a5d0892) into main (fa4b2bc) will decrease coverage by 0.01%.
The diff coverage is 94.11%.

Additional details and impacted files

Files	Coverage Δ
src/awkward/_nplikes/typetracer.py	`76.47% <ø> (ø)`
src/awkward/contents/bitmaskedarray.py	`69.25% <100.00%> (+0.09%)`	⬆️
src/awkward/contents/bytemaskedarray.py	`89.07% <100.00%> (ø)`
src/awkward/contents/emptyarray.py	`75.12% <100.00%> (ø)`
src/awkward/contents/indexedarray.py	`78.99% <100.00%> (ø)`
src/awkward/contents/indexedoptionarray.py	`88.48% <100.00%> (ø)`
src/awkward/contents/listarray.py	`88.38% <100.00%> (ø)`
src/awkward/contents/listoffsetarray.py	`82.77% <100.00%> (ø)`
src/awkward/contents/numpyarray.py	`91.22% <100.00%> (ø)`
src/awkward/contents/recordarray.py	`84.55% <100.00%> (ø)`
... and 8 more

jpivarski

This is an L2 public function, so we can, in principle, be strict with users. However, it doesn't seem to be necessary here. Changing

forget_length: bool

into

length_policy: Literal["keep", "drop_outer", "drop_recursive"]

requires calling functions to change, so so there's a deprecation cycle. But it could have been changed to

forget_length: bool | Literal["false", "true", "recursive"]

such that False corresponds to "false", True corresponds to "true" (dropping just the outer length, which is the old behavior), and "recursive" is the only new behavior: dropping lengths recursively.

That way, there would be no need to make downstream libraries change. Since we're strictly extending the old behavior (new set of options is a superset of the old set of options) and there's a natural way to keep using the old words while introducing a new word, this is less of an imposition. Only downstream libraries that want the new behavior need to do anything; others don't even need to know about the change. (In addition, I have a more immediate understanding of what "forget length" means than "length policy".)

As another alternative, perhaps

forget_length: bool | Literal["recursive"]

to be tighter? It depends on whether it's important for all options to have identical type. (Well, they do, even now: it's a union type.)

Also, why string literals and not enums? Do any enums naturally map onto false and true?

agoose77 · 2023-10-23T13:38:49Z

This is an L2 public function, so we can, in principle, be strict with users. However, it doesn't seem to be necessary here. Changing
forget_length: bool
into
length_policy: Literal["keep", "drop_outer", "drop_recursive"]
requires calling functions to change, so so there's a deprecation cycle. But it could have been changed to
forget_length: bool | Literal["false", "true", "recursive"]
such that False corresponds to "false", True corresponds to "true" (dropping just the outer length, which is the old behavior), and "recursive" is the only new behavior: dropping lengths recursively.

My personal preference is fairly strongly against mixing types like this, especially at L2 where we can move further away from interactive-friendly APIs towards stricter typing.

That way, there would be no need to make downstream libraries change. Since we're strictly extending the old behavior (new set of options is a superset of the old set of options) and there's a natural way to keep using the old words while introducing a new word, this is less of an imposition. Only downstream libraries that want the new behavior need to do anything; others don't even need to know about the change.

That is a benefit, though, of gradually extending the argument.

(In addition, I have a more immediate understanding of what "forget length" means than "length policy".)

Naming things is hard :( I like policy because it implies to me that there's more than two options.

As another alternative, perhaps
forget_length: bool | Literal["recursive"]
to be tighter? It depends on whether it's important for all options to have identical type. (Well, they do, even now: it's a union type.)

Also, why string literals and not enums? Do any enums naturally map onto false and true?

Enum's don't play as nicely with typing as I would like. I don't think you can type a function such that it accepts a literal string or an enum value; even StrEnums are considered different types. The benefit for using enums is the safety; a misspelled identifier throws a NameError. I'm finding with editor annotations that the use of type hints helps eliminate this kind of mistake, and it's my hope that we'll further concrete this once mypy is in the CI. Furthermore, defining classes for a single functions' options feels clunky vs inlining the literal types.

jpivarski · 2023-10-23T14:15:14Z

My one "on the other hand" is that this is an L2 function and we can be strict. But on the first hand, it just seems like this will appear to downstream dependencies as deck-chair shuffling, forcing them to change their code and introduce a version dependence without an obvious benefit.

A union type is a single type. As I understand from your argument, it is the editors/IDEs that will most strongly pass on the benefit of this. (That's why a type hint is preferred over an enum.) Do the editors have a problem with union types? Does it not tab-complete or show you a menu of possible completions if the type is a union type?

agoose77 · 2023-10-23T14:19:48Z

A union type is a single type. As I understand from your argument, it is the editors/IDEs that will most strongly pass on the benefit of this. (That's why a type hint is preferred over an enum.) Do the editors have a problem with union types? Does it not tab-complete or show you a menu of possible completions if the type is a union type?

No, LSP and other tools support unions just fine — we shouldn't discard a union for that reason.

My rationale is essentially that this should always have supported more than one option I think; dropping lengths is a deeper operation than the top level. But I'm working on a few things at the moment, and this is not the hill to die on! So, if you've not been convinced in favour of deprecation in this PR, then I'm happy to kick the bucket down the road to a future "deprecation cycle" and/or not do it :)

jpivarski · 2023-10-23T14:42:05Z

I'll need to look into it more deeply, but I don't see how this is what is needed to fix #2764. The outer length was the only length dropped from type-tracers for a reason. (Partitioning only happens at top level.) I'll let you know if I'm wrong, but I think it can be fixed by a smaller change.

So we'll put this aside for now. I'll make this PR a draft. If #2764 can be solved without it, we can close it.

agoose77 · 2023-10-23T14:50:05Z

The outer length was the only length dropped from type-tracers for a reason. (Partitioning only happens at top level.) I'll let you know if I'm wrong, but I think it can be fixed by a smaller change.

Whilst it's true that we envisage dropping lengths for top-level partitions, I think we now use this function more generally than that. For example, taking an existing array and converting it to typetracer will produce an array whose buffers nearly all have known lengths. There may be cases where this is intended, but others where it is not.

So, whilst we can certainly fix #2764 by doing something special for the typetracer factory method, I think handling all of the other known cases we want to support means adding more policies for length erasure.

agoose77 · 2023-10-26T09:43:04Z

After discussing this with @jpivarski in our weekly Zoom meeting, we've concluded that lengths should never be partially forgotten — it should be all or nothing, as most of the time we don't know lengths because we don't know buffers, i.e. interior nodes cannot have known lengths.

The only case that we'd lose by choosing recursive forgetfulness is when a concrete layout is partitioned into smaller parts. That's not something we do for dask-awkward, nor is it likely that useful.

agoose77 added 3 commits October 22, 2023 16:39

fix: introduce policy for dropping lengths

f1ef764

test: check out length erasure

69b3d4b

test: update/add tests

a5d0892

agoose77 requested a review from jpivarski October 22, 2023 15:50

agoose77 deployed to docs October 22, 2023 16:03 — with GitHub Actions View deployment

jpivarski reviewed Oct 23, 2023

View reviewed changes

jpivarski marked this pull request as draft October 23, 2023 14:42

agoose77 closed this Oct 26, 2023

agoose77 mentioned this pull request Oct 26, 2023

fix: drop lengths recursively #2775

Merged

jpivarski deleted the agoose77/fix-drop-length-policy branch February 1, 2024 19:24

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: add policy for length erasure #2765

fix: add policy for length erasure #2765

agoose77 commented Oct 22, 2023 •

edited

Loading

codecov bot commented Oct 22, 2023

jpivarski left a comment

agoose77 commented Oct 23, 2023

jpivarski commented Oct 23, 2023

agoose77 commented Oct 23, 2023

jpivarski commented Oct 23, 2023

agoose77 commented Oct 23, 2023

agoose77 commented Oct 26, 2023

fix: add policy for length erasure #2765

fix: add policy for length erasure #2765

Conversation

agoose77 commented Oct 22, 2023 • edited Loading

codecov bot commented Oct 22, 2023

Codecov Report

jpivarski left a comment

Choose a reason for hiding this comment

agoose77 commented Oct 23, 2023

jpivarski commented Oct 23, 2023

agoose77 commented Oct 23, 2023

jpivarski commented Oct 23, 2023

agoose77 commented Oct 23, 2023

agoose77 commented Oct 26, 2023

agoose77 commented Oct 22, 2023 •

edited

Loading