fix: add policy for length erasure #2765
This is an L2 public function, so we can, in principle, be strict with users. However, it doesn't seem to be necessary here. Changing `forget_length: bool` into `length_policy: Literal["keep", "drop_outer", "drop_recursive"]` requires calling functions to change, so there's a deprecation cycle. But it could have been changed to `forget_length: bool | Literal["false", "true", "recursive"]` such that `False` corresponds to `"false"`, `True` corresponds to `"true"` (dropping just the outer length, which is the old behavior), and `"recursive"` is the only new behavior: dropping lengths recursively.
That way, there would be no need to make downstream libraries change. Since we're strictly extending the old behavior (new set of options is a superset of the old set of options) and there's a natural way to keep using the old words while introducing a new word, this is less of an imposition. Only downstream libraries that want the new behavior need to do anything; others don't even need to know about the change. (In addition, I have a more immediate understanding of what "forget length" means than "length policy".)
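To make the backward-compatible proposal concrete, here is a minimal sketch of how such a union-typed argument could be normalized internally. The helper name `normalize_forget_length` and the type aliases are hypothetical and exist only for illustration; they are not part of the library. The policy names are the ones from this PR.

```python
from typing import Literal, Union

# Hypothetical aliases for illustration only; not the library's API.
LengthPolicy = Literal["keep", "drop_outer", "drop_recursive"]
ForgetLength = Union[bool, Literal["false", "true", "recursive"]]

# String spellings proposed in the comment above, mapped to the PR's policies.
_POLICY_BY_NAME = {
    "false": "keep",
    "true": "drop_outer",
    "recursive": "drop_recursive",
}


def normalize_forget_length(forget_length: ForgetLength) -> LengthPolicy:
    """Map the old boolean values and the new strings onto a single policy."""
    if isinstance(forget_length, bool):
        # Old behavior: True drops only the outer length, False keeps everything.
        return "drop_outer" if forget_length else "keep"
    if isinstance(forget_length, str) and forget_length in _POLICY_BY_NAME:
        return _POLICY_BY_NAME[forget_length]
    raise ValueError(f"unrecognized forget_length: {forget_length!r}")
```

Existing callers that pass `True` or `False` would keep working unchanged under this scheme; only callers that want the new behavior would pass `"recursive"`.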
As another alternative, perhaps `forget_length: bool | Literal["recursive"]` to be tighter? It depends on whether it's important for all options to have identical type. (Well, they do, even now: it's a union type.)
Also, why string literals and not enums? Do any enums naturally map onto false and true?
My personal preference is fairly strongly against mixing types like this, especially at L2 where we can move further away from interactive-friendly APIs towards stricter typing.
That is a benefit, though, of gradually extending the argument.
Naming things is hard :( I like
Enums don't play as nicely with typing as I would like. I don't think you can type a function such that it accepts a literal string or an enum value; even
My one "on the other hand" is that this is an L2 function and we can be strict. But on the first hand, it just seems like this will appear to downstream dependencies as deck-chair shuffling, forcing them to change their code and introduce a version dependence without an obvious benefit. A union type is a single type. As I understand from your argument, it is the editors/IDEs that will most strongly pass on the benefit of this. (That's why a type hint is preferred over an enum.) Do the editors have a problem with union types? Does it not tab-complete or show you a menu of possible completions if the type is a union type?
No, LSP and other tools support unions just fine, so we shouldn't discard a union for that reason. My rationale is essentially that this should always have supported more than one option, I think; dropping lengths is a deeper operation than the top level. But I'm working on a few things at the moment, and this is not the hill to die on! So, if you've not been convinced in favour of deprecation in this PR, then I'm happy to kick the can down the road to a future "deprecation cycle" and/or not do it :)
I'll need to look into it more deeply, but I don't see how this is what is needed to fix #2764. The outer length was the only length dropped from type-tracers for a reason. (Partitioning only happens at top level.) I'll let you know if I'm wrong, but I think it can be fixed by a smaller change. So we'll put this aside for now. I'll make this PR a draft. If #2764 can be solved without it, we can close it.
Whilst it's true that we envisage dropping lengths for top-level partitions, I think we now use this function more generally than that. For example, taking an existing array and converting it to typetracer will produce an array whose buffers nearly all have known lengths. There may be cases where this is intended, but others where it is not. So, whilst we can certainly fix #2764 by doing something special for the typetracer factory method, I think handling all of the other known cases we want to support means adding more policies for length erasure.
After discussing this with @jpivarski in our weekly Zoom meeting, we've concluded that lengths should never be partially forgotten: it should be all or nothing, as most of the time we don't know lengths because we don't know buffers, i.e. interior nodes cannot have known lengths. The only case that we'd lose by choosing recursive forgetfulness is when a concrete layout is partitioned into smaller parts. That's not something we do for dask-awkward, nor is it likely to be that useful.
Fix #2764 by adding support for recursive erasure of length information.
Instead of `forget_length: bool`, we have `forget_length: Literal["keep", "drop_outer", "drop_recursive"]`, which enables us to decide how deep the lengths are forgotten. For the `Form → Content` case, we always want to drop lengths recursively.
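As a rough illustration of what the three policies mean, the sketch below applies them to a toy nested structure. `ToyNode` and `erase_lengths` are hypothetical stand-ins invented for this example; they are not the library's layout classes or its actual implementation, and `None` is assumed here to represent an unknown length.

```python
from dataclasses import dataclass, replace
from typing import Literal, Optional

LengthPolicy = Literal["keep", "drop_outer", "drop_recursive"]


@dataclass(frozen=True)
class ToyNode:
    """Hypothetical stand-in for a layout node; not a real layout class."""

    length: Optional[int]  # None stands for "unknown length" in this sketch
    content: Optional["ToyNode"] = None


def erase_lengths(node: ToyNode, policy: LengthPolicy) -> ToyNode:
    """Return a copy of `node` with lengths forgotten according to `policy`."""
    if policy == "keep":
        return node
    if policy == "drop_outer":
        # Forget only the top-level length; nested lengths stay known.
        return replace(node, length=None)
    if policy == "drop_recursive":
        # Forget the length at this level and at every nested level.
        inner = None if node.content is None else erase_lengths(node.content, policy)
        return ToyNode(length=None, content=inner)
    raise ValueError(f"unrecognized policy: {policy!r}")


layout = ToyNode(3, ToyNode(7, ToyNode(7)))
outer_only = erase_lengths(layout, "drop_outer")
everything = erase_lengths(layout, "drop_recursive")
```

With `"drop_outer"`, the interior nodes still report a length of 7, which is the partially forgotten state discussed in the review thread; with `"drop_recursive"`, no node retains a known length.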