-
Notifications
You must be signed in to change notification settings - Fork 19
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: One-pass column optimization #491
base: main
Are you sure you want to change the base?
Conversation
else: | ||
paths.add(form_key_to_path[form_key]) | ||
|
||
# Select the most appropriate column for each wildcard |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@martindurant is the plan to restore e.g. wildcard matching in this PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think that happens in this PR, this one's just about auto-selecting required columns. We might need more work for the case where the shape of something is required but not the data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm only surface-glancing, but I noted that this logic has been removed and didn't see anywhere that it was recovered. Might have missed something!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right, this should live after checking shape_touched_in, more complex than I had hoped.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK< I have some shonky code in optimize.optimize_columns to deal with shapes. I am probably still missing some touch-buffers on the outputs, but most things are in line.
It's all pretty ugly and in the latest commit I seem to have introduced some typetracer->numpy coersion.
src/dask_awkward/lib/optimize.py
Outdated
def _buf_to_col(s): | ||
return ( | ||
s[2:] | ||
.replace(".content", "") | ||
.replace("-offsets", "") | ||
.replace("-data", "") | ||
.replace("-index", "") | ||
.replace("-mask", "") | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have time right now to look at the full PR here, but is this assumption about buffer name mappings enforced anywhere? Right now the design is such that input sources can use a non-canonical buffer naming scheme.
For coffea I think we made this possible so that Coffea could directly pass in their own buffer names with semantic meaning.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have access to the call to ak.from_buffers
here? That function takes a callable that constructs the buffer names from the form_key
s and these roles.
The roles are fixed words like this, and Coffea only has the power to change the form_key
, not the role. This is almost the complete set of such names; only -tags
(for UnionArray) is missing.
@jpivarski I started writing a reply here, but I think I should just aim for explicit verbosity, and hope we end up on the same page. @martindurant this might be useful too, not sure. When writing the current column optimisation API for dask-awkward, I had the following intentions:
Crucially, the optimisation strategies differ between Parquet, ROOT, and remapped-ROOT:
To handle all of this complexity, I made column optimisation in Dask Awkward a thin contract; the strategy for optimisation remains an implementation-specific feature. We implement an optimisation strategy for "column" (pick-one-child) formats, but not for the ROOT/Remapped-ROOT cases.
As such, Dask Awkward doesn't need to know about buffer / form keys in its optimisation machinery. The logic in I think where possible we should keep this "black-box" approach, and have the IO-func perform the strategy for optimisation. I'll attend the dask-awkward meeting this Friday in case I can be of any more help! (And try to find some more cycles on this). |
I think you are effectively saying we should move the text manipulation to the IOFunction superclass so it can be overridden - which I am fine with. I don't otherwise see the consequence of what you are saying and why you need extra methods. As far as I could tell, the old technique was al about making layers that could go through the "no consequence" run of the graph at optimization time. From my point of view, all IO layers should implement project(), but with the option that it does nothing. (Or, alternatively, a meaningful project() could be what defines an IO Layer, since this is the only thing we need them for). Your engagement is well appreciated! I hope you can help push this through, as it has turned out much harder than I had initially thought, and the implicit contracts in the various protocols are exactly what confused me the most. Now, I believe the only issue with the current tests is some cached state: either the explicit caches, or a missing commit(). |
I have a lot of sympathy for this, it's a rabbit hole.
Yes, in writing my reply I came to realise that the uproot code is really adding to the complexity of thinking about this, even though it's effectively separate. Having to consider coffea-via-uproot on awkward-dask is hard.
Will be glad too. |
I'm also mindful that we want to be clear about what "column" means; I think dask-awkward should/can only reason about buffers, and it's the IO-func's job to figure out whether that means "columns" (i.e. coffea reports the TTree branches for |
@martindurant also, I'd like to apologise for how long it took to get that detailed reply onto paper; my personal life has had quite a lot going on recently (new house among them!) that means I've had much less spare time than anticipated. |
OK, so agreed: we can maintain the buffer names while optimising, and only normalise to the IO-specific column names within the IOFunc's project() method, so that different backends can handle that differently. Also I can get rid of the "@." part, which I copied from the previous optimisation code and I don't think it does anything. @jpivarski , this is a place where "."-containing field names might be a problem:
|
done |
I'll hopefully take a look at this before our meeting tomorrow :) |
Ping @agoose77 , do you think you have time to look at this? |
@martindurant one question that came to me a while ago (only just actioning now) is - why do we store the report on the layer? Can we not just retrieve it from the assert ak.backend(array) == "typetracer"
form, length, container = ak.to_buffers(array)
reports = {b.report for b in container.values() if b is not None} |
If we only need to store the meta on the layer, fine. But we cannot access the meta of the collection (dak.Array) at optimize time, since dask only passes the combined graph of all the collections that were passed to compute(). |
@lgray , this PR is finally pretty close. Could you do some speed-tests/profiles on moderate workloads to see if a) they work as expected and b) we really achieve what we expect? It should be, that the total wall time is decently improved, but more importantly, that the compute() call is decidedly faster, since the optimize phase is mostly spread through the graph building steps (each of which will be slightly slower). |
I'll probably need this test to function:
To be able to run any of my tests. |
Oh, right - fair enough. I'll look at what patch root needs. In this codebase, of course we only ensure parquet and json. |
…lled (except the uproot test itself)
tests/test_inspect.py::test_basic_root_works is fixed with scikit-hep/uproot5#1225 The two remaining failures are:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an initial-thoughts review. I'll propose some concrete changes where I can in the next few days. The main things to tackle in my view are:
- Re-implement
necessary_columns
in terms of the IO function
- "columns" in my view (and hence the existing implementation) represents the IO-function's definition of a column, i.e. it's different between uproot, and Parquet/JSON.
- We should ask the IO-function to translate buffer names to column names e.g. https://github.com/scikit-hep/uproot5/blob/734700ef1f822338b03a7573df484909b317b2c2/src/uproot/_dask.py#L1077
- Understand how coffea & uproot need to change to match this new dask-awkward-driven buffer mapping
There's probably more to think about, but as I say this is just an initial thought. I recognise that the coupling between coffea and uproot is confusing, so I'll be sure to start with that.
src/dask_awkward/lib/core.py
Outdated
@@ -518,6 +523,7 @@ def f(self, other): | |||
meta = op(self._meta, other._meta) | |||
else: | |||
meta = op(self._meta, other) | |||
[_.commit(name) for _ in self.report] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's lift this into a utility function e.g.
def commit_to_reports(name, reports):
for report in reports:
report.commit(name)
@@ -398,6 +399,10 @@ def name(self) -> str: | |||
def key(self) -> Key: | |||
return (self._name, 0) | |||
|
|||
@property | |||
def report(self): | |||
return getattr(self._meta, "_report", set()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we lift _report
from the meta to the Layer? I know that it's more state to keep in sync, but I don't love the act of tagging non-reserved attributes onto an Awkward Array, and this will better separate the stateful part of typetracer from the stateless part.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jpivarski I wonder whether we should add a new merge
function to TypeTracerReport
. Due to the uniqueness of layer IDs, and the existing association of reports with layouts, we can replace the need to carry a set of reports with a single report via
merge(A, B) -> A
I haven't yet thought about the memory implication for effectively having "larger" reports, but this might be a nicer UX.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These Awkward Arrays are not accessible to users ("_meta
" has an underscore); they're internal, bookkeeping objects. There's a similar use of temporary attributes in conversions from Arrow:
and
We're using the convenience of Python objects as dicts to handle a bookkeeping problem that would get more complicated if we were to try to do it in a statically typable way (although we could also make these "backdoors" into nullable, private attributes that are unused in almost all other circumstances).
An attempt to do this a bit more by-the-book with Indexes unfortunately made too much noise. The Index class has an arbitrary metadata
that can be used to stash whatever we need:
although it was only ever used for one thing: associating a multidimensional integer array slice with an Index, which is one-dimensional.
That's pretty obscure, but Index now has this mechanism everywhere, and it must make people wonder why it's there.
A hacky solution, like attaching temporary metadata to a Python object as though it were a dict, can prevent an implementation detail about a corner case from getting "noisy," alerting all parts of the code about its existence. If the hacky solution is implemented incorrectly and the hack spreads, then we'd be wishing we had that noise to help understand what's going wrong, but if it's implemented correctly and no one finds out what happens in Vegas, making it formal would obfuscate the normal case with an outlier.
So I'd be in favor of leaving in the _report
attribute handling, especially if something ensures that it never leaves the graph-optimization phase. The example above with a __pyarrow_original
attribute (which includes its context in its name, by the way) has an explicit, recursive procedure to ensure that no layouts escape ak.from_arrow
with that attribute intact. I don't know if there's a similar guarantee for _report
.
src/dask_awkward/lib/io/io.py
Outdated
array_meta, report = typetracer_with_report( | ||
form_with_unique_keys(io_func.form, "@"), | ||
highlevel=True, | ||
behavior=io_func.behavior, | ||
buffer_key=render_buffer_key, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the big change:
Control over buffer names now solely rests with dask-awkward, rather than the provider of the IO function e.g. uproot.
This means that dask-awkward generates buffer names and gives them to the IO-function, meaning that the IO-functions now need to understand how to parse these keys instead of dask-awkward asking the IO-functions to parse them. This is all semantics; this is a trivial operation and we can define public functions in dask-awkward to do this.
Defining our column-optimisation interface at the Form level still ensures unique names, but it now requires that the IO-function understands how to map from the form to the buffers it works.1 A nice perk of this change is that because dask-awkward now controls buffer names, we can ensure that they are globally unique (i.e. between different input layers) e.g. by embedding the layer ID. This might prove useful.
Footnotes
-
To understand the consequence of this, I need to more intimately remind myself how coffea currently remaps keys (where keys are VM instructions) to that which it presents to uproot. ↩
src/dask_awkward/lib/optimize.py
Outdated
def _optimize_columns(dsk, all_layers): | ||
for k, lay in dsk.copy().items(): | ||
if not isinstance(lay, AwkwardInputLayer) or not hasattr( | ||
lay.io_func, "_column_report" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just like with _report
, I'd like to formalise this as part of the API. This would become a part of the column-projection interface.
src/dask_awkward/lib/optimize.py
Outdated
for c in second.copy(): | ||
if "." in c: | ||
parent = c.rsplit(".", 1)[0] | ||
second.discard(parent) | ||
out[k] = sorted(second) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This logic assumes that buffer names are .
delimited and are hierarchical, i.e. foo.bar
is below foo
. This isn't true for form remapping, which can move arbitrary buffers around.
That is to say, the real operation we want to implement is necessary_buffers
which happens to be necessary_columns
for the Parquet/JSON schemes.
I don't seem to be able to push to the repo, so here's a patch for the reports helper.1 Footnotes
|
First attempt at this on a realistic analysis yields massive amounts of over-touching. new (dak==this pr, uproot==martin's branch):
old (this has been confirmed to be both correct and minimal, dak==5.3.0, uproot=5.3.7):
This is after a few transformations on input data to add some useful columns and such. However, it did work and certain portions were noticeably rather faster! Some slower. |
Only three failures:
|
@LGrey - moved from IO reporting layers returning a tuple to returning a record (or dict) for my own sanity.
Merge main into one-pass
for more information, see https://pre-commit.ci
To-do:
figure out what text manipulation is really requiredreinstate necessary_columns based on optimize_columns