docs: add better code-docs on buffer optimisation #564

Merged (17 commits) on Dec 18, 2024

@agoose77 (Collaborator) commented Dec 18, 2024

Although we are soon to rework this, I'm updating the docs in case I run out of time in my other PR reviews.

This is a good opportunity to explain how buffer optimisation currently works.

In dask-awkward, there are two separate concepts:

  • buffer -- a 1D array of data that Awkward Array consumes in ak.from_buffers
  • column -- a possibly-structured "array-like" whose structure / type depends upon the IO source.

Each IO source has its own type of column in this language -- uproot has TTree keys, whilst Parquet has fields. The distinction matters because remapped arrays may have a convoluted "column-buffer" relationship, e.g. several arrays that share the offsets buffer from a single IO source column.
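To make the column-buffer relationship concrete, here is a minimal, dependency-free sketch. The column names, buffer names, and the `providers_by_buffer` helper are all invented for illustration; they are not the naming scheme used by uproot, Parquet, or dask-awkward.

```python
from collections import defaultdict

# Hypothetical mapping from IO-source columns to the 1D buffers they supply.
# Both jagged "muon_*" columns are assumed to share one offsets buffer.
COLUMN_BUFFERS = {
    "muon_pt": ["muon-offsets", "muon_pt-data"],
    "muon_eta": ["muon-offsets", "muon_eta-data"],
    "run": ["run-data"],
}


def providers_by_buffer(column_buffers):
    """Invert the mapping: for each buffer, list the columns that can supply it."""
    providers = defaultdict(list)
    for column, buffers in column_buffers.items():
        for buffer_name in buffers:
            providers[buffer_name].append(column)
    return dict(providers)


providers = providers_by_buffer(COLUMN_BUFFERS)
# A shared buffer maps back to more than one column, so "which buffers are
# needed" and "which columns are needed" are different questions.
shared = sorted(name for name, cols in providers.items() if len(cols) > 1)
```

In this toy mapping only the offsets buffer is shared, which is why knowing the needed buffers does not, on its own, tell a user which columns to request.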

Given that the details of buffer projection need to be defined by the IO sources (e.g. Parquet has to perform form unprojection, whilst uproot does not), it is conceptually simplest to treat these two things as separate internal and external concerns: users want to know which high-level columns are needed, whilst dask-awkward needs to know which buffers are needed.

As such, the optimisation is really more of the following conversation:

  1. dask-awkward:
    Hello uproot, can you "prepare" for buffer optimisation by giving me something to replace your input layer with, a special typetracing report, and any unknown state you might later need?
  2. uproot:
    Sure. Here's a new input layer that doesn't require any compute, here's the typetracer report you asked for, and can you please hold on to this state for me?
  3. dask-awkward:
    Sure! OK, I've now built a full graph by repeating (1) for each input. Now, I will compute it, and collect the reports and states.
  4. dask-awkward:
    Hello uproot, I have determined which buffers I need. Can you give me a new input layer that only loads these buffers? Here's the state that you gave me earlier!

In this conversation, dask-awkward does not need to talk about columns at all. It also does not need any special buffer name convention besides that each buffer name is unique.
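The conversation above can be sketched as a small two-method protocol. This is a toy model under stated assumptions, not dask-awkward's or uproot's actual code: `ToySource`, `Report`, `optimise`, and the method names `prepare_for_projection` and `project` are illustrative, and the "typetracer report" is reduced to a set of touched buffer names.

```python
from dataclasses import dataclass, field


@dataclass
class Report:
    """Stand-in for a typetracer report: records which buffers were touched."""
    touched: set = field(default_factory=set)


class ToySource:
    """Hypothetical IO source holding named 1D buffers."""

    def __init__(self, buffers):
        self.buffers = dict(buffers)  # unique buffer name -> 1D data

    def prepare_for_projection(self):
        """Steps 1-2: return a compute-free layer, a report, and opaque state."""
        report = Report()
        state = {"all_buffers": sorted(self.buffers)}

        def metadata_layer(name):
            # No data is read here; the access is only recorded.
            if name not in self.buffers:
                raise KeyError(name)
            report.touched.add(name)

        return metadata_layer, report, state

    def project(self, report, state):
        """Step 4: return a new source that loads only the touched buffers."""
        keep = [n for n in state["all_buffers"] if n in report.touched]
        return ToySource({n: self.buffers[n] for n in keep})


def optimise(sources, computation):
    """The dask-awkward side: it only ever handles buffer names, never columns."""
    prepared = [s.prepare_for_projection() for s in sources]
    computation([layer for layer, _, _ in prepared])  # step 3: run on metadata
    return [
        source.project(report, state)
        for source, (_, report, state) in zip(sources, prepared)
    ]


def computation(layers):
    (layer,) = layers
    layer("x-offsets")
    layer("x-data")


source = ToySource({"x-offsets": [0, 2, 3], "x-data": [1.0, 2.0, 3.0], "y-data": [4, 5, 6]})
(projected,) = optimise([source], computation)
```

Here `projected` keeps only the two `x-*` buffers, and the state handed back in step 4 is never inspected by `optimise`, mirroring the "please hold on to this state for me" part of the exchange.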

@agoose77 agoose77 changed the title docs: add better docs on column optimisation docs: add better code-docs on column optimisation Dec 18, 2024
@agoose77 agoose77 changed the title docs: add better code-docs on column optimisation docs: add better code-docs on buffer optimisation Dec 18, 2024
@agoose77 agoose77 merged commit 5e431bc into main Dec 18, 2024
25 checks passed