The right answer should be to make your dict of arrays into a Record:

>>> akarr = ak.Array({"x": [[1, 2], [3, 4, 5]], "y": [[6, 7, 8, 9], [10]]})
>>> akarr2 = ak.Array({"x": [[10, 20], [30, 40, 50], [110]], "y": [[60, 70, 80, 90], [100], [120]]})
>>> mydict = {"sample1": {"tree1": akarr, "tree2": akarr2}, "sample2": 123}
>>> record = ak.Record(mydict)
>>> record
<Record ... 110], y: [120]}]}, sample2: 123} type='{"sample1": {"tree1": var * {...'>

Sure, the trees have different lengths and the samples can have all sorts of different shapes, but that's allowed because it's just a more complex data type.

>>> print(record.type)
{"sample1": {"tree1": var * {"x": var * int64, "y": var * int64}, "tree2": var * {"x": var * int64, "y": var * int64}}, "sample2": int64}

You can pull each of these items out, and none of these manipulations is limited by the size of the datasets.

>>> record.sample1.tree1
<Array [{x: [1, 2], y: [6, 7, ... 5], y: [10]}] type='2 * {"x": var * int64, "y"...'>

Here's the "should be" part. We should be able to save the data as Parquet and read it back again with no trouble. Parquet was made for these sorts of data structures, and our use of Parquet in Awkward v2 is expanding to include all the metadata, so that nothing gets lost when you write to a file and read it back. (Parquet would then be a "first class" serialization format for Awkward Array.) Your examples don't have any metadata, though. Be sure to install a recent pyarrow, such as 6.0 or 7.0. (That will be required in v2; v1 only requires 2.0.) So we should be able to just write it out and read it back:

>>> ak.to_parquet(record, "output.parquet")
>>> ak.from_parquet("output.parquet")
<Array ['sample1', 'sample2'] type='2 * string'>

Okay, a bad thing happened there: we only got the names of the record fields back. That's because of this bug: #1453. If I have time, I'll fix it today. What's happening is that Parquet only stores array-like data, and our scalar record, like the dict it came from, iterates over its field names:

>>> list(record)
['sample1', 'sample2']
>>> list(mydict)
['sample1', 'sample2']

Nevertheless, you can make a length-1 array for this record with the following idiom:

>>> singleton = ak.Array(record.layout.array)

(For a general record, if you don't know where it came from, you'd have to do a little more work; see the sketch after the type printout below.)

>>> print(singleton.type)
1 * {"sample1": {"tree1": var * {"x": var * int64, "y": var * int64}, "tree2": var * {"x": var * int64, "y": var * int64}}, "sample2": int64}
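For that more general case, here is a sketch. It assumes the v1 layout attributes record.layout.array and record.layout.at: a record points at one position of a backing array that can have more than one entry, so you slice out just that position.

>>> start = record.layout.at  # index of this record within its backing array
>>> singleton = ak.Array(record.layout.array[start : start + 1])  # length-1 array holding only this record

Either way, compare the singleton's type with the original record's type: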
>>> print(record.type)
{"sample1": {"tree1": var * {"x": var * int64, "y": var * int64}, "tree2": var * {"x": var * int64, "y": var * int64}}, "sample2": int64} It's just one of those records. Anyway, now you can write it to Parquet and read it (the >>> ak.to_parquet(singleton, "output.parquet")
>>> ak.from_parquet("output.parquet")
<Array [... 110], y: [120]}]}, sample2: 123}] type='1 * {"sample1": {"tree1": va...'>
>>> ak.from_parquet("output.parquet").tolist() == [
... {"sample1": {"tree1": akarr.tolist(), "tree2": akarr2.tolist()},
... "sample2": 123}]
True
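To package the idiom for reuse, here is a minimal sketch; the helper names save_record and load_record are mine, not part of the Awkward API, and it assumes a record whose backing array has exactly one entry, as in this example:

>>> import awkward as ak
>>> def save_record(record, path):
...     # wrap the scalar Record in a length-1 Array so Parquet sees array-like data
...     ak.to_parquet(ak.Array(record.layout.array), path)
...
>>> def load_record(path):
...     # read the length-1 array back and unwrap the single record
...     return ak.from_parquet(path)[0]
...
>>> save_record(record, "output.parquet")
>>> load_record("output.parquet").sample2
123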
-
Hi,
I have some data that I store in the form of a dict of awkward arrays:
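Something like the following, where the names and values are illustrative (they mirror the example in the answer above):

>>> import awkward as ak
>>> akarr = ak.Array({"x": [[1, 2], [3, 4, 5]], "y": [[6, 7, 8, 9], [10]]})
>>> akarr2 = ak.Array({"x": [[10, 20], [30, 40, 50], [110]], "y": [[60, 70, 80, 90], [100], [120]]})
>>> mydict = {"sample1": {"tree1": akarr, "tree2": akarr2}, "sample2": 123}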
These awkward arrays will be fairly large and can be deeply jagged. I am looking for the most efficient way of saving the information and re-loading it in another part of my framework. I wanted to ask the experts whether you have advice about the best way to do this with the currently supported tools?
Thanks,
Mo