The right answer should be to make your dict of arrays into a Record:

>>> akarr = ak.Array({"x": [[1, 2], [3, 4, 5]], "y": [[6, 7, 8, 9], [10]]})
>>> akarr2 = ak.Array({"x": [[10, 20], [30, 40, 50], [110]], "y": [[60, 70, 80, 90], [100], [120]]})
>>> mydict = {"sample1": {"tree1": akarr, "tree2": akarr2}, "sample2": 123}
>>> record = ak.Record(mydict)
>>> record
<Record ... 110], y: [120]}]}, sample2: 123} type='{"sample1": {"tree1": var * {...'>

Sure, the trees have different lengths and the samples can have all sorts of different shapes, but that's allowed because it's just a more complex data type.

>>> print(record.type)
{"sample1": {"tree1": var * {"x": var * int64, "y": var * int64}, "tree2": var * {"x": var * int64, "y": var * int64}}, "sample2": int64}

You can pull each of these items out, and none of these manipulations is limited by the size of the datasets.

>>> record.sample1.tree1
<Array [{x: [1, 2], y: [6, 7, ... 5], y: [10]}] type='2 * {"x": var * int64, "y"...'>

Here's the "should be" part. We should be able to save the data as Parquet and read it back again with no trouble. Parquet was made for these sorts of data structures, and our use of Parquet in Awkward v2 is expanding to include all the metadata, so that nothing gets lost when you write to a file and read it back. (Parquet would then be a "first class" serialization format for Awkward Array.) Your examples don't have any metadata, though. Be sure to install a recent pyarrow, such as 6.0 or 7.0. (That will be required in v2; v1 only requires 2.0.) So we should be able to just write it out and read it back:

>>> ak.to_parquet(record, "output.parquet")
>>> ak.from_parquet("output.parquet")
<Array ['sample1', 'sample2'] type='2 * string'>

Okay, a bad thing happened there: we only got the names of the record fields back. That's because of this bug: #1453. If I have time, I'll fix it today. What's happening is that Parquet only stores array-like data, and our scalar record, like the dict it came from, iterates over its field names:

>>> list(record)
['sample1', 'sample2']
>>> list(mydict)
['sample1', 'sample2']

Nevertheless, you can make a length-1 array for this record with the following idiom:

>>> singleton = ak.Array(record.layout.array)

(For a general record, if you don't know where it came from, you'd have to do a little more work; see the sketch after the type printout below.)

>>> print(singleton.type)
1 * {"sample1": {"tree1": var * {"x": var * int64, "y": var * int64}, "tree2": var * {"x": var * int64, "y": var * int64}}, "sample2": int64}
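For that more general case, here is a sketch. It assumes the v1 layout attributes record.layout.array and record.layout.at: a record points at one position of a backing array that can have more than one entry, so you slice out just that position.

>>> start = record.layout.at  # index of this record within its backing array
>>> singleton = ak.Array(record.layout.array[start : start + 1])  # length-1 array holding only this record

Either way, compare the singleton's type with the original record's type: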
>>> print(record.type)
{"sample1": {"tree1": var * {"x": var * int64, "y": var * int64}, "tree2": var * {"x": var * int64, "y": var * int64}}, "sample2": int64} It's just one of those records. Anyway, now you can write it to Parquet and read it (the >>> ak.to_parquet(singleton, "output.parquet")
>>> ak.from_parquet("output.parquet")
<Array [... 110], y: [120]}]}, sample2: 123}] type='1 * {"sample1": {"tree1": va...'>
>>> ak.from_parquet("output.parquet").tolist() == [
... {"sample1": {"tree1": akarr.tolist(), "tree2": akarr2.tolist()},
... "sample2": 123}]
True
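To package the idiom for reuse, here is a minimal sketch; the helper names save_record and load_record are mine, not part of the Awkward API, and it assumes a record whose backing array has exactly one entry, as in this example:

>>> import awkward as ak
>>> def save_record(record, path):
...     # wrap the scalar Record in a length-1 Array so Parquet sees array-like data
...     ak.to_parquet(ak.Array(record.layout.array), path)
...
>>> def load_record(path):
...     # read the length-1 array back and unwrap the single record
...     return ak.from_parquet(path)[0]
...
>>> save_record(record, "output.parquet")
>>> load_record("output.parquet").sample2
123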
-
Hi,
I have some data that I store in the form of a dict of awkward arrays:
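Something like the following, where the names and values are illustrative (they mirror the example in the answer above):

>>> import awkward as ak
>>> akarr = ak.Array({"x": [[1, 2], [3, 4, 5]], "y": [[6, 7, 8, 9], [10]]})
>>> akarr2 = ak.Array({"x": [[10, 20], [30, 40, 50], [110]], "y": [[60, 70, 80, 90], [100], [120]]})
>>> mydict = {"sample1": {"tree1": akarr, "tree2": akarr2}, "sample2": 123}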
These awkward arrays will be fairly large and can be deeply jagged. I am looking for the most efficient way of saving the information and re-loading it in another part of my framework. I wanted to ask the experts whether you have advice about the best way to do this with the currently supported tools?
Thanks,
Mo