Building arrays of a specified dtype #328
Replies: 7 comments
-
You're right to start with the high-level functions. The array

>>> import awkward1 as ak
>>> ak.Array([[1, 2, 3], [0, 4]])
<Array [[1, 2, 3], [0, 4]] type='2 * var * int64'>

is laid out in memory as

>>> ak.Array([[1, 2, 3], [0, 4]]).layout
<ListOffsetArray64>
<offsets><Index64 i="[0 3 5]" offset="0" length="3" at="0x55b581801c80"/></offsets>
<content><NumpyArray format="l" shape="5" data="1 2 3 0 4" at="0x55b581803c90"/></content>
</ListOffsetArray64>

These classes, ListOffsetArray, Index, NumpyArray, etc., are documented here. They're part of the public API (hence, no underscores), but not what you'd usually use in analysis.

You want to build something just like this but with 8-bit integers instead of 64-bit integers. You can use the layout classes' constructors directly:

>>> import numpy as np
>>> content = ak.layout.NumpyArray(np.array([1, 2, 3, 0, 4], np.int8))
>>> index = ak.layout.Index32(np.array([0, 3, 5], np.int32))
>>> listoffsetarray = ak.layout.ListOffsetArray32(index, content)
>>> listoffsetarray
<ListOffsetArray32>
<offsets><Index32 i="[0 3 5]" offset="0" length="3" at="0x55b581345b30"/></offsets>
<content><NumpyArray format="b" shape="5" data="0x 01020300 04" at="0x55b581313990"/></content>
</ListOffsetArray32>

(Note: integer types for Index and the classes that use Indexes are fairly limited: 32-bit, unsigned 32-bit, and 64-bit.)

The ListOffsetArray object won't do all the things you'd like to do:

>>> np.square(listoffsetarray)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unsupported operand type(s) for *: 'awkward1._ext.NumpyArray' and 'awkward1._ext.NumpyArray'

but you can wrap it in a high-level array:

>>> array = ak.Array(listoffsetarray)
>>> array
<Array [[1, 2, 3], [0, 4]] type='2 * var * int8'>
>>> np.square(array)
<Array [[1, 4, 9], [0, 16]] type='2 * var * int8'>

(Working through this example, taking square roots of

I hope this answers your question. If it's satisfactory, I'd like to close this issue: in the old repo I ended up with a lot of open-but-inactive issues, and I'd like to restrict the open issues to the ones that require some action. Do you need anything more than the above?
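
For readers who want to see what the offsets and content buffers in the layout printouts above encode, here is a minimal sketch in plain Python. It does not use Awkward Array at all; `rows_from_offsets` and `buffer_bytes` are hypothetical helper names invented for this illustration, and the item sizes passed in are simply the byte widths of the integer types discussed (8 for 64-bit, 4 for 32-bit, 1 for 8-bit).

```python
def rows_from_offsets(offsets, content):
    """Rebuild the nested rows: row i is content[offsets[i]:offsets[i + 1]]."""
    return [list(content[start:stop]) for start, stop in zip(offsets[:-1], offsets[1:])]


def buffer_bytes(offsets, content, index_itemsize, content_itemsize):
    """Bytes taken by the two buffers, given the byte width of each item."""
    return len(offsets) * index_itemsize + len(content) * content_itemsize


# The same buffers that appear in the layout printouts above.
offsets = [0, 3, 5]
content = [1, 2, 3, 0, 4]

rows = rows_from_offsets(offsets, content)
print(rows)                                   # [[1, 2, 3], [0, 4]]
print(buffer_bytes(offsets, content, 8, 8))   # 64 (64-bit offsets, 64-bit content)
print(buffer_bytes(offsets, content, 4, 1))   # 17 (32-bit offsets, 8-bit content)
```

The saving comes almost entirely from the content buffer, which is why choosing its dtype matters more than choosing the index width for long arrays.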
-
Thanks for the very quick and detailed answer! Your solution with

Is there any way of appending new rows (lists) to the array while still preserving the dtype?
-
Awkward arrays are immutable objects; the

If it's okay to grow your dataset in chunks, you could use ak.concatenate to merge them. That's a copy, though, as it makes a new physically contiguous array. Depending on your performance needs, you might not want that.

Another alternative (also assuming you can grow your dataset in chunks) is to use a PartitionedArray. PartitionedArray is a layout type, and unlike the others, it can only be applied at the root of a structure. It allows you to logically join physically disjoint arrays, and thus build up a dataset from chunks without copying. The cost is a little indirection when you access it, and operations involving any PartitionedArrays might need to implicitly repartition to ensure that all arrays have the same partitioning. If your calculations are all derived from a single partitioned array, they'll generally be partitioned the same way (especially if you use masking rather than filtering with boolean arrays, as this maintains the length of the array while crossing out the elements you don't want to use).

There is a high-level function for constructing a partitioned array, ak.partitioned, though I'm unsure if it has the right interface. (It asks for a function to generate partitions and a number of partitions; it might be more natural to just take a list of arrays, as a "soft" alternative to
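
The trade-off between the two chunked strategies described above can be sketched in plain Python. This is only a conceptual illustration, not how Awkward Array implements ak.concatenate or PartitionedArray: `concatenate_chunks` and `PartitionedRows` are hypothetical names, and each chunk is assumed to be an (offsets, content) pair like the buffers shown earlier in the thread.

```python
from bisect import bisect_right
from itertools import accumulate


def concatenate_chunks(chunks):
    """Copying strategy: merge (offsets, content) chunks into one contiguous pair."""
    offsets, content = [0], []
    for chunk_offsets, chunk_content in chunks:
        # Shift this chunk's offsets so they continue where the previous chunk ended.
        base = offsets[-1] - chunk_offsets[0]
        offsets.extend(o + base for o in chunk_offsets[1:])
        content.extend(chunk_content[chunk_offsets[0]:chunk_offsets[-1]])
    return offsets, content


class PartitionedRows:
    """Non-copying strategy: keep chunks separate and redirect each row lookup."""

    def __init__(self, chunks):
        self.chunks = list(chunks)
        # stops[k] is the total number of rows in chunks 0..k.
        self.stops = list(accumulate(len(o) - 1 for o, _ in self.chunks))

    def __len__(self):
        return self.stops[-1] if self.stops else 0

    def __getitem__(self, i):
        if not 0 <= i < len(self):
            raise IndexError(i)
        k = bisect_right(self.stops, i)               # which chunk holds row i
        local = i - (self.stops[k - 1] if k else 0)   # row number inside that chunk
        offsets, content = self.chunks[k]
        return list(content[offsets[local]:offsets[local + 1]])


chunks = [([0, 3, 5], [1, 2, 3, 0, 4]), ([0, 1, 1, 3], [7, 8, 9])]
print(concatenate_chunks(chunks))    # ([0, 3, 5, 6, 6, 8], [1, 2, 3, 0, 4, 7, 8, 9])
partitioned = PartitionedRows(chunks)
print([partitioned[i] for i in range(len(partitioned))])
# [[1, 2, 3], [0, 4], [7], [], [8, 9]]
```

The first function pays for a copy once and then reads rows directly; the class avoids the copy but performs a small search on every access, which is the "little indirection" mentioned above.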
-
Okay, I guess I'll just continue creating the nested lists in Numba and then convert them to ragged arrays at the end. I'll try to look into partitioned arrays as you mention, but for now I think I'll just use the following small helper function (written below in case others need something similar). Thanks a lot for the detailed, quick answers! You may close the topic now :)

import awkward1 as ak
import numpy as np
from numba import njit
from numba.typed import List

@njit
def flatten_nested_list(nested_list):
    # Concatenate all inner lists into one flat array (the content buffer).
    res = List()
    for lst in nested_list:
        for x in lst:
            res.append(x)
    return np.asarray(res)

@njit
def get_cumulative_indices(nested_list, index_dtype=np.int32):
    # Offsets: index[i] is where row i starts, index[i+1] is where it stops.
    index = np.zeros(len(nested_list)+1, index_dtype)
    for i, lst in enumerate(nested_list):
        index[i+1] = index[i] + len(lst)
    return index

def nested_list_to_awkward_array(nested_list, index_dtype=np.int32):
    # The content keeps the dtype of the values stored in the nested list.
    content = ak.layout.NumpyArray(flatten_nested_list(nested_list))
    # Index32 is hard-coded here, so index_dtype should stay np.int32.
    index = ak.layout.Index32(get_cumulative_indices(nested_list, index_dtype))
    listoffsetarray = ak.layout.ListOffsetArray32(index, content)
    array = ak.Array(listoffsetarray)
    return array
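
Since the helper above accepts an index_dtype but always builds a 32-bit index, one might want to check that the offsets actually fit. The following is a minimal sketch in plain Python, assuming only the three index widths named earlier in the thread (32-bit, unsigned 32-bit, 64-bit); `offsets_from_lengths` and `smallest_index_dtype` are hypothetical names, and the function returns dtype names as strings rather than any Awkward class.

```python
from itertools import accumulate

# Largest offset each index width can hold, narrowest first.
_MAX_OFFSET = {"int32": 2**31 - 1, "uint32": 2**32 - 1, "int64": 2**63 - 1}


def offsets_from_lengths(lengths):
    """Cumulative offsets: [0, len0, len0 + len1, ...]."""
    return [0] + list(accumulate(lengths))


def smallest_index_dtype(offsets):
    """Return the narrowest of int32/uint32/int64 that can hold the last offset."""
    total = offsets[-1]
    for name, limit in _MAX_OFFSET.items():
        if total <= limit:
            return name
    raise OverflowError("too many elements for a 64-bit index")


nested = [[1, 2, 3], [0, 4]]
offsets = offsets_from_lengths(len(row) for row in nested)
print(offsets, smallest_index_dtype(offsets))   # [0, 3, 5] int32
```

Only the last offset needs checking, because offsets never decrease.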
-
I think this is a good solution. Using Numba to make pieces of an Awkward array or indexes to slice an Awkward array is more performant than using

Good luck with your project!
-
You might be interested in the
-
Oh, cool, I'll look into that!
-
Hi,
First and foremost, thanks for a really useful tool!
I am having trouble specifying the dtype for Awkward arrays. For simple arrays it works well, e.g.:
Here both methods preserve the dtype:
<Array [0, 1, 2, 3, 4] type='5 * int8'>
(compared to e.g. ak.from_iter(x): <Array [0, 1, 2, 3, 4] type='5 * int64'>.)
My problem is that I have to create nested lists (or, preferably, ragged arrays) of a specific dtype (to reduce memory).
Right now I make the nested lists in numba:
which gives the following output:
How do I convert such a nested list to an awkward array while still keeping the correct dtype?
I have tried a few different things:
1.
In:
Out:
2.
In:
Out:
3.
In:
Out:
As I see it, the ArrayBuilder would be the way to go, since I then wouldn't have to create the nested list in Numba and convert it to an Awkward array afterwards. However, looking at its documentation, I fail to see how one specifies the dtype (other than int/float/bool/etc.).
Do you have any suggestions on how to create ragged arrays of specific dtypes, either directly with Awkward and the ArrayBuilder or by converting the nested lists to Awkward arrays?
Thanks a lot!
Cheers,
Christian