-
There's also a distinction on the Arrow side between `list`, which uses 32-bit offsets, and `large_list`, which uses 64-bit offsets. Since it came later, Awkward Array's lists use 64-bit signed indexes by default, but can also use 32-bit signed or unsigned. Since 64-bit is the default, many operations will end up giving you 64-bit indexes (for simplicity in implementation, actually). When converting an Awkward Array to Arrow with ak.to_arrow or Parquet with ak.to_parquet, the 64-bit Awkward lists are converted into Arrow `large_list`. However, both of these functions have a `list_to32` option that converts the offsets to 32-bit on the fly, so the data come out as ordinary Arrow `list` instead (there's a short sketch of this at the end of this reply).

If PyTorch is not actually objecting to the `large_list` but to the `not null`, then what you need is option-type (nullable) values, `?float64`, instead of non-nullable `float64`, where the `?` in the type means the values are allowed to be missing. I don't know if we have a nice function for promoting non-nullable data into nullable data. A hacky way to do it would be to concatenate a missing value at the right depth and then slice it off, like this:

>>> array = ak.Array([[1.1, 2.2, 3.3], [], [4.4, 5.5]])
>>> array
<Array [[1.1, 2.2, 3.3], [], [4.4, 5.5]] type='3 * var * float64'>
>>> ak.concatenate((array, [[None]]))
<Array [[1.1, 2.2, 3.3], [], [4.4, 5.5], [None]] type='4 * var * ?float64'>
>>> ak.concatenate((array, [[None]]))[:-1]
<Array [[1.1, 2.2, 3.3], [], [4.4, 5.5]] type='3 * var * ?float64'>

There would be faster-for-the-computer ways of doing this by inserting an `UnmaskedArray` node into the layout. Well, maybe this:

>>> empty_missingness = ak.Array(ak.contents.UnmaskedArray(ak.contents.EmptyArray()))[np.newaxis][:0]
>>> empty_missingness
<Array [] type='0 * 0 * ?unknown'>
>>> ak.concatenate((array, empty_missingness))
<Array [[1.1, 2.2, 3.3], [], [4.4, 5.5]] type='3 * var * ?float64'>

The use of an empty, option-type array in the concatenation promotes the type to `?float64` without introducing any placeholder values that would have to be sliced off afterward.
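To make the `list_to32` point above concrete, here is a small sketch (not from the original thread; the file names are invented): write the same array with and without `list_to32=True` and compare the schemas that pyarrow reads back.

```python
import awkward as ak
import pyarrow.parquet as pq

array = ak.Array([[1.1, 2.2, 3.3], [], [4.4, 5.5]])

# Default: Awkward's 64-bit list offsets become Arrow/Parquet large_list.
ak.to_parquet(array, "events64.parquet")

# list_to32=True converts the offsets to 32-bit on the fly, giving ordinary Arrow list.
ak.to_parquet(array, "events32.parquet", list_to32=True)

# The first schema should show large_list<...>, the second list<...>.
print(pq.read_schema("events64.parquet"))
print(pq.read_schema("events32.parquet"))
```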
-
After some time, I was able to load the data in two ways using the torchdata API by creating two custom IterDataPipes. Here is my code implementing the two datapipes - one for torch_geometric and one for vanilla PyTorch.
And this is the output on my machine:
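(The code and output referred to above are not reproduced in this excerpt. As a rough, hypothetical sketch of the vanilla-PyTorch side only, with class name, file layout, and tensor shapes all assumed rather than taken from the original post, such a datapipe might look like this:)

```python
# Hypothetical sketch, not the author's actual datapipe: stream events out of
# Parquet files written with ak.to_parquet and yield one tensor per event.
import awkward as ak
import torch
from torch.utils.data import DataLoader
from torchdata.datapipes.iter import FileLister, IterDataPipe

class AwkwardEventPipe(IterDataPipe):
    """Yields each event as a float32 tensor of shape (n_particles, 4)."""

    def __init__(self, file_dp):
        super().__init__()
        self.file_dp = file_dp  # upstream datapipe that yields Parquet file paths

    def __iter__(self):
        for path in self.file_dp:
            events = ak.from_parquet(path)      # jagged array: events * var * 4
            for event in events:
                # ak.to_list gives nested Python lists that torch.tensor can consume
                yield torch.tensor(ak.to_list(event), dtype=torch.float32)

# batch_size=None because events have different lengths; batching would need
# padding or a custom collate_fn.
pipe = AwkwardEventPipe(FileLister(".", masks="*.parquet"))
loader = DataLoader(pipe, batch_size=None)
```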
-
I am looking for the most natural way of creating a PyTorch DataLoader from Awkward Arrays when the collection of those arrays does not fit into memory. The arrays I'm working with contain events with a variable number of 4-vectors.
I tried saving the files to Parquet and then using the torchdata API to load them (as shown here); however, I get an error:
NotImplementedError: Unsupported Arrow type: large_list<item: float not null>
This exception is thrown by __iter__ of ParquetDFLoaderIterDataPipe(columns=None, device='', dtype=None, source_dp=FileListerIterDataPipe, use_threads=False)
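(A minimal way to reproduce this, not taken from the original post and with invented file names, might be the following; torcharrow has to be installed for ParquetDataFrameLoader to run at all:)

```python
# Rough reproduction sketch of the failure described above.
import awkward as ak
from torchdata.datapipes.iter import FileLister, ParquetDataFrameLoader

events = ak.Array([[[1.1, 2.2, 3.3, 4.4]], [], [[5.5, 6.6, 7.7, 8.8]]])
ak.to_parquet(events, "events.parquet")   # default offsets are 64-bit -> large_list

dp = ParquetDataFrameLoader(FileLister(".", masks="*.parquet"))
next(iter(dp))   # expected to raise NotImplementedError: Unsupported Arrow type: large_list<...>
```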
The nvidia-merlin library looks like it's made for this exact purpose but there isn't a lot of documentation.