NLP - Iterable dataset that shuffles documents but batches text segments properly to account for dependencies? #5657

LarsHill · 2023-03-21T17:28:00Z

LarsHill
Mar 21, 2023

Given I have a dataset with 10 documents and each document contains between 3 and 10 text segments, e.g.

['1_0', '1_1', '1_2', '1_3', '1_4', '1_5', '1_6', '1_7']
['2_0', '2_1', '2_2', '2_3', '2_4', '2_5', '2_6', '2_7', '2_8']
['3_0', '3_1', '3_2', '3_3', '3_4', '3_5', '3_6', '3_7', '3_8', '3_9']
['4_0', '4_1', '4_2', '4_3', '4_4', '4_5']
['5_0', '5_1', '5_2', '5_3', '5_4', '5_5', '5_6', '5_7', '5_8', '5_9']
['6_0', '6_1', '6_2', '6_3', '6_4', '6_5', '6_6', '6_7', '6_8', '6_9']
['7_0', '7_1', '7_2', '7_3']
['8_0', '8_1', '8_2']
['9_0', '9_1', '9_2', '9_3', '9_4', '9_5', '9_6']
['10_0', '10_1', '10_2', '10_3', '10_4', '10_5', '10_6', '10_7', '10_8']

A text segment can be for example a paragraph.
Is there an easy way to create a dataset from this file that supports document aware batching, given that subsequent text segments of the same document should occur in subsequent samples in the same batch?
So for the above example and with a batch size of 4 I would expect an output like this:

(['3_0', '5_0', '6_0', '2_0'])
(['3_1', '5_1', '6_1', '2_1'])
(['3_2', '5_2', '6_2', '2_2'])
(['3_3', '5_3', '6_3', '2_3'])
(['3_4', '5_4', '6_4', '2_4'])
(['3_5', '5_5', '6_5', '2_5'])
(['3_6', '5_6', '6_6', '2_6'])
(['3_7', '5_7', '6_7', '2_7'])
(['3_8', '5_8', '6_8', '2_8'])
(['3_9', '5_9', '6_9', '10_0'])
(['1_0', '9_0', '4_0', '10_1'])
(['1_1', '9_1', '4_1', '10_2'])
(['1_2', '9_2', '4_2', '10_3'])
(['1_3', '9_3', '4_3', '10_4'])
(['1_4', '9_4', '4_4', '10_5'])
(['1_5', '9_5', '4_5', '10_6'])
(['1_6', '9_6', '7_0', '10_7'])
(['1_7', '8_0', '7_1', '10_8'])
([None, '8_1', '7_2', None])
([None, '8_2', '7_3', None])

As you can see shuffling would be done on a document level and all documents are more or less equally distributed over all batches so that the amount of padding with None segments is minimized. With such a dataset/dataloader one could model dependencies between samples per batch, given the previous segment is from the same document.
Would it be possible to create something like this with the current datasets package? Or is it always assumed that all samples are completely independent?
If it is not supported, do you think it makes sense to integrate such functionality?
I would assume something like this would benefit in training models like transformer-xl where some dependencies between text segments are expected.

mariosasko · 2023-03-22T18:39:04Z

mariosasko
Mar 22, 2023
Collaborator

Hi!

You can use .shuffle to shuffle the dataset, but the batching logic needs to be implemented manually.

I find this batching procedure too specific to be integrated into datasets though.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

NLP - Iterable dataset that shuffles documents but batches text segments properly to account for dependencies? #5657

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 1 comment

{{title}}

Select a reply

NLP - Iterable dataset that shuffles documents but batches text segments properly to account for dependencies? #5657

LarsHill Mar 21, 2023

Replies: 1 comment

mariosasko Mar 22, 2023 Collaborator

LarsHill
Mar 21, 2023

mariosasko
Mar 22, 2023
Collaborator