Hi! I find this batching procedure too specific to be integrated into `datasets`.
Suppose I have a dataset with 10 documents, where each document contains between 3 and 10 text segments.
A text segment could be, for example, a paragraph.
Is there an easy way to create a dataset from this file that supports document-aware batching, i.e. such that subsequent text segments of the same document occur in subsequent samples of the same batch?
So, for the above example with a batch size of 4, I would expect an output like this:
As you can see, shuffling would be done at the document level, and all documents would be more or less equally distributed over the batches so that the amount of padding with None segments is minimized. With such a dataset/dataloader, one could model dependencies between samples within a batch, since the previous segment comes from the same document.
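To make the request concrete, here is a minimal sketch in plain Python (no `datasets` involved; `document_aware_batches` is just an illustrative name, not an existing API) of one batching scheme along these lines: documents are shuffled as whole units, their segments are concatenated in order, and the resulting stream is cut into fixed-size batches, with only the tail of the last batch padded with None.

```python
import random

def document_aware_batches(documents, batch_size, seed=0):
    """Yield batches in which consecutive segments of the same document
    occupy consecutive slots, with shuffling done at the document level.

    `documents` is a list of lists of segments. Each batch element is a
    (doc_id, segment) pair, so a model can check whether the previous
    sample in the batch belongs to the same document; unused slots of
    the final batch are padded with None.
    """
    rng = random.Random(seed)
    order = list(range(len(documents)))
    rng.shuffle(order)  # shuffle whole documents, never individual segments

    # Concatenate the shuffled documents into one stream of samples.
    stream = [(doc_id, seg) for doc_id in order for seg in documents[doc_id]]

    # Cut the stream into fixed-size batches, padding only the last one.
    for i in range(0, len(stream), batch_size):
        batch = stream[i:i + batch_size]
        batch += [None] * (batch_size - len(batch))
        yield batch

# Hypothetical toy data: 3 documents with 3, 4, and 3 segments.
docs = [[f"doc{d}-seg{s}" for s in range(n)] for d, n in enumerate([3, 4, 3])]
for batch in document_aware_batches(docs, batch_size=4, seed=1):
    print(batch)
```

This keeps segment order intact within each document while still randomizing across epochs via the seed; whether it matches the exact layout intended above (e.g. how padding is spread over batches) is an assumption on my part.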
Would it be possible to create something like this with the current `datasets` package? Or is it always assumed that all samples are completely independent? If it is not supported, do you think it makes sense to integrate such functionality?
I would assume something like this would benefit training models like transformer-xl, where some dependencies between text segments are expected.