All of the documentation I have sifted through essentially references re-saving the data when the format changes, but is there a way to use this library without that? A good use case would be when you have a very large amount of data: if your data is in a supported format, you read it, some conversion occurs on the fly, and then you pass it into your pipeline. It would add to the cost of data loading, but that can be worth it if it saves terabytes of disk space.
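The on-the-fly pattern described here can be sketched independently of any particular library: wrap the source in a map-style object whose `__getitem__` performs the conversion at access time, so nothing is ever re-saved to disk. All names below are illustrative, not part of Datumaro's API:

```python
class LazyConvertingDataset:
    """Wraps any indexable source and applies `convert` per item on access."""

    def __init__(self, source, convert):
        self.source = source      # any indexable dataset (assumed)
        self.convert = convert    # per-item conversion function (assumed)

    def __len__(self):
        return len(self.source)

    def __getitem__(self, idx):
        # Conversion happens here, at load time, not ahead of time on disk.
        return self.convert(self.source[idx])


# Usage with a stand-in source; a real pipeline would hand this wrapper
# to its data loader instead of a pre-converted copy of the data.
raw = ["item-0", "item-1", "item-2"]
converted = LazyConvertingDataset(raw, convert=str.upper)
print(converted[1])       # ITEM-1
print(len(converted))     # 3
```

The trade-off is exactly the one described above: each access pays the conversion cost, in exchange for never materializing a converted copy on disk.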
I am loading a dataset like this:

```
Dataset
    ...
    subsets
        test: # of items=...
        train: # of items=...
        val: # of items=...
    infos
        ...
```
Where I would want to implement a PyTorch Lightning data module like:
```python
import lightning as L
from torch.utils.data import DataLoader

from datumaro.components.dataset import Dataset


class MyDataModule(L.LightningDataModule):
    def __init__(self, batch_size: int = 32):
        super().__init__()
        self.batch_size = batch_size

    def setup(self, stage: str):
        # Being able to load only specific subsets would be nice here too,
        # but that sounds like a large undertaking
        dataset = Dataset.import_from("./data", "yolo")
        if stage == "fit":
            self.dataset_train = dataset.get_subset("train")
            self.dataset_val = dataset.get_subset("val")
        if stage == "test":
            self.dataset_test = dataset.get_subset("test")

    def train_dataloader(self):
        return DataLoader(self.dataset_train, batch_size=self.batch_size)

    # and so on for "val" and "test"
```
Is this possible? Neither of the solutions I have tried works:

```
---> [96] assert (subset or DEFAULT_SUBSET_NAME) == (self.name or DEFAULT_SUBSET_NAME)
AssertionError:
```
The problem I am running into is that the subsets cannot be separated from the main dataset and are not treated as datasets in their own right. Could I be doing anything differently?
This is the main thing stopping me from using this really useful library in my pipeline. I can really see its potential, but it doesn't offer the specific data-loading features I am looking for (which might be on purpose). If anyone knows of a good method or tool to do this, I would love to hear about it! Thank you 😄
Hi @HalkScout. Thanks for your interest in Datumaro! 😊
The main purpose of changing the data format is to export it to disk. See our notebook on changing formats.
Regarding indexing, while the .get_subset() method's return type doesn't currently support direct indexing, you can easily convert it to a dm.Dataset using .as_dataset(). This will allow you to use standard indexing operations.
In your second snippet, make sure you're passing the correct arguments to the .get() method. The first argument should be the desired item's id (usually the image file name), and the second argument should be the actual name of the subset, which in this case is 'train'.
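Putting both suggestions together, a minimal sketch might look like the following. This assumes data on disk, so it is not runnable as-is; the path, format, and item id (`"image_0001"`) are placeholders, not values from this thread:

```python
from datumaro.components.dataset import Dataset

# Import the dataset from disk (path and format are placeholders)
dataset = Dataset.import_from("./data", "yolo")

# Convert the subset wrapper into a full dm.Dataset so that
# standard indexing operations work on it
train = dataset.get_subset("train").as_dataset()

# Fetch a single item: first argument is the item id (usually the
# image file name), second is the actual subset name
item = dataset.get("image_0001", subset="train")
```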
If you're still encountering issues, please provide more details about your specific use case, and I'll be happy to assist further.