Splitting test/train/val and representative datasets, and convert to tfrecords #1510

Closed
therealpurplemana opened this issue May 28, 2024 · 2 comments

therealpurplemana commented May 28, 2024

Hi, thanks for your great project. I'm using it to export data from cvat.ai, manipulate, and re-export into Tensorflow format.

In my specific case, I'm combining homogeneous datasets by adding sources to a project which I exported from cvat.ai (so I can prune out incompletely labeled datasets). Then I run:

!datum transform --project ./tfdata -t split -- -t detection --subset train:.7 --subset val:.15 --subset test:.15

After that, I export it with:
!datum project export -p ./tfdata --format tf_detection_api -o ./final-export-tf_detection_api-detection -- --save-media
(adding --save-masks for a segmentation export)

This produces a new folder with /annotations and /images subfolders: /annotations holds train.json, test.json, and val.json, and /images holds the corresponding train/test/val data, nicely packaged as TFRecords. Oddly, there's also a default.tfrecord, but it was pretty small so I just deleted it.

Now, I also need a 20% representative dataset from my original dataset -- how do I "undo" the splits in my project? Or am I thinking about this incorrectly?

Currently, I need to delete the project, recreate it, re-add my sources, re-split into 20/80%, and then export again, and copy over the TFRecord.

Curious if there's an easier way to do this either through CLI or Python.

jihyeonyi (Contributor) commented May 30, 2024

Hi @therealpurplemana, thank you for your interest in our project.
Datumaro offers a version control feature, but it requires commits of the project.
Alternatively, you could combine all subsets into a single dataset and then re-split them as needed.
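
To make the "combine, then re-split" idea concrete, here is a minimal, library-free sketch. It does not use the Datumaro API: `merge_subsets` and `resplit` are hypothetical helper names, and items are modeled as plain `(item_id, subset)` pairs purely for illustration. It pools every item back into a single `default` subset and then draws a seeded random 20/80 split. Note that this is a plain random split, not a task-aware one, so it only approximates a "representative" sample.

```python
import random


def merge_subsets(items):
    """Pool items from every subset back into one 'default' subset.

    `items` is a list of (item_id, subset_name) pairs (an illustrative
    stand-in for dataset items, not a Datumaro type).
    """
    return [(item_id, "default") for item_id, _ in items]


def resplit(items, ratios, seed=0):
    """Randomly reassign pooled items to new subsets.

    `ratios` is a list of (subset_name, fraction) pairs whose fractions
    must sum to 1. A fixed seed keeps the split reproducible.
    """
    if abs(sum(fraction for _, fraction in ratios) - 1.0) > 1e-9:
        raise ValueError("split fractions must sum to 1")

    # Sort first so the result depends only on the seed, not input order.
    ids = sorted(item_id for item_id, _ in items)
    random.Random(seed).shuffle(ids)

    result = []
    start = 0
    cumulative = 0.0
    for index, (name, fraction) in enumerate(ratios):
        cumulative += fraction
        # The last subset takes whatever remains, avoiding rounding gaps.
        if index == len(ratios) - 1:
            end = len(ids)
        else:
            end = round(cumulative * len(ids))
        result.extend((item_id, name) for item_id in ids[start:end])
        start = end
    return result


# Example: 100 items previously split 70/15/15, re-split into 20/80.
previous = (
    [(f"img_{i:03d}", "train") for i in range(70)]
    + [(f"img_{i:03d}", "val") for i in range(70, 85)]
    + [(f"img_{i:03d}", "test") for i in range(85, 100)]
)
pooled = merge_subsets(previous)
new_split = resplit(pooled, [("representative", 0.2), ("rest", 0.8)], seed=42)
```

Because the earlier subset names are discarded before re-splitting, the new split is independent of the old train/val/test assignment, which is the effect of "undoing" the split that the question asks about.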

therealpurplemana (Author) commented

Thank you.
