Replies: 3 comments 8 replies
-
VAECache is the module responsible for the actual caching: https://github.com/bghira/SimpleTuner/blob/main/helpers/caching/vae.py The caching is initiated during dataset configuration: https://github.com/bghira/SimpleTuner/blob/main/helpers/data_backend/factory.py#L297 The aspect sampler is responsible for parsing bucketed data and returning valid batches of samples: https://github.com/bghira/SimpleTuner/blob/main/helpers/multiaspect/sampler.py#L313 It is also at that point that the most information is available about a sample. The image_metadata dict is gathered and provided to the collate_fn: https://github.com/bghira/SimpleTuner/blob/main/helpers/training/collate.py#L174 which further adjusts and calculates some runtime information for the samples, including whether or not dropout is applied to the caption. The trainer then consumes these samples here:
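To make the collate-time caption dropout concrete, here is a rough sketch of the idea (the function and parameter names are hypothetical illustrations, not SimpleTuner's actual API):

```python
import random

def collate_fn(batch, caption_dropout_probability=0.1):
    """Illustrative collate step: gather per-sample metadata and
    decide caption dropout at runtime (names are hypothetical)."""
    captions = []
    for image_metadata in batch:
        caption = image_metadata.get("caption", "")
        # With some probability, replace the caption with an empty
        # string so the model also learns unconditional generation.
        if random.random() < caption_dropout_probability:
            caption = ""
        captions.append(caption)
    return {"captions": captions, "metadata": batch}
```

Deciding dropout here rather than at dataset-build time means the same sample can be conditional in one epoch and unconditional in the next.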
You might be interested in the metadata scan process if you are going to be actively reading images in the dataset to collect metadata, which is an expensive operation on S3 buckets. It would allow you to keep a quick list of the properties you need for later. That is here: https://github.com/bghira/SimpleTuner/blob/main/helpers/multiaspect/bucket.py#L642
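The scan-once-and-persist pattern looks roughly like this. A minimal sketch, assuming a hypothetical `read_image_size` callback and a JSON cache file; this is not SimpleTuner's actual implementation:

```python
import json
from pathlib import Path

def scan_metadata(image_paths, read_image_size, cache_path="metadata_cache.json"):
    """Scan each image once, record the properties needed later
    (e.g. width/height), and persist them so subsequent epochs
    never re-read the objects from the storage backend."""
    cache_file = Path(cache_path)
    if cache_file.exists():
        # Cache hit: no expensive reads at all.
        return json.loads(cache_file.read_text())
    metadata = {}
    for path in image_paths:
        width, height = read_image_size(path)  # the only expensive read
        metadata[path] = {
            "width": width,
            "height": height,
            "aspect_ratio": round(width / height, 3),
        }
    cache_file.write_text(json.dumps(metadata))
    return metadata
```

On an S3 backend the first scan pays the per-object read cost once; every later epoch only deserializes the small JSON file.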
-
@zjysteven Hello! I'm just starting to look into using GLIGEN for a workflow and I found that it is only available for SD15. I found this issue thread and was wondering if you had made any headway on creating GLIGEN for SDXL? I would love to give it a try if you have. Thanks!
-
@zjysteven That's a shame. Thank you for letting me know.
-
Hi,
I'm trying to adapt GLIGEN to SDXL, which conditions on both captions and object bounding box coordinates to generate images. For this reason some customization of the dataset is needed, and I would greatly appreciate some instructions on how to achieve this (e.g., which module or part of the codebase I should be looking at).
Also, would you kindly point me to where the VAE output caching takes place in train_sdxl.py? I somehow failed to locate where it happens. Thank you in advance for your time.
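To make the conditioning format I have in mind concrete, here is a rough sketch of how the caption-plus-box pairs could be packed per sample (all names, the xyxy normalization, and the padding scheme are my own assumptions, not GLIGEN's actual API):

```python
def prepare_grounding_inputs(caption, boxes, image_width, image_height,
                             max_objects=30):
    """Pack GLIGEN-style grounding inputs: each object is a
    (phrase, box) pair with the box normalized to [0, 1] xyxy."""
    grounded = []
    for phrase, (x0, y0, x1, y1) in boxes:
        grounded.append({
            "phrase": phrase,
            "box": [x0 / image_width, y0 / image_height,
                    x1 / image_width, y1 / image_height],
        })
    # Pad with empty entries so every batch element has a fixed length
    # and can be stacked by the collate function.
    while len(grounded) < max_objects:
        grounded.append({"phrase": "", "box": [0.0, 0.0, 0.0, 0.0]})
    return {"caption": caption, "objects": grounded[:max_objects]}
```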