
Retraining SAM #1

Open
alexcbb opened this issue Jul 17, 2023 · 2 comments

Comments

@alexcbb

alexcbb commented Jul 17, 2023

Hello, thank you for your very interesting work.

I was wondering about the training of SAM. I went through your code and saw that you use classical Data Parallelism for training; did you encounter any memory issues during training? I also checked the config file you use: did you retrain the whole model with a batch size of only 2, and did you only test the SAM-b version? If so, how long did the training take on the 4 A40 GPUs?

Thank you in advance for your answer!

@JasonQSY
Owner

JasonQSY commented Jul 17, 2023

Thanks for your interest in our work. To clarify:

> you use classical Data Parallelism for training; did you encounter any memory issues during training?

Yes, we only use DDP for training. The code also adds and tests support for mixed-precision training, but it's not necessary when training on A40s: https://github.com/JasonQSY/3DOI/blob/main/monoarti/configs/sam.yaml#L23
It's doable, but we do reduce the batch size and use vit_b as the backbone to fit the model into GPU memory.
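
For reference, here is a minimal sketch of what a DDP + mixed-precision training step looks like in plain PyTorch. The model, data, and the `use_amp` flag are placeholders, not our actual training code:

```python
# Minimal sketch of per-process DDP training with optional mixed precision.
# Placeholder model/data; NOT the actual 3DOI training loop.
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
    dist.init_process_group("nccl")
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 10).cuda(local_rank)  # stand-in for the real model
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    use_amp = True  # analogous to a mixed-precision flag in the config
    scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

    for _ in range(100):  # stand-in for iterating a DistributedSampler loader
        x = torch.randn(2, 1024, device=local_rank)  # batch size 2 per GPU
        y = torch.randint(0, 10, (2,), device=local_rank)
        optimizer.zero_grad(set_to_none=True)
        with torch.cuda.amp.autocast(enabled=use_amp):
            loss = torch.nn.functional.cross_entropy(model(x), y)
        scaler.scale(loss).backward()  # scaled backward avoids fp16 underflow
        scaler.step(optimizer)
        scaler.update()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

On a 4-GPU machine this would be launched with something like `torchrun --nproc_per_node=4 train.py`.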

> did you retrain the whole model with a batch size of only 2, and did you only test the SAM-b version?

Yes. Other backbones need more GPU memory; vit_h uses too much under our current DDP setup. I tried it and realized it needs more tricks to save GPU memory (such as DeepSpeed or FSDP). To keep things simple, I just use vit_b.
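
For a rough idea of what such a trick looks like, here is a minimal FSDP sketch in plain PyTorch (placeholder model, not part of this repo); DeepSpeed ZeRO achieves a similar memory-sharding effect:

```python
# Minimal sketch of sharding a larger backbone with PyTorch FSDP.
# Placeholder model; not part of the 3DOI codebase.
import os

import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP


def main():
    local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
    dist.init_process_group("nccl")
    torch.cuda.set_device(local_rank)

    # Stand-in for a large (vit_h-scale) backbone.
    big_model = torch.nn.Sequential(
        *[torch.nn.Linear(4096, 4096) for _ in range(8)]
    ).cuda(local_rank)

    # FSDP shards parameters, gradients, and optimizer state across ranks,
    # so per-GPU memory drops roughly with the number of GPUs.
    model = FSDP(big_model)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    x = torch.randn(2, 4096, device=local_rank)
    loss = model(x).sum()
    loss.backward()
    optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```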

> how long did the training take on the 4 A40 GPUs?

I don't remember the exact time, so here is a rough estimate: for SAM, it takes approximately 36 hours to train for 200 epochs.

Please let me know if you'd like more implementation details and I can help.

@alexcbb
Author

alexcbb commented Jul 18, 2023

Thank you very much for your clarification, it was all I needed!
