Curiosity about Model Choice: Swin-based vs. ViTPose with PCT #19

Open
Janus-Shiau opened this issue Aug 23, 2023 · 1 comment

Comments

@Janus-Shiau

Hello @Gengzigang and team,

The idea of representing human pose as compositional tokens (PCT) is both unique and compelling. Modeling the relationships between keypoints in such a structured manner is really inspiring.

However, I have a question regarding your model choice. I noticed that you opted for a Swin-based model for implementation. Given the current success and traction of ViTPose, I'm curious as to why you didn't choose to integrate PCT directly with ViTPose. Was there a specific reason or advantage for preferring the Swin-based model over ViTPose when incorporating PCT?

Thank you for taking the time to answer. I'm eager to delve deeper into your work and truly appreciate the effort you've put into this research. Looking forward to your insights!

Warm regards,
Jia-Yau

@Gengzigang
Owner

Hi Jia-Yau, thank you for your interest in our work. PCT and ViTPose were developed concurrently: when we started working on PCT, ViTPose had not been released yet. At that time, the Swin backbone performed well on other computer vision tasks, so it was a natural choice for our backbone.
