Tokens to be integers #38

Open
Tommy-Hsu opened this issue Jun 1, 2024 · 4 comments

Comments

@Tommy-Hsu

Hello, I would like to ask about the meaning of the tokens being integers. I noticed that the final forward pass through the tokenizer involves the `cls_logits_softmax` tensor, which is directly matrix-multiplied with the codebook. However, these operations are all floating-point. So what does it mean for the tokens to be integers in the classifier stage?
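For context, here is a minimal sketch of the floating-point path I am describing; the shapes and the codebook below are illustrative assumptions, not values taken from the repo:

```python
import torch

# Assumed shapes for illustration only.
num_tokens, num_classes, dim = 34, 2048, 64

cls_logits = torch.randn(1, num_tokens, num_classes)  # classifier head output
codebook = torch.randn(num_classes, dim)              # learned codebook entries

# Softmax over the class dimension, then a matmul with the codebook:
# each token embedding becomes a weighted average of codebook entries,
# so everything stays floating-point end to end.
cls_logits_softmax = cls_logits.softmax(dim=-1)
token_embeddings = cls_logits_softmax @ codebook      # (1, num_tokens, dim), float
```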

@ndsl555

ndsl555 commented Jun 24, 2024

I'd like to know too

@KunmingS

It seems like the integer tokens only appear during Stage I training. I think it is the variable `encoding_indices` on this line: https://github.com/Gengzigang/PCT/blob/main/models/pct_tokenizer.py#L142
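As a rough sketch (assumed shapes, not the repo's actual code), a standard VQ-style nearest-neighbour lookup produces integer indices like this:

```python
import torch

encode_feat = torch.randn(34, 64)   # flattened encoder features (assumed shape)
codebook = torch.randn(2048, 64)    # codebook entries (assumed shape)

# Squared L2 distance from every feature vector to every codebook entry.
distances = torch.cdist(encode_feat, codebook) ** 2

# argmin over the codebook axis returns integer indices -- the Stage I tokens.
encoding_indices = torch.argmin(distances, dim=1)   # dtype: torch.int64
quantized = codebook[encoding_indices]              # hard lookup, float values
```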

@Tommy-Hsu
Author

> It seems like the integer tokens only appear during Stage I training. I think it is the variable `encoding_indices` on this line: https://github.com/Gengzigang/PCT/blob/main/models/pct_tokenizer.py#L142

That's true. In Stage I, `encoding_indices` would be integers, but not in Stage II.
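Putting the two paths side by side makes the dtype difference explicit (same assumed shapes as above):

```python
import torch

codebook = torch.randn(2048, 64)    # assumed codebook shape

# Stage I: nearest-neighbour lookup yields integer indices.
feats = torch.randn(34, 64)
stage1_tokens = torch.cdist(feats, codebook).argmin(dim=1)

# Stage II: the softmax-weighted combination stays floating-point.
logits = torch.randn(34, 2048)
stage2_out = logits.softmax(dim=-1) @ codebook

print(stage1_tokens.dtype)   # torch.int64
print(stage2_out.dtype)      # torch.float32
```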

@Tommy-Hsu
Author

[image: Figure 1 of the paper]

Figure 1 is quite confusing to me. In the inference stage, the class head output should be logits, and the codebook entries are floating-point data. However, the figure shows both as integers, which is what I find puzzling.
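To make the mismatch concrete, one reading that would reconcile the figure with the code (purely my assumption, not a confirmed explanation) is that the figure depicts a hard argmax over the logits, whereas the code implements the soft path:

```python
import torch

logits = torch.randn(34, 2048)      # assumed class head output
codebook = torch.randn(2048, 64)    # assumed codebook

# Hard reading (matches the integers drawn in the figure):
hard_tokens = logits.argmax(dim=-1)              # integer token indices
decoded = codebook[hard_tokens]                  # float embeddings after lookup

# Soft reading (matches the code): no integers ever appear.
soft_decoded = logits.softmax(dim=-1) @ codebook
```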
