refactor: Removal of pad_token when padding_free is set #368
Conversation
Thanks for making a pull request! 😃
I think it's worth adding a really simple unit test.
In the test you can first initialize tokenizer.pad_token = None.
At the end of training, train() now returns num_added_tokens; you can check that it is 0 if padding_free and 1 if not padding_free,
or you can check the tokenizer artifacts after training.
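A minimal sketch of the suggested test, assuming train() can be imported from the training module and returns the number of added special tokens as described above; the import path, the train() keyword arguments, and the tiny model used here are placeholders rather than the repo's exact API.

```python
# Sketch of the suggested unit test; import path, train() signature, and the
# model used are illustrative assumptions, not the repo's actual API.
from transformers import AutoTokenizer

from tuning.sft_trainer import train  # hypothetical import path


def run_tiny_training(padding_free: bool) -> int:
    """Run a small training job starting from a tokenizer with no pad token
    and return the number of special tokens that training added."""
    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    tokenizer.pad_token = None  # start without a pad token

    # Hypothetical keyword arguments; the real train() takes config objects.
    return train(tokenizer=tokenizer, padding_free=padding_free)


def test_num_added_tokens_respects_padding_free():
    # No pad token should be added when padding_free is enabled ...
    assert run_tiny_training(padding_free=True) == 0
    # ... and exactly one (the pad token) should be added otherwise.
    assert run_tiny_training(padding_free=False) == 1
```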
```diff
@@ -288,7 +287,9 @@ def train(
             "PAD token set to default, to make it different from eos token"
         )
         if tokenizer.eos_token != configs.DEFAULT_PAD_TOKEN:
-            tokenizer.pad_token = configs.DEFAULT_PAD_TOKEN
+            tokenizer.pad_token = (
+                configs.DEFAULT_PAD_TOKEN if not padding_free else None
```
I don't think we need the else condition to explicitly reset it to None. If padding_free, we'll leave it as it was; we just won't reset pad to a unique token. Basically, we don't want to add a new pad token or change the existing pad token if padding_free.
@fabianlim does this change look good to you functionality-wise?
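A minimal sketch of the behaviour the comment above seems to suggest, pulled out into a hypothetical helper (the helper name and signature are illustrative, not the repo's API): guard the assignment on padding_free instead of resetting pad_token to None.

```python
def maybe_set_default_pad_token(tokenizer, default_pad_token: str, padding_free: bool) -> None:
    """Assign the default pad token only when padding is actually used;
    when padding_free is set, leave the tokenizer's pad_token untouched."""
    if padding_free:
        return  # don't add a new pad token or change an existing one
    if tokenizer.eos_token != default_pad_token:
        tokenizer.pad_token = default_pad_token
```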
@Ssukriti personally I'm not sure it's a good idea to remove the padding token, because padding tokens have been so entrenched in HF legacy code that it is hard to predict what could go wrong. For example, I could use padding free but provide a chat_template that is written with a pad_token; removing it from the tokenizer could produce unexpected results. What is the strong motivation for simply removing the pad token from the tokenizer, even if padding free is enabled? If there is no performance impact, why take unnecessary risk?
I left a comment, please take a look. I'm not sure what the motivation behind this change is.
Thank you @fabianlim .
The reason being: if we don't add tokens, it will skip the additional post-processing step needed for inference on vLLM. This will simplify the pipeline a bit. Not a dealbreaker though, as the post-processing code is well tested. If you feel that in the future we will add templates that will need a pad token, we can wait on merging this PR until it is more clear.
I would second what @fabianlim said. If the post-processing step is already there and is not a huge overhead, it is good not to remove the pad token case by case and to keep it there for all use cases, since removing it might only complicate future feature additions like chat templates etc. and would need rethinking for every use case we need to support.
Description of the change
1- When `padding_free` is set, `pad_token` is not assigned to `tokenizer` or `special_tokens_dict`.
2- Removal of the `pad_token` addition to `GPT2Tokenizer` and `GPTNeoXTokenizerFast` here, as this is redundant and already covered in the later condition here.
Discussion Slack Thread
Related issue number
#1336
How to verify the PR
Padding free tuning of `granite-13b-base-v2` (Tokenizer: `GPTNeoXTokenizerFast`). PVC path: `ibm-granite-pvc/granite-13b-base-v2/step_300000_ckpt`
Was the PR tested