Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor: Removal of pad_token when padding_free is set #368

Conversation

Abhishek-TAMU
Copy link
Collaborator

@Abhishek-TAMU Abhishek-TAMU commented Oct 8, 2024

Description of the change

1- When padding_free is set, pad_token is not assigned to tokenizer or special_tokens_dict.

2- Removal of pad_token addition to GPT2Tokenizer and GPTNeoXTokenizerFast here as
this is redundant and already covered in the later condition here.

Discussion Slack Thread

Related issue number

#1336

How to verify the PR

Padding free tuning of granite-13b-base-v2 (Tokenizer: GPTNeoXTokenizerFast) PVC path: ibm-granite-pvc/granite-13b-base-v2/step_300000_ckpt

Was the PR tested

  • I have added >=1 unit test(s) for every new method I have added.
  • I have ensured all unit tests pass

Copy link

github-actions bot commented Oct 8, 2024

Thanks for making a pull request! 😃
One of the maintainers will review and advise on the next steps.

1 similar comment
Copy link

github-actions bot commented Oct 8, 2024

Thanks for making a pull request! 😃
One of the maintainers will review and advise on the next steps.

@Abhishek-TAMU Abhishek-TAMU changed the title Refactor: Removal of pad_token when padding_free is set refactor: Removal of pad_token when padding_free is set Oct 8, 2024
Copy link
Collaborator

@Ssukriti Ssukriti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think its worth adding a really simple unit test

in test you can first initialize tokenzier.pad_token=None

at end of training, now train() returns num_added_tokens , you can either check if thats 0 if padding free , 1 if not padding free

or you can check tokenizer artifacts after training

@@ -288,7 +287,9 @@ def train(
"PAD token set to default, to make it different from eos token"
)
if tokenizer.eos_token != configs.DEFAULT_PAD_TOKEN:
tokenizer.pad_token = configs.DEFAULT_PAD_TOKEN
tokenizer.pad_token = (
configs.DEFAULT_PAD_TOKEN if not padding_free else None
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
configs.DEFAULT_PAD_TOKEN if not padding_free else None
configs.DEFAULT_PAD_TOKEN if not padding_free else None

I dont think we need the else condition to explicitly reset it as None. We'll leave it as it was if padding_free, we just wont reset pad to a unique token . We don't want to add a new pad token / change existing pad token if padding_free basically.

@Ssukriti
Copy link
Collaborator

@fabianlim does this change look good to you functionality wise?

@fabianlim
Copy link
Collaborator

@Ssukriti personally im not sure if its a good idea to remove padding token. Because padding tokens have been so entrenced in HF legacy code that it is hard to predict what goes wrong

For example, I could use padding free, but I provide a chat_template is provided that is written with a pad_token. Removing it from the tokenizer could produce unexpected results

What is the strong motivation for simply removing the pad token from the tokenizer, even if padding free is enabled? If there is no performance impact why take uncessary risk?

Copy link
Collaborator

@fabianlim fabianlim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left a comment. please take a look. im not sure whats the motivation behind this change.

@Ssukriti
Copy link
Collaborator

Ssukriti commented Oct 10, 2024

Thank you @fabianlim .

  1. We aren't removing pad token if its set in tokenizer and if padding free. We just wont additionally add it , if not set in tokenizer and if 'padding free' is passed. As we want to avoid adding new tokens if not needed by tuning. We couldn't come up with a use case as to why to add pad token if not set (with padding free)
    if tokenzier.pad_token = None , we wont add it and leave it as None.
    if tokenzier.pad_token != None, it will remain set and we will not modify it

Reason being , if we dont add tokens, it will skip the post-processing additional step needed for inferring on vLLM. This will simplify the pipeline a bit.

Not a dealbreaker though as the post-processing code is tested well. If you feel in future we will add templates that will need , we can wait on merging this PR till it is more clear.

@kmehant
Copy link
Collaborator

kmehant commented Oct 10, 2024

I would second on what @fabianlim said. If the postprocessing step is already there and and not a huge overhead. Its good to not remove pad token case by case and keep it there for all the usecases, since it might only complicate in the future feature additions like chat templates etc. and needs rethinking for every usecase we would need to support.

@willmj willmj closed this Oct 23, 2024
@Abhishek-TAMU Abhishek-TAMU deleted the tokenization_cleanup branch November 26, 2024 19:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants