
refactor: move tokenizer_data_utils with the rest of utils, add further unit testing. #348

Merged
merged 7 commits on Oct 8, 2024

Conversation

@willmj willmj (Collaborator) commented Sep 25, 2024

Description of the change

Move tokenizer_data_utils.py from /data to /utils with the rest of the utils.
Update imports so function calls change from tokenizer_data_utils.tokenizer_and_embedding_resize to tokenizer_and_embedding_resize.
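To illustrate the call-site change, here is a hedged before/after sketch; the module paths tuning.data and tuning.utils and the surrounding variables (special_tokens, tokenizer, model) are assumptions for illustration, not the PR's exact code:

# Before the move: module-qualified call from the old /data location (path assumed)
from tuning.data import tokenizer_data_utils

tokenizer_data_utils.tokenizer_and_embedding_resize(
    special_tokens_dict=special_tokens,
    tokenizer=tokenizer,
    model=model,
)

# After the move: direct import from the new /utils location (path assumed)
from tuning.utils.tokenizer_data_utils import tokenizer_and_embedding_resize

tokenizer_and_embedding_resize(
    special_tokens_dict=special_tokens,
    tokenizer=tokenizer,
    model=model,
)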
Add 3 unit tests:

  • Ensure adding special tokens works correctly
  • Ensure not adding special tokens doesn't modify tokenizer
  • Ensure input and output embeddings are resized properly

How to verify the PR

tox -e py

Was the PR tested

  • I have added >=1 unit test(s) for every new method I have added.
  • I have ensured all unit tests pass


Thanks for making a pull request! 😃
One of the maintainers will review and advise on the next steps.

@anhuong anhuong requested review from aluu317 and removed request for alex-jw-brooks October 3, 2024 13:57
@aluu317 aluu317 (Collaborator) commented Oct 3, 2024

Can you resolve the conflicts? I think it might need some code changes

willmj added 2 commits October 4, 2024 09:11

output_tokenizer_len = len(tokenizer.get_vocab())

assert output_tokenizer_len == input_tokenizer_len + 2
@aluu317 aluu317 (Collaborator) Oct 4, 2024


Because of the new return value of tokenizer_and_embedding_resize, I think we can test this even more thoroughly here using the returned value, which is {"num_new_tokens": num_new_tokens, "new_embedding_size": embedding_size}.
We can assign an output above:

resize_results = tokenizer_and_embedding_resize(
    special_tokens_dict=special_tokens,
    tokenizer=tokenizer,
    model=model,
    multiple_of=1,
)

and then here, in addition to checking that the difference is 2, we can do:

assert output_tokenizer_len - input_tokenizer_len == resize_results["num_new_tokens"]
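For context, a minimal self-contained sketch of what the amended test could look like with that extra check; the tiny test checkpoint and exact import path are assumptions, not the PR's actual code:

from transformers import AutoModelForCausalLM, AutoTokenizer

# Import path assumed from the PR description; adjust to the real module location.
from tuning.utils.tokenizer_data_utils import tokenizer_and_embedding_resize


def test_resize_adds_special_tokens_and_reports_them():
    # Any small causal LM works here; this checkpoint is just an example.
    model_name = "hf-internal-testing/tiny-random-gpt2"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    input_tokenizer_len = len(tokenizer.get_vocab())
    special_tokens = {"pad_token": "<pad>", "sep_token": "<sep>"}

    resize_results = tokenizer_and_embedding_resize(
        special_tokens_dict=special_tokens,
        tokenizer=tokenizer,
        model=model,
        multiple_of=1,
    )

    output_tokenizer_len = len(tokenizer.get_vocab())

    # Two special tokens were added to the vocabulary ...
    assert output_tokenizer_len == input_tokenizer_len + 2
    # ... and the returned metadata agrees with the observed growth.
    assert output_tokenizer_len - input_tokenizer_len == resize_results["num_new_tokens"]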


output_tokenizer_len = len(tokenizer.get_vocab())

assert input_tokenizer_len == output_tokenizer_len

In addition, we can do another assert here as well: resize_result["num_new_tokens"] == 0.
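Reusing the setup and variable names from the sketch above, the no-special-tokens case with that extra assertion could look roughly like this (again a sketch, not the PR's exact test):

resize_results = tokenizer_and_embedding_resize(
    special_tokens_dict={},
    tokenizer=tokenizer,
    model=model,
    multiple_of=1,
)

output_tokenizer_len = len(tokenizer.get_vocab())

# The vocabulary is unchanged when no special tokens are passed ...
assert input_tokenizer_len == output_tokenizer_len
# ... and the returned metadata reports that nothing was added.
assert resize_results["num_new_tokens"] == 0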

special_tokens_dict={}, tokenizer=tokenizer, model=model, multiple_of=8
)

assert model.get_input_embeddings().embedding_dim % 8 == 0

Do we need to make sure that the input embedding size is a multiple of 8?


Yes, we do. Please ignore my silly question.

)

assert model.get_input_embeddings().embedding_dim % 8 == 0
assert model.get_output_embeddings().out_features % 8 == 0

This is interesting. Is out_features the same as the embedding size?
We can also test here that resize_result["new_embedding_size"] % 8 == 0, though.
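For reference, in a typical Hugging Face causal LM, get_input_embeddings() returns an nn.Embedding whose num_embeddings is the vocabulary-side size (embedding_dim is the hidden size), while get_output_embeddings() returns the lm_head nn.Linear whose out_features is that same vocabulary-side size, so "embedding size" here presumably refers to that vocabulary-side dimension. A hedged sketch of the extra assertion suggested above, reusing the earlier variable names:

resize_results = tokenizer_and_embedding_resize(
    special_tokens_dict={},
    tokenizer=tokenizer,
    model=model,
    multiple_of=8,
)

# The reported embedding size should be padded up to a multiple of 8 ...
assert resize_results["new_embedding_size"] % 8 == 0
# ... and, for models whose lm_head is resized together with the input
# embedding, it should line up with both vocabulary-side dimensions.
assert model.get_input_embeddings().num_embeddings == resize_results["new_embedding_size"]
assert model.get_output_embeddings().out_features == resize_results["new_embedding_size"]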

aluu317 previously approved these changes Oct 8, 2024

@aluu317 aluu317 (Collaborator) left a comment

LGTM! Thank you

@willmj willmj enabled auto-merge (squash) October 8, 2024 20:51
@aluu317 aluu317 (Collaborator) left a comment

LGTM

@willmj willmj merged commit ee2cd66 into foundation-model-stack:main Oct 8, 2024
8 checks passed