refactor: move tokenizer_data_utils with the rest of utils, add further unit testing. #348
Conversation
… with special tokens, resize multiple of. fmt
Signed-off-by: Will Johnson <[email protected]>
Thanks for making a pull request! 😃
Can you resolve the conflicts? I think it might need some code changes
output_tokenizer_len = len(tokenizer.get_vocab())

assert output_tokenizer_len == input_tokenizer_len + 2
Because of the new return value of tokenizer_and_embedding_resize, I think we can test it even more thoroughly here with the returned value, which is {"num_new_tokens": num_new_tokens, "new_embedding_size": embedding_size}.

We can assign an output above:

resize_results = tokenizer_and_embedding_resize(
    special_tokens_dict=special_tokens,
    tokenizer=tokenizer,
    model=model,
    multiple_of=1,
)

and then here, in addition to checking that it's 2, we can do:

assert output_tokenizer_len - input_tokenizer_len == resize_results["num_new_tokens"]
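To make the suggestion concrete, here is a minimal sketch of how the pieces could fit together in the test; the tiny model name, the tuning.utils import path, and the two-token special-tokens dict are assumptions for illustration, not necessarily what the PR uses:

from transformers import AutoModelForCausalLM, AutoTokenizer
from tuning.utils.tokenizer_data_utils import tokenizer_and_embedding_resize  # assumed post-move path

MODEL_NAME = "hf-internal-testing/tiny-random-gpt2"  # assumed tiny model for illustration

def test_resize_with_special_tokens():
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
    special_tokens = {"pad_token": "<pad>", "sep_token": "<sep>"}  # adds 2 new tokens
    input_tokenizer_len = len(tokenizer.get_vocab())
    resize_results = tokenizer_and_embedding_resize(
        special_tokens_dict=special_tokens,
        tokenizer=tokenizer,
        model=model,
        multiple_of=1,
    )
    output_tokenizer_len = len(tokenizer.get_vocab())
    # Original check, plus the suggested check against the returned dict.
    assert output_tokenizer_len == input_tokenizer_len + 2
    assert output_tokenizer_len - input_tokenizer_len == resize_results["num_new_tokens"]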
output_tokenizer_len = len(tokenizer.get_vocab())

assert input_tokenizer_len == output_tokenizer_len
In addition, we can do another assert here as well: resize_result["num_new_tokens"] == 0
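For this no-new-tokens case, a matching sketch (same assumptions about the model name and import path as above) could be:

from transformers import AutoModelForCausalLM, AutoTokenizer
from tuning.utils.tokenizer_data_utils import tokenizer_and_embedding_resize  # assumed post-move path

MODEL_NAME = "hf-internal-testing/tiny-random-gpt2"  # assumed tiny model for illustration

def test_resize_with_no_special_tokens():
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
    input_tokenizer_len = len(tokenizer.get_vocab())
    resize_result = tokenizer_and_embedding_resize(
        special_tokens_dict={},  # nothing to add
        tokenizer=tokenizer,
        model=model,
        multiple_of=1,
    )
    output_tokenizer_len = len(tokenizer.get_vocab())
    assert input_tokenizer_len == output_tokenizer_len
    # Suggested extra check: the returned dict should report no new tokens.
    assert resize_result["num_new_tokens"] == 0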
special_tokens_dict={}, tokenizer=tokenizer, model=model, multiple_of=8
)

assert model.get_input_embeddings().embedding_dim % 8 == 0
Do we need to make sure that the input embeddings are a multiple of 8?
Yes, we do. Please ignore my silly question.
)

assert model.get_input_embeddings().embedding_dim % 8 == 0
assert model.get_output_embeddings().out_features % 8 == 0
This is interesting, is out_features the same as embedding size? We can test here as well that resize_result["new_embedding_size"] % 8 == 0, though.
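A sketch of the multiple_of=8 case with the extra check on the returned embedding size (same caveats: the model name and import path are assumptions):

from transformers import AutoModelForCausalLM, AutoTokenizer
from tuning.utils.tokenizer_data_utils import tokenizer_and_embedding_resize  # assumed post-move path

MODEL_NAME = "hf-internal-testing/tiny-random-gpt2"  # assumed tiny model for illustration

def test_resize_to_multiple_of_8():
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
    resize_result = tokenizer_and_embedding_resize(
        special_tokens_dict={}, tokenizer=tokenizer, model=model, multiple_of=8
    )
    # Existing check on the output embedding layer (vocab dimension padded to a multiple of 8).
    assert model.get_output_embeddings().out_features % 8 == 0
    # Suggested additional check on the returned value.
    assert resize_result["new_embedding_size"] % 8 == 0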
LGTM! Thank you
LGTM
Description of the change

- Move tokenizer_data_utils.py from /data to /utils with the rest of the utils.
- Update imports so function calls change from tokenizer_data_utils.tokenizer_and_embedding_resize to tokenizer_and_embedding_resize (see the sketch below).
- Add 3 unit tests:
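A minimal sketch of the import update described above; the tuning.data and tuning.utils module prefixes are assumptions for illustration, since the description only states the /data to /utils move:

# Old call style (module lived under /data; "tuning" package prefix is assumed):
#   from tuning.data import tokenizer_data_utils
#   tokenizer_data_utils.tokenizer_and_embedding_resize(special_tokens_dict=..., tokenizer=..., model=...)

# New call style: import the function from utils and call it directly.
from tuning.utils.tokenizer_data_utils import tokenizer_and_embedding_resize

def resize(tokenizer, model, special_tokens):
    # Illustrative wrapper showing the direct call; argument names follow the test sketches above.
    return tokenizer_and_embedding_resize(
        special_tokens_dict=special_tokens, tokenizer=tokenizer, model=model
    )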
How to verify the PR
tox -e py
Was the PR tested