
refactor: move tokenizer_data_utils with the rest of utils, add further unit testing. #348

Merged
merged 7 commits on Oct 8, 2024

Conversation

@willmj willmj (Collaborator) commented Sep 25, 2024

Description of the change

Move tokenizer_data_utils.py from /data to /utils with the rest of the utils.
Update imports so function calls change from tokenizer_data_utils.tokenizer_and_embedding_resize to tokenizer_and_embedding_resize.
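To illustrate the call-site change, here is a hedged before/after sketch; the module paths tuning.data and tuning.utils and the surrounding variables (special_tokens, tokenizer, model) are assumptions for illustration, not the PR's exact code:

# Before the move: module-qualified call from the old /data location (path assumed)
from tuning.data import tokenizer_data_utils

tokenizer_data_utils.tokenizer_and_embedding_resize(
    special_tokens_dict=special_tokens,
    tokenizer=tokenizer,
    model=model,
)

# After the move: direct import from the new /utils location (path assumed)
from tuning.utils.tokenizer_data_utils import tokenizer_and_embedding_resize

tokenizer_and_embedding_resize(
    special_tokens_dict=special_tokens,
    tokenizer=tokenizer,
    model=model,
)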
Add 3 unit tests:

  • Ensure adding special tokens works correctly
  • Ensure not adding special tokens doesn't modify tokenizer
  • Ensure input and output embeddings are resized properly

How to verify the PR

tox -e py

Was the PR tested

  • I have added >=1 unit test(s) for every new method I have added.
  • I have ensured all unit tests pass


Thanks for making a pull request! 😃
One of the maintainers will review and advise on the next steps.

@anhuong anhuong requested review from aluu317 and removed request for alex-jw-brooks October 3, 2024 13:57
@aluu317 aluu317 (Collaborator) commented Oct 3, 2024

Can you resolve the conflicts? I think it might need some code changes

willmj added 2 commits October 4, 2024 09:11

output_tokenizer_len = len(tokenizer.get_vocab())

assert output_tokenizer_len == input_tokenizer_len + 2
@aluu317 aluu317 (Collaborator) Oct 4, 2024


Because of the new return value of tokenizer_and_embedding_resize, I think we can test this even more thoroughly here using the returned value, which is {"num_new_tokens": num_new_tokens, "new_embedding_size": embedding_size}.
We can assign an output above:

resize_results = tokenizer_and_embedding_resize(
    special_tokens_dict=special_tokens,
    tokenizer=tokenizer,
    model=model,
    multiple_of=1,
)

and then here, in addition to checking that the difference is 2, we can do:

assert output_tokenizer_len - input_tokenizer_len == resize_results["num_new_tokens"]
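For context, a minimal self-contained sketch of what the amended test could look like with that extra check; the tiny test checkpoint and exact import path are assumptions, not the PR's actual code:

from transformers import AutoModelForCausalLM, AutoTokenizer

# Import path assumed from the PR description; adjust to the real module location.
from tuning.utils.tokenizer_data_utils import tokenizer_and_embedding_resize


def test_resize_adds_special_tokens_and_reports_them():
    # Any small causal LM works here; this checkpoint is just an example.
    model_name = "hf-internal-testing/tiny-random-gpt2"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    input_tokenizer_len = len(tokenizer.get_vocab())
    special_tokens = {"pad_token": "<pad>", "sep_token": "<sep>"}

    resize_results = tokenizer_and_embedding_resize(
        special_tokens_dict=special_tokens,
        tokenizer=tokenizer,
        model=model,
        multiple_of=1,
    )

    output_tokenizer_len = len(tokenizer.get_vocab())

    # Two special tokens were added to the vocabulary ...
    assert output_tokenizer_len == input_tokenizer_len + 2
    # ... and the returned metadata agrees with the observed growth.
    assert output_tokenizer_len - input_tokenizer_len == resize_results["num_new_tokens"]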


output_tokenizer_len = len(tokenizer.get_vocab())

assert input_tokenizer_len == output_tokenizer_len

In addition, we can do another assert here as well: resize_result["num_new_tokens"] == 0.
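Reusing the setup and variable names from the sketch above, the no-special-tokens case with that extra assertion could look roughly like this (again a sketch, not the PR's exact test):

resize_results = tokenizer_and_embedding_resize(
    special_tokens_dict={},
    tokenizer=tokenizer,
    model=model,
    multiple_of=1,
)

output_tokenizer_len = len(tokenizer.get_vocab())

# The vocabulary is unchanged when no special tokens are passed ...
assert input_tokenizer_len == output_tokenizer_len
# ... and the returned metadata reports that nothing was added.
assert resize_results["num_new_tokens"] == 0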

special_tokens_dict={}, tokenizer=tokenizer, model=model, multiple_of=8
)

assert model.get_input_embeddings().embedding_dim % 8 == 0

Do we need to make sure that the input embedding size is a multiple of 8?


Yes, we do. Please ignore my silly question.

)

assert model.get_input_embeddings().embedding_dim % 8 == 0
assert model.get_output_embeddings().out_features % 8 == 0

This is interesting. Is out_features the same as the embedding size?
We can also test here that resize_result["new_embedding_size"] % 8 == 0, though.
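For reference, in a typical Hugging Face causal LM, get_input_embeddings() returns an nn.Embedding whose num_embeddings is the vocabulary-side size (embedding_dim is the hidden size), while get_output_embeddings() returns the lm_head nn.Linear whose out_features is that same vocabulary-side size, so "embedding size" here presumably refers to that vocabulary-side dimension. A hedged sketch of the extra assertion suggested above, reusing the earlier variable names:

resize_results = tokenizer_and_embedding_resize(
    special_tokens_dict={},
    tokenizer=tokenizer,
    model=model,
    multiple_of=8,
)

# The reported embedding size should be padded up to a multiple of 8 ...
assert resize_results["new_embedding_size"] % 8 == 0
# ... and, for models whose lm_head is resized together with the input
# embedding, it should line up with both vocabulary-side dimensions.
assert model.get_input_embeddings().num_embeddings == resize_results["new_embedding_size"]
assert model.get_output_embeddings().out_features == resize_results["new_embedding_size"]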

aluu317 previously approved these changes Oct 8, 2024

@aluu317 aluu317 (Collaborator) left a comment

LGTM! Thank you

@willmj willmj enabled auto-merge (squash) October 8, 2024 20:51
@aluu317 aluu317 (Collaborator) left a comment

LGTM

@willmj willmj merged commit ee2cd66 into foundation-model-stack:main Oct 8, 2024
8 checks passed