Make tokenize tests readable #1868

krammnic · 2024-10-19T18:55:11Z

Context

What is the purpose of this PR? Is it to

add a new feature
fix a bug
update tests and/or documentation
clean up

Please link to any issues this PR addresses.

Changelog

What are the changes made in this PR?

Tokenizer tests should be refactored (all models) #1823

Test plan

Please make sure to do each of the following if applicable to your PR. If you're unsure about any one of these just ask and we will happily help. We also have a contributing page for some guidance on contributing.

run pre-commit hooks and linters (make sure you've first installed via pre-commit install)
add unit tests for any new functionality
update docstrings for any new or updated methods or classes
run unit tests via pytest tests
run recipe tests via pytest tests -m integration_test
manually run any new or modified recipes with sufficient proof of correctness
include relevant commands and any other artifacts in this summary (pastes of loss curves, eval results, etc.)

UX

If your function changed a public API, please add a dummy example of what the user experience will look like when calling it.
Here is a docstring example
and a tutorial example

I did not change any public API
I have added an example to docs or docstrings

This will require changes in CI (running pre-commit makes the expected_tokens lists unreadable).

pytorch-bot · 2024-10-19T18:55:14Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1868

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❗ 2 Active SEVs

There are 2 currently active SEVs. If your PR is affected, please view them below:

✅ No Failures

As of commit cc60b25 with merge base d5c54f3:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

krammnic · 2024-10-19T18:56:21Z

cc: @RdoubleA @joecummings What do you think? With the current lint formatting, working with these tests is really awful. This is a pretty minor fix.

codecov-commenter · 2024-10-20T15:46:41Z

Codecov Report

Attention: Patch coverage is 50.00000% with 15 lines in your changes missing coverage. Please review.

Project coverage is 24.77%. Comparing base (d0aa871) to head (cc60b25).
Report is 1 commits behind head on main.

Files with missing lines	Patch %	Lines
...s/torchtune/models/llama3/test_llama3_tokenizer.py	0.00%	6 Missing ⚠️
...s/torchtune/models/llama2/test_llama2_tokenizer.py	60.00%	2 Missing ⚠️
...torchtune/models/mistral/test_mistral_tokenizer.py	66.66%	2 Missing ⚠️
tests/torchtune/models/phi3/test_phi3_tokenizer.py	66.66%	2 Missing ⚠️
...sts/torchtune/models/qwen2/test_qwen2_tokenizer.py	0.00%	2 Missing ⚠️
...sts/torchtune/models/gemma/test_gemma_tokenizer.py	80.00%	1 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff             @@
##             main    #1868       +/-   ##
===========================================
- Coverage   67.27%   24.77%   -42.50%     
===========================================
  Files         318      318               
  Lines       17648    17633       -15     
===========================================
- Hits        11873     4369     -7504     
- Misses       5775    13264     +7489

krammnic · 2024-10-20T15:46:45Z

The lint CI should be changed at this point; otherwise the formatting of expected_tokens will still be really bad.

joecummings

Looks good - I will look into the linter issue. My naive assumption was that noqa should work.

Couple questions about possibly unintended formatting errors

joecummings · 2024-10-21T13:40:04Z

tests/torchtune/models/gemma/test_gemma_tokenizer.py

        messages = [
            Message(
                role="user",
                content="Below is an instruction that describes a task. Write a response "
-                "that appropriately completes the request.\n\n### Instruction:\nGenerate "
-                "a realistic dating profile bio.\n\n### Response:\n",
+                        "that appropriately completes the request.\n\n### Instruction:\nGenerate "


What are these changes?

Oh, something probably changed because of my local linter changes. Will fix.

joecummings · 2024-10-21T13:40:30Z

tests/torchtune/models/llama3/test_llama3_tokenizer.py

@@ -311,21 +147,17 @@ def test_tokenizer_vocab_size(self, tokenizer):
        assert tokenizer.vocab_size == 128257

    def test_tokenize_text_messages(
-        self, tokenizer, user_text_message, assistant_text_message
+            self, tokenizer, user_text_message, assistant_text_message


Same here.

krammnic · 2024-10-21T15:36:49Z

I assume that's fixed.

joecummings · 2024-10-21T15:47:44Z

I assume that's fixed.

Grr still failing. Mind if I take a look?

krammnic · 2024-10-21T15:53:45Z

Isn't it about lines with # noqa?

krammnic · 2024-10-21T16:01:23Z

Ah, I see:
tests/torchtune/models/llama3/test_llama3_tokenizer.py:226:183: B950 line too long (182 > 120 characters)
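The kind of line the message above points at can be located with a small script. This is only a sketch under stated assumptions: `find_long_lines` is a hypothetical helper, the 120-character limit is taken from the message, and the check is a simplification rather than how flake8 or its B950 plugin actually work.

```python
def find_long_lines(source: str, limit: int = 120) -> list[tuple[int, int]]:
    """Return (line_number, length) for lines longer than `limit`.

    Lines carrying a `# noqa` marker are skipped, loosely mimicking how a
    linter would ignore them. Hypothetical helper, not part of any linter.
    """
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        if len(line) > limit and "# noqa" not in line:
            hits.append((lineno, len(line)))
    return hits


# Illustrative input: one short line, one long line, one long line with `# noqa`.
sample = "\n".join(
    [
        "short = [1, 2, 3]",
        "long_unmarked = [" + ", ".join(["1000"] * 40) + "]",
        "long_marked = [" + ", ".join(["1000"] * 40) + "]  # noqa",
    ]
)

# Only the unmarked long line is reported.
print(find_long_lines(sample))
```

Running something like this over a test file before pushing would show which long expected_tokens lines still lack a suppression marker.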

krammnic · 2024-10-21T16:03:12Z

Fixed

krammnic · 2024-10-21T16:24:52Z

One more...

krammnic · 2024-10-21T16:29:51Z

fixed with flake

krammnic · 2024-10-21T16:31:36Z

@joecummings Sorry for all the lint failures, but I was not able to run pre-commit run --all-files because of the current fixes.

krammnic · 2024-10-22T17:30:40Z

@joecummings I think I found a solution: we need to use both # noqa and # fmt: skip. But I really don't like it.
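A minimal sketch of how the two kinds of markers combine, assuming Black-style formatting (which ufmt runs) and flake8 with the B950 line-length check. The token ids are illustrative values, not copied from the PR diff. The sketch uses the block form `# fmt: off` / `# fmt: on`; `# fmt: skip` is the single-line form of the same idea.

```python
# `# fmt: off` / `# fmt: on` ask the formatter to leave the enclosed lines
# untouched, so a long list is not reflowed (typically to one id per line).
# `# noqa: B950` asks flake8 to ignore the line-length warning on that line.

# fmt: off
expected_tokens = [
    1, 323, 418, 202, 31, 128, 15, 120, 47, 88, 584, 23, 1665, 182, 9, 434, 295, 85, 4, 780, 47, 636, 9, 1094, 213, 23, 9, 69, 69, 164, 2,  # noqa: B950
]
# fmt: on

# The list is ordinary Python; the markers are comments and do not change it.
print(len(expected_tokens))
```

The cost is two comments per list, one per tool, because the formatter and the linter each need their own suppression.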

krammnic · 2024-10-23T14:12:36Z

Oh, I broke something...

krammnic · 2024-10-23T14:18:42Z

I don't know what this is. These tests pass locally and the branch is up to date.

krammnic · 2024-10-23T16:18:26Z

Isn't this related to #1886? It's some weird failure with torchao.

krammnic · 2024-10-24T12:58:51Z

Can we restart CI here? Otherwise, I'm not sure how to fix this unrelated torchao failure.

krammnic · 2024-10-24T14:36:48Z

@felipemello1 @RdoubleA Could you comment on how to fix this torchao issue? It's really strange, and a CI rerun alone probably won't help.

RdoubleA · 2024-10-24T14:49:58Z

@krammnic have you merged from main after this PR was checked in? #1886

krammnic · 2024-10-24T16:55:08Z

@krammnic have you merged from main after this PR was checked in? #1886

Oh, I see. Let me merge it, yes.

krammnic · 2024-10-26T11:40:05Z

Fixed

krammnic · 2024-10-26T20:19:17Z

Can someone restart CI?

krammnic · 2024-10-29T18:37:55Z

resolved

krammnic · 2024-11-05T18:36:19Z

@RdoubleA Can we fix the Qwen2 and Qwen2.5 tests in a separate PR? I will open it immediately after we merge this one without the other models.

RdoubleA · 2024-11-13T14:27:18Z

@krammnic Sorry for the delay, that sounds good to me. Looks like we just need to resolve merge conflicts and we can land this.

RdoubleA · 2024-11-18T20:38:02Z

@krammnic Went ahead and did the merge with main, thanks again for your help!

make tokenize tests readable

8dd57b1

facebook-github-bot added the CLA Signed label Oct 19, 2024

joecummings reviewed Oct 21, 2024

fix formatting

68cb79f

fix lint

e7e11f2

krammnic added 2 commits October 21, 2024 12:26

more lint

b1079dc

fix lint

4eb2fb9

krammnic added 2 commits October 23, 2024 08:28

attempt to fix ufmt

b6a6da7

add fmt: on/off

d6cd349

fix torchao check

35d06ff

resolve conflict

f2a1743

krammnic added 2 commits November 2, 2024 08:13

resolve

c70a54e

fix qwen2

f8b93f4

krammnic force-pushed the tests-refactor branch from bdf3be3 to f8b93f4 Compare November 5, 2024 18:35

RdoubleA approved these changes Nov 13, 2024

Merge branch 'main' into tests-refactor

cc60b25

RdoubleA merged commit f31754f into pytorch:main Nov 18, 2024
17 checks passed

krammnic mentioned this pull request Dec 26, 2024

Nit: inconsistent linting in qwen2.5 tokenizer test #2208

Open
