Dataset too large #6
Would it be possible to share the sentence that causes the error? I can run it locally to see if I can reproduce it. |
It is not a sentence, it is a list of lines from .txt files, so it is essentially a lot of sentences in a list and I want to evaluate them all |
And unfortunately I cannot share the dataset anywhere because it has PHI in it |
Ok. Probably something to do with the sequence length. Which model are you using to compute perplexity? Also, could you tell me the maximum number of characters in a single line of your file? |
I am fairly sure that it is 512, I think I need help on how to batch my dataset |
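One way to batch a large list of lines (a sketch, not taken from this thread) is to chunk the list manually and score one chunk at a time; the `lmppl` calls in the comments are assumptions based on the usage shown later in this thread:

```python
def chunked(lines, size):
    """Yield successive batches of at most `size` lines from a list."""
    for i in range(0, len(lines), size):
        yield lines[i:i + size]

# Hypothetical usage with lmppl (names assumed, not verified here):
#   scorer = lmppl.MaskedLM('path/to/local/bert')
#   scores = []
#   for batch in chunked(all_lines, 32):
#       scores.extend(scorer.get_perplexity(batch))
```

Scoring chunk by chunk also narrows down which region of the file triggers an error, since each chunk fails or succeeds independently.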
If you run it on a single instance, instead of passing a list, you should be able to find the line that causes the error.
|
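The suggestion above, scoring one line at a time to isolate the bad input, can be sketched as a small helper; the `lmppl` usage in the comment is an assumption:

```python
def first_failing_index(score_fn, lines):
    """Score each line individually; return (index, exception) for the
    first line that raises, or None if every line succeeds."""
    for i, line in enumerate(lines):
        try:
            score_fn(line)
        except Exception as exc:
            return i, exc
    return None

# Hypothetical usage:
#   scorer = lmppl.LM('roberta-base')  # or MaskedLM for BERT-style models
#   hit = first_failing_index(scorer.get_perplexity, lines)
#   if hit is not None:
#       print(f"line {hit[0]} raised: {hit[1]!r}")
```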
I am running a version of BERT, but the line that caused issues has only 129 characters in it |
Does the sentence contain only roman characters? |
It has dashes, commas, a period, and a colon, I don't know if those count, but other than that, yes |
I have pretrained BERT on this same dataset before with the same tokenizer and model. |
Could you try to compute perplexity on the same files but with |
Same error, same spot |
But it did take a little longer to get there, although that may be because it was loading a new model in |
So RoBERTa and BERT raise the same error on the same line, correct? |
Correct. |
If you could mask the letters in the sentence with a random character, would it be possible to share it here? For instance, if the sentence is |
I guess the letters themselves are not the root cause of the error, so for the purpose of debugging they are not important. |
Here: |
I can confirm that in my code, running just that sentence does reproduce the same error, so if that doesn't work for you, it may be that I have accidentally edited the code aside from replacing the transformer. |
It's working fine with the latest lmppl (0.3.1) indeed, as below.

```python
In [1]: from lmppl import LM
In [2]: lm = LM('roberta-base')
In [3]: text = "Aaaaaaaaa: Aaaaaaaaa aaaaaaa aaaa aaaaaaaa - AAA10-AA A19.94, Aaaaaaaaa aaaaa - AAA10-AA A19.10, Aaaaaaa aaaaaaaaaa - AAA10-"
In [4]: lm.get_perplexity(text)
Out[4]: 115701099.01106478
```
So I recopied run_mlm.py and then ran it as-is with roberta-base, and here is what I got:
|
I just ran your exact code and that worked. Can I send you my version of the code? (run_mlm.py, I mean) |
Ok, so with a little bit of modification I can run my whole dataset on the code that you sent. Now I just need to figure out how to modify that to use a different transformer without breaking it, I think. |
Although I do wonder why you chose to use the version for GPT variants rather than the one for BERT variants in your example |
If I import and use MaskedLM rather than LM, it breaks, not sure why though. |
Could you try following instead?
|
Ah wait, you're right. I should have used MaskedLM, not LM. |
Yeah, it's working without any issue.
|
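Tying up the LM-vs-MaskedLM point above: a minimal sketch of picking the scorer class by model family. The `lmppl` class names (`LM` for causal models like GPT-2, `MaskedLM` for BERT/RoBERTa) match the usage in this thread; the prefix heuristic itself is my own assumption, not part of the library:

```python
# Simple heuristic mapping model families to lmppl scorer class names.
# Causal (GPT-style) models -> "LM"; masked (BERT-style) models -> "MaskedLM".
CAUSAL_PREFIXES = ("gpt", "opt", "llama")

def scorer_class_name(model_name):
    """Return the lmppl class name to use for a given model id (heuristic)."""
    return "LM" if model_name.lower().startswith(CAUSAL_PREFIXES) else "MaskedLM"

# Hypothetical usage:
#   import lmppl
#   cls = getattr(lmppl, scorer_class_name('roberta-base'))  # -> MaskedLM
#   scorer = cls('roberta-base')
#   scorer.get_perplexity(["first line", "second line"])
```

Using the wrong class, as discussed above, either raises an error or silently produces a perplexity computed with the wrong kind of evaluation.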
Ok, so that simple example works, but then when I tried to loop through it like so:
it returns another tensor error on the very same text that we were just testing. It gets past the first two loops but then breaks on the third, which is the line we have been working on:
|
It does not return any errors on the same exact code but with LM, but it also returns a perplexity number that is wildly incorrect because it isn't the right type of evaluation for the model |
I am using the run_mlm.py file, but I have my own copy because I changed where the tokenizer points, since it is at a different path from the model, which is local.
While initially working with this method, I used the first two lines of my dataset and it was working just fine, but now that I have expanded the input, I am getting this error: