Memory leak of MorphAnalysis object. #13684
We have observed similar issues in our pipeline. As you can see in this minimal example with the da_core_news_md model, the vocab keeps growing:

```python
import spacy

nlp = spacy.load("da_core_news_md")
test_texts = [
    "Varmere vintre: Flere trækfugle forurener søerne",
    "De højere vintertemperaturer giver problemer for landets søer.",
    "Blandt andet fordi flere trækfugle sover på vandet.",
    "I 1980'erne var der omkring 200 grågæs i Danmark om vinteren.",
    "I dag kan der være helt op mod 100.000.",
]
for text in test_texts:
    print("Vocab size before nlp:", len(nlp.vocab))
    with nlp.memory_zone():
        doc = nlp(text)
        print("Vocab size after nlp:", len(nlp.vocab))
    print("Vocab size out of memory zone:", len(nlp.vocab))
```

Output:

```
Vocab size before nlp: 2269
Vocab size after nlp: 2275
Vocab size out of memory zone: 2275
Vocab size before nlp: 2275
Vocab size after nlp: 2283
Vocab size out of memory zone: 2283
Vocab size before nlp: 2283
Vocab size after nlp: 2291
Vocab size out of memory zone: 2291
Vocab size before nlp: 2291
Vocab size after nlp: 2300
Vocab size out of memory zone: 2300
Vocab size before nlp: 2300
Vocab size after nlp: 2308
Vocab size out of memory zone: 2308
```
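For context on why this output is surprising: as I understand the memory-zone contract, strings and vocab entries created inside `Language.memory_zone()` are supposed to be transient and reclaimed when the zone exits. A minimal sketch of that expected behaviour, using a blank pipeline and a made-up string rather than the Danish model:

```python
import spacy

nlp = spacy.blank("da")
before = len(nlp.vocab.strings)
with nlp.memory_zone():
    # Strings interned inside the zone should be transient...
    nlp.vocab.strings.add("some-transient-string")
# ...so the count should drop back once the zone closes.
print(len(nlp.vocab.strings) == before)
```

In the runs above, the vocab never shrinks back after the zone exits, which suggests the entries created during processing are not being released.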
When trying to modify and then read back MorphAnalysis, a StringStore hash lookup fails:

```python
for text in test_texts:
    with nlp.memory_zone():
        doc = nlp(text)
        for token in doc:
            morph_str = str(token.morph)
            if "Definite" in morph_str:
                definite = token.morph.get("Definite")[0]
                new_morph_str = morph_str.replace(definite, "foo")
                token.set_morph(new_morph_str)
                token.morph.get("Definite")
```

Output:

```
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
Cell In[24], line 20
     18 new_morph_str = morph_str.replace(definite, "foo")
     19 token.set_morph(new_morph_str)
---> 20 token.morph.get("Definite")
File ~/.venv/lib/python3.11/site-packages/spacy/tokens/morphanalysis.pyx:71, in spacy.tokens.morphanalysis.MorphAnalysis.get()
File ~/.venv/lib/python3.11/site-packages/spacy/strings.pyx:162, in spacy.strings.StringStore.__getitem__()
KeyError: "[E018] Can't retrieve string for hash '6324204924076910789'. This usually refers to an issue with the `Vocab` or `StringStore`." |
I have encountered a crucial bug that makes continuous tokenization with the Japanese tokenizer close to impossible. It is all due to a memory leak of MorphAnalysis.
How to reproduce the behaviour
Run this script and observe how memory keeps growing:
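(The script itself was not preserved in this capture; below is a minimal sketch of the kind of loop I mean, assuming a blank Japanese pipeline, which requires SudachiPy, and `resource.getrusage` for a rough memory reading; note that `ru_maxrss` is kilobytes on Linux and bytes on macOS.)

```python
import resource

import spacy

nlp = spacy.blank("ja")  # the Japanese tokenizer assigns token.morph internally
text = "すもももももももものうち。"
for i in range(100_000):
    doc = nlp(text)
    if i % 10_000 == 0:
        rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
        print(i, "max RSS:", rss)  # keeps growing if MorphAnalysis memory leaks
```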
It all happens due to this line in the Japanese tokenizer:

```python
token.morph = MorphAnalysis(self.vocab, morph)
```

I have checked the implementation itself: there is neither deallocation code implemented, nor does it support memory_zone.
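To observe the growth without a full pipeline, here is a hedged sketch that constructs `MorphAnalysis` objects directly and watches the `StringStore` grow (the exact counts are illustrative):

```python
import spacy
from spacy.tokens import MorphAnalysis

nlp = spacy.blank("en")
before = len(nlp.vocab.strings)
for i in range(1000):
    # Each analysis with a previously unseen feature value interns new strings
    # into vocab.strings; nothing ever removes them.
    MorphAnalysis(nlp.vocab, {"Feat": f"Val{i}"})
print("strings added:", len(nlp.vocab.strings) - before)
```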