Memory leak of MorphAnalysis object. #13684

hynky1999 · 2024-11-04T18:18:58Z

I have encountered a critical bug that makes continuous tokenization with the Japanese tokenizer close to impossible. It is caused by a memory leak in MorphAnalysis.

How to reproduce the behaviour

import gc
import tracemalloc

import spacy

tracemalloc.start()
tokenizer = spacy.blank("ja")
tokenizer.add_pipe("sentencizer")

for _ in range(1000):
    text = " ".join(["a"] * 1000)
    # Snapshot before processing, so each iteration reports only its own growth
    snapshot = tracemalloc.take_snapshot()
    with tokenizer.memory_zone():
        doc = tokenizer(text)
        tokenizer.max_length = len(text) + 10
    # Force a collection so that only memory that is still retained shows up
    gc.collect()
    snapshot2 = tracemalloc.take_snapshot()
    # Compare the two snapshots
    p_stats = snapshot2.compare_to(snapshot, "lineno")
    # Print those of the top 10 differences that grew
    print("[ Top 10 ]")
    for stat in p_stats[:10]:
        if stat.size_diff > 0:
            print(stat)

Run this script and observe how memory keeps growing from one iteration to the next.

It all happens because of this line:
token.morph = MorphAnalysis(self.vocab, morph)
I have checked the implementation itself: there is no deallocation code, and it does not support memory_zone either.
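
For anyone who wants a single number per iteration instead of a snapshot diff, here is a minimal sketch of the same measurement using only the standard library. The helper name `traced_growth` and its parameters are my own and not part of spaCy. It only reports what tracemalloc can see, so treat the values as an indication and not an exact accounting.

```python
import gc
import tracemalloc
from typing import Callable, List


def traced_growth(step: Callable[[], None], iterations: int = 100) -> List[int]:
    """Return the bytes still traced after each call to step(), relative to the start.

    Hypothetical helper for illustration: a steadily rising series suggests
    that step() retains memory across calls.
    """
    was_tracing = tracemalloc.is_tracing()
    if not was_tracing:
        tracemalloc.start()
    try:
        gc.collect()
        baseline, _peak = tracemalloc.get_traced_memory()
        growth: List[int] = []
        for _ in range(iterations):
            step()
            # Collect garbage first so that only retained memory is counted
            gc.collect()
            current, _peak = tracemalloc.get_traced_memory()
            # The result list itself is traced too; it adds a few KB at most
            growth.append(current - baseline)
        return growth
    finally:
        # Leave tracing in the state the caller had it in
        if not was_tracing:
            tracemalloc.stop()
```

To apply it to the script above, `step` would be a small function that runs `tokenizer(text)` inside `tokenizer.memory_zone()`. If the zone released everything, the series should flatten out after the first few iterations.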


lise-brinck · 2024-11-15T09:47:57Z

We have observed similar issues in our pipeline. As this minimal example with the da_core_news_md model shows, the vocab keeps growing, even after leaving the memory zone:

import spacy

nlp = spacy.load("da_core_news_md")

test_texts = [
    "Varmere vintre: Flere trækfugle forurener søerne",
    "De højere vintertemperaturer giver problemer for landets søer.",
    "Blandt andet fordi flere trækfugle sover på vandet.",
    "I 1980'erne var der omkring 200 grågæs i Danmark om vinteren.",
    "I dag kan der være helt op mod 100.000.",
]

for text in test_texts:
    print("Vocab size before nlp:", len(nlp.vocab))
    with nlp.memory_zone():
        doc = nlp(text)
        print("Vocab size after nlp:", len(nlp.vocab))
    print("Vocab size out of memory zone:", len(nlp.vocab))

Output:

Vocab size before nlp: 2269
Vocab size after nlp: 2275
Vocab size out of memory zone: 2275
Vocab size before nlp: 2275
Vocab size after nlp: 2283
Vocab size out of memory zone: 2283
Vocab size before nlp: 2283
Vocab size after nlp: 2291
Vocab size out of memory zone: 2291
Vocab size before nlp: 2291
Vocab size after nlp: 2300
Vocab size out of memory zone: 2300
Vocab size before nlp: 2300
Vocab size after nlp: 2308
Vocab size out of memory zone: 2308
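
The same check can be collected into rows instead of printed, which makes the retained entries per text easier to compare. This is a rough sketch: `ZoneSizes` and `vocab_sizes_per_text` are hypothetical names of mine, and the function only assumes that `nlp` is callable and has a `vocab` with a length and a `memory_zone()` context manager, as in the loop above.

```python
from typing import Any, Iterable, List, NamedTuple


class ZoneSizes(NamedTuple):
    """Vocab sizes around one memory zone (hypothetical helper type)."""

    before: int
    inside: int
    after: int

    @property
    def retained(self) -> int:
        # Entries that are still in the vocab after the zone has been left
        return self.after - self.before


def vocab_sizes_per_text(nlp: Any, texts: Iterable[str]) -> List[ZoneSizes]:
    """Process each text in its own memory zone and record len(nlp.vocab)."""
    rows: List[ZoneSizes] = []
    for text in texts:
        before = len(nlp.vocab)
        with nlp.memory_zone():
            nlp(text)
            inside = len(nlp.vocab)
        # Measured after the zone exits, where transient entries should be gone
        after = len(nlp.vocab)
        rows.append(ZoneSizes(before, inside, after))
    return rows
```

With the numbers from the output above, the first text would give before=2269, inside=2275, after=2275, so 6 entries are retained. If the zone cleaned up the vocab, I would expect `retained` to be 0 for every row.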

When modifying a token's MorphAnalysis and then accessing it inside the memory zone, a KeyError is raised for a hash in the StringStore:

for text in test_texts:
    with nlp.memory_zone():
        doc = nlp(text)
        for token in doc:
            morph_str = str(token.morph)
            if "Definite" in morph_str:
                definite = token.morph.get("Definite")[0]
                new_morph_str = morph_str.replace(definite, "foo")
                token.set_morph(new_morph_str)
            token.morph.get("Definite")

Output:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[24], line 20
     18     new_morph_str = morph_str.replace(definite, "foo")
     19     token.set_morph(new_morph_str)
---> 20 token.morph.get("Definite")

File ~/.venv/lib/python3.11/site-packages/spacy/tokens/morphanalysis.pyx:71, in spacy.tokens.morphanalysis.MorphAnalysis.get()

File ~/.venv/lib/python3.11/site-packages/spacy/strings.pyx:162, in spacy.strings.StringStore.__getitem__()

KeyError: "[E018] Can't retrieve string for hash '6324204924076910789'. This usually refers to an issue with the `Vocab` or `StringStore`."
