Memory leak issue #10015

rkoystart · 2022-01-10T03:34:07Z

rkoystart
Jan 10, 2022

How to reproduce the behaviour

Hi, facing memory leakage issue for the following code.

import spacy
import psutil
import os


if __name__ == "__main__":
     data = open("1l_test.txt","r").read().splitlines() // test file that contains 100000 sentences
     model = spacy.load("en_core_web_md")
     while(True):
       for i in data:
           tokens =  model(i)
           with open("loop_leak.txt","a") as writer:
                process = psutil.Process(os.getpid())
                writer.write(str(process)+"\t"+ str(process.memory_info().rss/1000000)+"\n")
       break

Your Environment

[ravi@ravi spacy_api]$ python -m spacy info --markdown

## Info about spaCy

- **spaCy version:** 3.1.4
- **Platform:** Linux-4.18.0-240.10.1.el8_3.x86_64-x86_64-with-glibc2.10
- **Python version:** 3.8.3
- **Pipelines:** es_core_news_md (3.1.0), it_core_news_md (3.1.0), en_core_web_md (3.1.0), pt_core_news_md (3.1.0), el_core_news_md (3.1.0), fr_core_news_md (3.1.0), nl_core_news_md (3.1.0), de_core_news_md (3.1.0)

Operating System: Centos Linux
Python Version Used: 3.8.3
spaCy Version Used: 3.1.4

I have attached the loop_leak.txt file which has the information about the memory occupied by the process after each sentences being passed through spacy model. The same issue is also happening for spacy sm model for 2.3.4 spacy version also.

Also would like to know whether spacy is caching the results ?

@adrianeboyd any views on this or anything that you can help me out?
loop_leak.txt

Answered by adrianeboyd

Jan 10, 2022

The memory usage increases slightly during processing because the pipeline vocab in nlp.vocab is not static. The lexeme cache (nlp.vocab) and string store (nlp.vocab.strings) grow as texts are processed and new tokens are seen. The lexeme cache is just for speed, but the string store is needed to map the string hashes back to strings.

If you're saving Doc objects directly for future processing, you'd need the string store cache to know which token strings were in the docs, since the Doc object just includes the hashes (ints) for the tokens and not the strings.

If you're saving the resulting annotation/analysis in some other form, then you probably don't need to keep track of the full stri…

View full answer

adrianeboyd · 2022-01-10T11:29:48Z

adrianeboyd
Jan 10, 2022

The memory usage increases slightly during processing because the pipeline vocab in nlp.vocab is not static. The lexeme cache (nlp.vocab) and string store (nlp.vocab.strings) grow as texts are processed and new tokens are seen. The lexeme cache is just for speed, but the string store is needed to map the string hashes back to strings.

If you're saving Doc objects directly for future processing, you'd need the string store cache to know which token strings were in the docs, since the Doc object just includes the hashes (ints) for the tokens and not the strings.

If you're saving the resulting annotation/analysis in some other form, then you probably don't need to keep track of the full string store for all your docs at once. The recommended solution if the memory usage is a problem is to periodically reload the pipeline with spacy.load.

16 replies

rkoystart Mar 1, 2022
Author

Hi @adrianeboyd , currently we have gunicorn+flask (with preload option as true and multiple worker :assume around 12 workers) setup based application which will be getting the text from the input request and passing it to spacy pretrained model and returning the output obtained as the response. Since the gunicorn only forks the workers process , I think initially all the workers will be having the same spacy model. But as the worker starts processing the incoming request, each worker will start having its own copy of the complete spacy model (or atleast some part of the spacy related things) as it sees new tokens and adds them to the vocab. So after certain number of request handling , each worker will be having its own copy of the spacy model( or atleast its own copy of some portions of the spacy model). Imagine now I put a check before the forward pass that if the size of vocab and string store together is greater than say 10 lakh(suggestion provided by @chrishmorris ) , then assign model=spacy.load("en-core-web-md") and then do the forward pass and continue.
Will this affect the flow in the other workers request processing or the accuracy of the model in the other worker or provide some error output etc.
Is the suggestion provided by @chrishmorris preferred by you ?

svlandeg Mar 10, 2022
Maintainer

That suggestion is pretty much the same thing what Adriane wrote in her very first reply:

The recommended solution if the memory usage is a problem is to periodically reload the pipeline with spacy.load.

chrishmorris Mar 11, 2022

Yes, exactly the same, thanks Adriane, my contribution was merely to code it.

jey07 Mar 7, 2023

Hi @adrianeboyd , could you please direct me to the official documentation of entire process on what you had mentioned "lexeme cache (nlp.vocab) and string store (nlp.vocab.strings) grow as texts are processed a..."

thomashacker Mar 9, 2023

Hey jey07, thanks for the question!
We currently do not have this detail in the docs, but we agree that this could be valuable for users. We plan to add it to the docs in the future 👍

chrishmorris · 2022-02-22T15:55:45Z

chrishmorris
Feb 22, 2022

You can reload the model when it has grown too large:

_parse = spacy.load('en_core_sci_lg')
def parse(section):
    global _parse
    if 2**30< _parse.vocab.__sizeof__()+ _parse.vocab.strings.__sizeof__():
        _parse = spacy.load('en_core_sci_lg')
        logging.info('Reloaded the Spacy model')
    return _parse(section)

8 replies

adrianeboyd Mar 16, 2022

Garbage collection is complicated. Objects aren't necessarily garbage-collected immediately, and even if you run garbage collection manually, python doesn't necessarily release all the memory back to the OS.

You can see this with a short example:

import spacy
import os
import psutil
import gc


process = psutil.Process(os.getpid())
print("initial     ", process.memory_info().rss / 1000000)

nlp = spacy.load("en_core_web_lg")
print("load        ", process.memory_info().rss / 1000000)

del nlp
print("del         ", process.memory_info().rss / 1000000)

gc.collect()
print("gc.collect()", process.memory_info().rss / 1000000)

Local output:

initial      381.198336
load         1492.82816
del          1492.82816
gc.collect() 481.816576

It sounds like this may be too many models / too many workers for the amount of RAM? You could also trying automatically restarting workers after a certain number of requests, as mentioned in a related discussion here: #10496 (comment)

rkoystart Jun 14, 2022
Author

Thanks @adrianeboyd Will try it 👍

marzooq-unbxd Oct 3, 2022

Did this solve your issue? (len(self.model.vocab))
@rkoystart

and

@chrishmorris

_parse.vocab.sizeof() is basically a constant number, as far as i have seen

marzooq-unbxd Oct 3, 2022

According to https://stackoverflow.com/questions/59515740/how-to-find-the-vocabulary-size-of-a-spacy-model
_n_tokens_with_vectors = len(nlp.vocab.vectors)
_n_unique_word_vectors = len(nlp.vocab.vectors)
But I think these are static and dont change with time

polm Oct 4, 2022

The number of word vectors doesn't change after loading a model. The vocabulary size will typically grow over time, though the more text you process the more slowly it would be expected to grow, as fewer new words are added. If you have a vocabulary it should get larger if you process a new string, like blargfizzle ZXDFERJLSDROIE or something.

kaliaanup · 2022-10-11T16:40:37Z

kaliaanup
Oct 11, 2022

I need help in two aspects on pre-loading.

How to define a constraint? Defining a memory size is not correct as it is impossible to compute that.
I am running a Python based microservice where I am initiating the spacy model initially. The model seems to grow with incoming messages. I added a constrain e.g., if length of vocab > 100K, Spacy should be reloaded. However after reloading Spacy the vocab size stopped growing. Is there any issue? Did it create a new process or something? Ideally it should continue growing from the beginning. Can anyone check on this?

2 replies

polm Oct 12, 2022

Typically, if the memory use of the Vocabulary is an issue, it's enough to check the size (number of entries) and reload the model after it exceeds a threshold. You can also reload after a certain amount of time like a day or something.

If you reload a model and give it input, the vocabulary size should grow when it sees new tokens. For example, if you pass Zoo8PaiRNaez6sooUo4ahtha the model has probably never seen that before. You should be able to verify this.

Without knowing how your server is actually structured there's not really any way for us to explain why your vocab would stop growing, but it's possible you're using the old object somehow.

kaliaanup Oct 20, 2022

Thanks @polm. Will investigate.

RichJackson · 2023-06-20T21:16:29Z

RichJackson
Jun 20, 2023

having a built in memory leak seems pretty strange to me. Reloading the model isn't necessarily the easiest thing to do. Is there seriously no way to control the size of the vocab cache?

4 replies

svlandeg Jun 26, 2023
Maintainer

Hi Richard, it's not so much a "built in memory leak" rather than a "built in cache" that grows as texts are being processed.

We're aware that it's annoying to have to reload the model to resolve the memory usage. We've looked at alternatives, but ultimately it'd always come down to reloading - we can't just purge the cache because then various components would stop working (e.g. labels that are assumed to be in the cache etc)

holub008 Aug 20, 2024

From the perspective of a web developer integrating spacy into a service, this feels like a memory leak because it's subtle. It's neither apparent from spacy.load nor the top level docs that caching is happening. There is an appropriate warning on the Vocab docs, but I didn't make it there during review (my negligence!).

My application, and I imagine others, is taking ~arbitrary input text which results in unbounded cache growth. I didn't catch the memory issues in dev due to different RAM headspace. This seems like a landmine for anyone integrating spacy into a long-running application. I'm not sure how developers are to learn about this except RTFM, doing heavy-duty load testing in dev, or getting a nagios alert in production. And in the latter cases, tracing this leak is not fun.

Having some level of cache control would be awesome, but I can see how that's asking for a near implementation detail to be surfaced all the way up to top level initialization/loading in the API. And it sounds like it's technically quite difficult.

honnibal Sep 8, 2024
Maintainer

I'm not sure how developers are to learn about this except RTFM, doing heavy-duty load testing in dev, or getting a nagios alert in production. And in the latter cases, tracing this leak is not fun.

FWIW, I won't try to tell you that "RTFM" is a reasonable response here. See below for a suggested patch, which I'll backport to v3 and release as a dev version for testing.

holub008 Sep 8, 2024

Thanks a bunch-- I'll test that out in our application when available!

jricheimer · 2023-08-04T16:58:26Z

jricheimer
Aug 4, 2023

I seem to be running into this while training a transformer model on a very large dataset. Memory consumption continues to increase throughout the course of training until it runs out and gives OOM error.

It would be annoying to have to stop training periodically and then restart from the saved checkpoint (especially if we wanted to resume the learning rate etc at the point it was before).

Is there another solution to this that would work for model training? @adrianeboyd @svlandeg

1 reply

adrianeboyd Aug 7, 2023

Sorry, we don't currently have a solution that would work for model training. I can say that we've sometimes noticed similar problems during transformer training for the provided trf pipelines. The vocab issue could be a part of it, but it doesn't look like it's only the vocab because we haven't noticed similar problems for the same data with the sm/md/lg pipelines. I'm afraid that we've never figured out exactly what was contributing to the memory usage.

Rfank2021 · 2023-10-04T11:11:48Z

Rfank2021
Oct 4, 2023

It's not vocab problem, even passing the same text causes the memory to grow.

1 reply

adrianeboyd Oct 9, 2023

If you have time to test it, it might be useful to try spacy-curated-transformers instead of spacy-transformers, which could potentially help figure out whether the memory problem is in spacy or in another library like spacy-transformers or transformers.

We don't have all the user-friendly scaffolding set up for spacy init config with spacy-curated-transformers yet, but you can use the configs from any of the spacy v3.7 trf models like en_core_web_trf as a starting point. You should only need to swap out the [components.transformers] block for one with curated_transformer instead of transformer. Note that curated_transformer has a number of additional model-specific settings that need to be filled in in advance in the config.

zagorulkinde · 2024-07-19T11:14:45Z

zagorulkinde
Jul 19, 2024

Hello. How you can recommend organize REST service which planned to serve text requests and extract NER's in response? The main idea to serve without interruption as normal web service. We detected that memory consumption increase as long as service works.

0 replies

connorbrinton · 2024-09-06T21:24:12Z

connorbrinton
Sep 6, 2024

This is a pretty severe issue for my organization. We host SpaCy-based pipelines that process hundreds of gigabytes every day. One of our primary services uses about 2 GB of memory when it initially boots up, but leaks memory at a rate of 1.45 GB / hour. All it does is extract text from requests, process the text through a basic SpaCy pipeline and convert the results into a response. As a result, we need to reserve 8 GB of memory for each container so that it restarts only every 4 hours.

We actually attempted to implement logic to automatically reload the model when it started using too much memory, but it:

Ended up stalling all requests while the model was being reloaded, and
Significantly increased memory usage during the reloading process, making it difficult to time properly

We ended up abandoning this approach and just letting Kubernetes kill the pods when they attempt to use memory beyond their limits or the available memory on the machine.

We are currently planning to avoid SpaCy for new projects and move away from it for existing projects in order to avoid this issue.

In the spirit of trying to suggest solutions when reporting problems, I would love to see the following change to the architecture:

OOV Lexemes are "owned" by the Doc, not the Vocab (or maybe all Lexemes are owned by Docs?)

This would almost certainly be a SpaCy V4 change and would be difficult to implement. There are probably problems with the idea that I haven't thought of. But maybe it will get the right gears turning in someone's head to figure out the right solution 🙂

1 reply

honnibal Sep 9, 2024
Maintainer

Prerelease now up on PyPi. Please test: https://github.com/explosion/spaCy/releases/tag/prerelease-v3.8.0.dev0

honnibal · 2024-09-06T21:59:33Z

honnibal
Sep 6, 2024
Maintainer

This is something that we've gone back and forth on over the years. I would say first of all that yes it's reasonable to call this a memory leak, all things considered, and we need to do something about it.

The current behaviour is something of a regression compared to how we used to handle this. Previous versions of spaCy had a limit on the number of Lexeme objects stored in the vocab, with new ones after that being owned by the Doc object. We also had a pretty complicated attempt to divide the string store into new strings and old strings, so that we could flush away new strings.

However, having some Lexeme objects owned by the Doc and others owned by the Vocab caused bugs, so in the name of simplicity this behaviour was changed: #6809 . I believe in the intervening years there's probably also been a change to caching in the tokenizer that may be consuming memory as well.

The problem for the API has been that it doesn't know when it can remove data from the Vocab or StringStore once that data has been added, because the Vocab doesn't know what Doc objects are still around trying to reference the data.

I think what we should have is a context manager on the Language object. On exit of the context, spaCy is allowed to free any memory it's allocated in the meantime. So, something like this:

with nlp.memory_zone():
    docs = list(nlp.pipe(texts))
    do_stuff(docs)
# Accessing 'docs' is invalid now --- spaCy's allowed to free data that backs them at the end of the context

1 reply

honnibal Sep 8, 2024
Maintainer

5d0d2de

honnibal · 2024-09-09T14:20:22Z

honnibal
Sep 9, 2024
Maintainer

Prerelease up for testing: https://github.com/explosion/spaCy/releases/tag/prerelease-v3.8.0.dev0

0 replies

Memory leak issue #10015

How to reproduce the behaviour

Your Environment

Replies: 10 comments · 34 replies

rkoystart Mar 1, 2022 Author

svlandeg Mar 10, 2022 Maintainer

rkoystart Jun 14, 2022 Author

svlandeg Jun 26, 2023 Maintainer

honnibal Sep 8, 2024 Maintainer

honnibal Sep 9, 2024 Maintainer

honnibal Sep 6, 2024 Maintainer

honnibal Sep 8, 2024 Maintainer

honnibal Sep 9, 2024 Maintainer

Replies: 10 comments 34 replies

rkoystart Mar 1, 2022
Author

svlandeg Mar 10, 2022
Maintainer

rkoystart Jun 14, 2022
Author

svlandeg Jun 26, 2023
Maintainer

honnibal Sep 8, 2024
Maintainer

honnibal Sep 9, 2024
Maintainer

honnibal
Sep 6, 2024
Maintainer

honnibal Sep 8, 2024
Maintainer

honnibal
Sep 9, 2024
Maintainer