Memory leak issue #10015
-
How to reproduce the behaviour
Hi, I'm facing a memory leak issue with the following code.
Your Environment
I have attached the environment details. Also, I would like to know whether spaCy is caching the results? @adrianeboyd, any views on this, or anything you can help me out with?
-
The memory usage increases slightly during processing because the pipeline vocab in `nlp.vocab` is not static. The lexeme cache (`nlp.vocab`) and string store (`nlp.vocab.strings`) grow as texts are processed and new tokens are seen. The lexeme cache is just for speed, but the string store is needed to map the string hashes back to strings. If you're saving `Doc` objects directly for future processing, you'd need the string store cache to know which token strings were in the docs, since the `Doc` object just includes the hashes (ints) for the tokens and not the strings. If you're saving the resulting annotation/analysis in some other form, then you probably don't need to keep track of the full string store for all your docs at once. The recommended solution if the memory usage is a problem is to periodically reload the pipeline with `spacy.load`.
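A small sketch that makes the growth visible; the pipeline name `en_core_web_sm` and the example text are illustrative:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
print("strings before:", len(nlp.vocab.strings))

# Each previously unseen token adds an entry to the string store,
# and nothing is ever evicted from it.
nlp("Some entirely novel words: frobnicate zyxwvut glarbleflux")
print("strings after:", len(nlp.vocab.strings))
```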
-
You can reload the model when it has grown too large:
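A minimal sketch of this pattern, assuming a plain loop over input texts; the model name, the placeholder `texts` list, and the reload interval are all illustrative:

```python
import spacy

MODEL = "en_core_web_sm"   # illustrative; use your own pipeline
RELOAD_EVERY = 10_000      # tune to your memory budget

texts = ["This is a stand-in for your real input stream."] * 50_000

nlp = spacy.load(MODEL)
for i, text in enumerate(texts):
    doc = nlp(text)
    # ... extract and store whatever you need from `doc` here ...
    if (i + 1) % RELOAD_EVERY == 0:
        # Dropping the old pipeline frees its vocab and string store.
        nlp = spacy.load(MODEL)
```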
-
I need help with two aspects of pre-loading.
-
Having a built-in memory leak seems pretty strange to me. Reloading the model isn't necessarily the easiest thing to do. Is there seriously no way to control the size of the vocab cache?
-
I seem to be running into this while training a transformer model on a very large dataset. Memory consumption keeps increasing throughout training until it runs out and gives an OOM error. It would be annoying to have to stop training periodically and then restart from the saved checkpoint (especially if we wanted to resume the learning rate etc. at the point it was before). Is there another solution to this that would work for model training? @adrianeboyd @svlandeg
-
It's not a vocab problem; even passing the same text repeatedly causes the memory to grow.
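One way to sanity-check this is to watch the process RSS while feeding the pipeline the same text over and over; a sketch assuming `psutil` is installed and an illustrative pipeline name:

```python
import psutil
import spacy

nlp = spacy.load("en_core_web_sm")
process = psutil.Process()

text = "The same sentence, processed over and over again."
for i in range(50_000):
    nlp(text)
    if i % 10_000 == 0:
        # RSS should stay flat after the first pass if nothing is cached.
        rss_mb = process.memory_info().rss / 1e6
        print(f"iteration {i}: RSS {rss_mb:.1f} MB")
```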
-
Hello. How would you recommend organizing a REST service that serves text requests and returns extracted named entities in the response? The main goal is to run without interruption, like a normal web service. We have found that memory consumption keeps increasing for as long as the service runs.
-
This is a pretty severe issue for my organization. We host spaCy-based pipelines that process hundreds of gigabytes every day. One of our primary services uses about 2 GB of memory when it initially boots up, but leaks memory at a rate of 1.45 GB / hour. All it does is extract text from requests, process the text through a basic spaCy pipeline, and convert the results into a response. As a result, we need to reserve 8 GB of memory for each container so that it restarts only every 4 hours. We actually attempted to implement logic to automatically reload the model when it started using too much memory, but it:
We ended up abandoning this approach and just letting Kubernetes kill the pods when they attempt to use memory beyond their limits or the available memory on the machine. We are currently planning to avoid spaCy for new projects and move away from it for existing projects in order to avoid this issue. In the spirit of trying to suggest solutions when reporting problems, I would love to see the following change to the architecture:
This would almost certainly be a spaCy v4 change and would be difficult to implement. There are probably problems with the idea that I haven't thought of. But maybe it will get the right gears turning in someone's head to figure out the right solution 🙂
-
This is something that we've gone back and forth on over the years. I would say first of all that yes, it's reasonable to call this a memory leak, all things considered, and we need to do something about it. The current behaviour is something of a regression compared to how we used to handle this. Previous versions of spaCy had a limit on the number of lexemes kept in the vocab cache. However, having some fixed limit causes its own problems, because entries can be evicted while they are still needed. The problem for the API has been that it doesn't know when it can remove data from the string store: a `Doc` only stores hashes, so if a string is dropped while a `Doc` that refers to it is still alive, the hash can no longer be mapped back to text. I think what we should have is a context manager on the `nlp` object, so that strings and lexemes added inside the block can be safely freed when the block exits:
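A sketch of how such a context manager could look in use; `memory_zone` is the name used in the v3.8 prerelease linked below, and the batch source and entity handling here are illustrative:

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def batches_of_texts():
    # Stand-in for your real input stream.
    yield ["First batch of texts ..."]
    yield ["Second batch of texts ..."]

for batch in batches_of_texts():
    with nlp.memory_zone():
        for doc in nlp.pipe(batch):
            # Use the annotations inside the zone; strings and lexemes
            # added here are released when the zone exits, so don't keep
            # references to these Doc objects afterwards.
            print([(ent.text, ent.label_) for ent in doc.ents])
```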
-
Prerelease up for testing: https://github.com/explosion/spaCy/releases/tag/prerelease-v3.8.0.dev0