Parallelizing spaCy's en_core_web_lg Model Across Multiple Nodes Using MPI #13637
Unanswered · hecmar007 asked this question in Help: Coding & Implementations
Hi,
I'm currently working on extracting subject-verb-object (SVO) triplets from sentences in two input documents using spaCy. I want to run this process across multiple computing nodes, so I've started looking into parallelization with MPI. However, I've run into a significant issue with memory usage and data sharing between processes.
Problem:
To process the text, I'm using the en_core_web_lg model. The issue is that loading the model in every process makes me run out of main memory before the task completes. To mitigate this, I tried isolating the parts of the code that need the model into a single process, intending to broadcast the results to the other processes for further computation. However, the results (a dictionary of spaCy Token objects) cannot be serialized, so they can't be shared across processes.
Questions:
Is there a better way to parallelize spaCy's en_core_web_lg model without duplicating it in every process? For instance, could shared memory be used to avoid loading the model separately for each process?
Alternatively, is there a way to avoid serializing the resulting dictionaries of tokens so they can be broadcast or shared between processes?
Here's a simplified sketch of my code for context (the helper extract_svo and the input file name are placeholders standing in for my actual code):
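```python
from mpi4py import MPI
import spacy

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

def extract_svo(span):
    # Placeholder for my actual logic: collect (subject, verb, object)
    # Token triples by walking each verb's dependency children.
    triplets = []
    for token in span:
        if token.pos_ == "VERB":
            subjects = [c for c in token.children if c.dep_ in ("nsubj", "nsubjpass")]
            objects = [c for c in token.children if c.dep_ == "dobj"]
            triplets.extend((s, token, o) for s in subjects for o in objects)
    return triplets

results = None
if rank == 0:
    # Only rank 0 loads the large model, to avoid one copy per process.
    nlp = spacy.load("en_core_web_lg")
    with open("input1.txt") as f:  # illustrative file name
        doc = nlp(f.read())
    # Dictionary of sentence index -> list of Token triples.
    results = {i: extract_svo(sent) for i, sent in enumerate(doc.sents)}

# This is where it breaks: Token objects are only views into the parent
# Doc, so pickling them for the broadcast fails.
results = comm.bcast(results, root=0)
```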