Parallelizing spaCy's en_core_web_lg Model Across Multiple Nodes Using MPI #13637
Unanswered · hecmar007 asked this question in Help: Coding & Implementations
Hi,
I'm currently working on extracting subject-verb-object (SVO) triplets from sentences in two input documents using spaCy. I want to run this process across multiple computing nodes, so I've started looking into parallelization with MPI. However, I've run into a significant issue with memory usage and data sharing between processes.
Problem:
To process the text, I'm using the en_core_web_lg model. The issue is that loading the model in every process makes me run out of main memory before the task completes. To mitigate this, I tried isolating the parts of the code that need the model into a single process, intending to broadcast the results to the other processes for further computation. However, the results (a dictionary of spaCy Token objects) cannot be serialized, so they can't be shared across processes.
Questions:
Is there a better way to parallelize spaCy's en_core_web_lg model without duplicating it in every process? For instance, could shared memory be used to avoid loading the model separately for each process?
Alternatively, is there a way to avoid serializing the resulting dictionaries of tokens so they can be broadcast or shared between processes?
Here's a simplified sketch of my code for context (the helper extract_svo and the input file name are placeholders standing in for my actual code):
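```python
from mpi4py import MPI
import spacy

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

def extract_svo(span):
    # Placeholder for my actual logic: collect (subject, verb, object)
    # Token triples by walking each verb's dependency children.
    triplets = []
    for token in span:
        if token.pos_ == "VERB":
            subjects = [c for c in token.children if c.dep_ in ("nsubj", "nsubjpass")]
            objects = [c for c in token.children if c.dep_ == "dobj"]
            triplets.extend((s, token, o) for s in subjects for o in objects)
    return triplets

results = None
if rank == 0:
    # Only rank 0 loads the large model, to avoid one copy per process.
    nlp = spacy.load("en_core_web_lg")
    with open("input1.txt") as f:  # illustrative file name
        doc = nlp(f.read())
    # Dictionary of sentence index -> list of Token triples.
    results = {i: extract_svo(sent) for i, sent in enumerate(doc.sents)}

# This is where it breaks: Token objects are only views into the parent
# Doc, so pickling them for the broadcast fails.
results = comm.bcast(results, root=0)
```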