Change from concurrent.futures to multiprocessing #354
What does this PR do?
Change from a higher-level interface (concurrent.futures.ProcessPoolExecutor) to a lower-level one (multiprocessing.pool.Pool) because the pool deadlocks when submitting many tasks (python/cpython#105829): when many tasks are submitted to a concurrent.futures.ProcessPoolExecutor pool, there is a chance that deadlocks will occur with CPython. The same example run with multiprocessing.pool.Pool had no such problem.

Trying to find out the best num_processes
I tested tokenization with various num_processes (using RegexTokenization):
linux, fork
Add-on: I re-ran the code. This time, 16 had the best performance in all cases.
Based on the results, I believe 16 is a reasonable choice for num_processes. For small datasets, a difference of 1 to 2 seconds is negligible; for large datasets like AmazonCat-13K, 16 had the shortest running time among the settings tested.
Having said that, the results are device- and system-specific, so the best choice of num_processes might differ, for example, on an Intel CPU or on Windows (I'm using an AMD server CPU and Linux).
I also tested multiprocessing on Windows. Since Windows and macOS don't support "fork" as a start method, runs take longer there using "spawn" as the start method (spawn takes more time to start than fork).
win32, spawn
I also tested spawn on Linux.
linux, spawn
It turned out that supporting multiprocessing is more complicated than I thought, so I'll limit the use of multiprocessing to Linux only.
Test CLI & API (bash tests/autotest.sh): Test APIs used by main.py.
Check API Document
If any new APIs are added, please check if the description of the APIs is added to API document.
Test quickstart & API (bash tests/docs/test_changed_document.sh): If any APIs in quickstarts or tutorials are modified, please run this test to check whether the current examples still run correctly after the modified APIs are released.