Change from concurrent.futures to multiprocessing #354
What does this PR do?
Change from a higher-level interface (concurrent.futures.ProcessPoolExecutor) to a lower-level one (multiprocessing.pool.Pool) because the pool deadlocks when submitting many tasks (python/cpython#105829): when many tasks are submitted to a concurrent.futures.ProcessPoolExecutor pool, there is a chance that deadlocks will occur with CPython. The same example run with multiprocessing.pool.Pool had no such problem.

Trying to find out the best num_processes
I tested tokenization with various num_processes (using RegexTokenization):
linux, fork
Add-on: I re-ran the code. This time, 16 had the best performance in all cases.
Based on the results, I believe 16 is a reasonable choice for num_processes. For small datasets, a difference of 1 to 2 seconds is negligible; for large datasets like AmazonCat-13K, 16 had the shortest running time among the settings tested.
Having said that, the results are device- and system-specific, so the best choice of num_processes might differ, for example, on an Intel CPU or on Windows (I'm using an AMD server CPU and Linux).
I also tested multiprocessing on Windows. Since Windows and macOS don't support "fork" as a start method, runs take longer there using "spawn" as the start method (spawn takes more time to start than fork).
win32, spawn
I also tested spawn on Linux.
linux, spawn
It turned out that supporting multiprocessing is more complicated than I thought, so I'll limit the use of multiprocessing to Linux only.
Test CLI & API (bash tests/autotest.sh): Test APIs used by main.py.
Check API Document
If any new APIs are added, please check if the description of the APIs is added to API document.
Test quickstart & API (bash tests/docs/test_changed_document.sh): If any APIs in quickstarts or tutorials are modified, please run this test to check whether the current examples still run correctly after the modified APIs are released.