-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Convert OpenMP parallelization to OneAPI::TBB #6626
base: main
Are you sure you want to change the base?
Conversation
Thanks for submitting this pull request! The maintainers of this repository would appreciate if you could update the CHANGELOG.md based on your changes. |
Ok, I ran my tests in my development environment. Guess I should use the docker containers to replicate the CI environment and figure out those tests. |
@dbs4261 Thanks for working on this! I just tested this PR on my Mac and got numerous TBB related compilation errors. I tried using the Homebrew version of TBB as well as the "build from source" configuration. There appear to be functions that this PR uses that are missing from the Homebrew and "build from source" versions of TBB on Mac. I know this PR is still draft but wanted to report what I had found. Please let me know if you need any help testing/diagnosing issues on Mac. |
Hi @ssheorey this PR likely wont fix that issue as I haven't yet changed how the TBB dependency is being accessed. This is likely also why @errissa is facing issues building on Mac. @errissa is homebrew pulling the OneAPI version of TBB? If you can provide me with the version of TBB you tried and the compiler errors I can take a look and figure out which version is required and work that into the PR. |
@dbs4261 yes, you are right about not fixing #6544. We should update to the latest oneTBB as part of this PR to fix that though. This is the latest version of oneTBB and is available for all platforms on github: https://github.com/oneapi-src/oneTBB/releases/tag/v2021.11.0 The naming is off - this was released in Nov 2023. I think this should also resolve @errissa 's issues on macOS. |
I agree that setting the version requirement for TBB should be part of this PR. Based on the ubuntu failure in CI, its the |
Looks like the minimum version requirement for |
Hi @dbs4261 , our usual policy is to upgrade to the latest version available, but set minimum version to what is required to make everything work. This helps to "future-proof" the updated code as much as possible by incorporating the latest bugfixes. Official binaries will be built with the latest version, but also allows the library to build on older versions by users. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[Initial look]
3rdparty/mkl/tbb.cmake
Outdated
@@ -26,13 +26,10 @@ find_package(Git QUIET REQUIRED) | |||
ExternalProject_Add( | |||
ext_tbb | |||
PREFIX tbb | |||
URL https://github.com/wjakob/tbb/archive/141b0e310e1fb552bdca887542c9c1a8544d6503.tar.gz # Sept 2020 | |||
URL_HASH SHA256=bb29b76eabf7549660e3dba2feb86ab501469432a15fb0bf2c21e24d6fbc4c72 | |||
URL https://github.com/oneapi-src/oneTBB/archive/refs/tags/v2021.4.0.tar.gz |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we upgrade to the latest? v2021.11.0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No reason why not. I just put in the older version that had all the features I used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a merge conflict here. The CI can run only after it's fixed.
func(i); | ||
} | ||
tbb::parallel_for(tbb::blocked_range<int64_t>(0, n, 32), | ||
[&func](const tbb::blocked_range<int64_t>& range) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How many threads will be used here? Currently, it's estimated with utility::EstimateMaxThreads()
which gives us one thread per core (excluding hyperthreading).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, avoid using "magic numbers" (32). I think you have a GetDefaultChunkSize()
function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It will use up to the number of threads in task arena that called it. As for the chunk size, see my other comment.
return ""; | ||
} | ||
} | ||
int EstimateMaxThreads() { return tbb::this_task_arena::max_concurrency(); } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we use the number of cores (not number of HW threads)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, the number of tasks is determined by the caller. A caller could be using a small task arena to deal with IO, while a larger arena deals with processing something else. This actually brings up an issue that I don't yet know how to solve. TBB sets the maximum concurrency with a C++ variable that follows scope rules but doesn't need to be passed to functions. So I don't know how a python user would set the concurrency limit yet. I think it might need to be done with some sort of context manager. But I guess this change behavior in an environment where the number of threads was limited with the OpenMP environment variable.
return 1; | ||
#endif | ||
std::size_t& DefaultGrainSizeTBB() noexcept { | ||
static std::size_t GrainSize = 256; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you comment on how this value was selected? Did you see any performance differences for this value versus other values?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Honestly, I was guessing at grain size from this, but it really should be picked based off of profiling. My understanding is that the grain size provides loose guidance to TBB's automatic chunking mechanism. It works similarly to omp schedule(guided)
. Overall the goal is to provide plenty of work to each thread so the overhead of chunking is minimized, but small enough chunks that the scheduler can go back in a steal some if one of the threads gets held up. It might be worth taking another pass through the grain sizes that I put in and set them as a magic number times the DefaultGrainSizeTBB
(which is mutable). That way the chunk size could be higher for doing a single operation with tensors, and smaller when looping through complex sections like in RANSAC.
[Notes about linking and binary distribution] For linking TBB, recommendation is to link dynamically. For C++ binaries and applications, we will distribute TBB DLL along with the Open3D DLL. For Python, TBB libraries are available through PyPI, so we can add these as dependencies to |
@@ -15,30 +18,57 @@ namespace utility { | |||
class ProgressBar { | |||
public: | |||
ProgressBar(size_t expected_count, | |||
const std::string &progress_info, | |||
std::string progress_info, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why has the const been removed here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It has to be copied into the object, so it passed by value into the constructor and then by move into the member variable.
…gh not using the oneapi scope). Untested but building.
…in conjunction with tbb parallel constructs.
…of self intersecting triangles.
…he progress bar into its own function and a bulk inplace add function operator+=. Also added TBBProgressBar. It does not inherit from ProgressBar as it uses an atomic for counting and has slightly different internals to use that atomicity.
…gress bar to limit spinning on the mutex.
…versions of format wont automatically convert it to its underlying type.
…::spin_mutex::scoped_lock.
…e done to prevent assignment to the output pointer.
… global mutex from utilities::random.
Is there anything that can be done to fix the two failing tests?
|
I can take a look, they are both in the python library, correct? |
Yes, I guess so. You can find them here:
|
OpenMP acceleration has been migrated to use oneapi::TBB.
Type
Motivation and Context
Many components of Open3D imply an eventual shift away from OpenMP to TBB. This includes some sections where tbb is only used on one platform as 2D loop unrolling isn't supported on Win32. Lastly, by using multiple parallelization paradigms, nested parallelism is problematic. When using some Open3D methods from a TBB context, an OpenMP thread pool is created for each TBB thread.
Checklist:
python util/check_style.py --apply
to apply Open3D code styleto my code.
updated accordingly.
results (e.g. screenshots or numbers) here.
Description
Updated parallel for sections to use tbb::parallel_for. Adapted most loops that performed reductions with either omp reduction clauses or with critical sections to tbb::parallel_reduce implementations. Some of which required custom reduction objects instead of using lambdas. Added an atomic version of the ProgressBar for use with TBB.
There is still work to be done in documentation. This will break any user code that directly uses
ParallelForCPU
as OpenMP critical sections will no longer work. Additionally, TBB has no approach for setting the maximum number of threads like OpenMP does withOMP_NUM_THREADS
. In C++ code atbb::global_control
object could be used, but it is unclear to how to provide that sort of functionality for python users.