Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix get_stack_base in linux #1389

Closed

Conversation

bongkyu7-kim
Copy link
Contributor

Description

In case of pthread_attr_getstack() returns np_stack_size = 1MB,

  • calculated stack_base = stack_limit + stack_size(default: 4MB)
  • real stack_base = stack_limit + np_stack_size(1MB)

stack_base is wrong because calculated with 4MB default stack size.

Due to this wrong stack_base, task_dispatcher::can_steal() can be always false.
It causes abnormal long loop in task_dispatcher::receive_or_steal_task().

Correct this problem using np_stack_size instead of stack_size.

Fixes # - issue number(s) if exists

  • - git commit message contains an appropriate signed-off-by string (see CONTRIBUTING.md for details)

Type of change

Choose one or multiple, leave empty if none of the other choices apply

Add a respective label(s) to PR if you have permissions

  • bug fix - change that fixes an issue
  • new feature - change that adds functionality
  • tests - change in tests
  • infrastructure - change in infrastructure and CI
  • documentation - documentation update

Tests

  • added - required for new features and some bug fixes
  • not needed

Documentation

  • updated in # - add PR number
  • needs to be updated
  • not needed

Breaks backward compatibility

  • Yes
  • No
  • Unknown

Notify the following users

List users with @ to send notifications

Other information

In case of pthread_attr_getstack() returns np_stack_size = 1MB,
- calculated stack_base = stack_limit + stack_size(default: 4MB)
- real stack_base = stack_limit + np_stack_size(1MB)
stack_base is wrong because calculated with 4MB default stack size.

Due to this wrong stack_base, task_dispatcher::can_steal() can be always false.
It causes abnormal long loop in task_dispatcher::receive_or_steal_task().

Correct this problem using np_stack_size instead of stack_size.

Signed-off-by: Bongkyu Kim <[email protected]>
@pavelkumbrasev
Copy link
Contributor

Hi @bongkyu7-kim, was the stack size somehow changed or it is smaller than 4mb on your system by default?
If so we need to look at https://github.com/oneapi-src/oneTBB/blob/50d56cc679b395594c5b6213198bd30d1a1bfe1f/src/tbb/misc.h#L61 first because this constant will be used as default value for all threads creation.
I would also suggest trying global_control with parameter for the stack size it will override the default value of stack size to appropriate one.

@bongkyu7-kim
Copy link
Contributor Author

Hi @pavelkumbrasev,
My system is android. The default stack size of thread is 1mb.
Although can adjust to global_control, but I think it will be difficult to set it to various stack sizes for each system and environment in the library. As suggested in the patch, if stack_base is calculated with np_stack_size obtained through pthread_attr_getstack, it seems that the correct stack_base can be obtained in any environment.

@pavelkumbrasev
Copy link
Contributor

The problem is that it will just override a part of defined behavior. TBB will still create and treat threads like they have stack size of 4 mb.
That's why the best place is to modify the default value in misc.h. Is the stack size known during compilation if yes we can branch it with macros into misc.h.

@bongkyu7-kim
Copy link
Contributor Author

Even if the stack size is 8mb, the current stack_base is miscalculated.

  1. stack size 8MB case
    real stack_base = stack_limit(4MB) + np_stack_size(8MB) = 12MB
    calculated stack_base = stack_limit(4MB) + stack_size(4MB) = 8MB
    -> located below 4MB than real stack_base

  2. stack size 1MB case
    real stack_base = stack_limit(4MB) + np_stack_size(1MB) = 5MB
    calculated stack_base = stack_limit(4MB) + stack_size(4MB) = 8MB
    -> located above 3MB than real stack_base

Why does stack_limit use the return value of pthread_attr_getstack() and stack_size use the defined value on the code?
I think this is a miscalculation.
If you are not going to apply this patch, it would be better to get stack_base based on the current anchor address without pthread_attr_getstack() call like non-linux system.

@pavelkumbrasev
Copy link
Contributor

I understand that the patch is doing the right thing. The only problem is that it will hide the real problem: default stack value is still incorrect for your platform. Moreover, this default value will be used to create worker threads and for all the calculation unless you explicitly use global_control.
I would like to solve the problem but not the side effects. That's why I'm asking if it is possible to get stack size during the oneTBB compilation for your platform so we can propagate the value into library and it will be used for everything.

@bongkyu7-kim
Copy link
Contributor Author

I understood.
As I know, there is no way to know the stack size at compile time because can change by pthread_set_stacksize().
The only way seems to use np_stack_size from pthread_attr_getstack.
If global_config is not explicitly used, how about propagating this np_stack_size value?

@pavelkumbrasev
Copy link
Contributor

If global_config is not explicitly used, how about propagating this np_stack_size value?

As you said it might be affected by pthread_set_stacksize() and then the incorrect value might be propagated.
I think the proper solution to this problem is to explicitly set global_control it will solve the problem for specific architectures that has default stack size different from 4 mb.

@bongkyu7-kim
Copy link
Contributor Author

In fact, my problem is that geekbench benchmark(using the TBB library) score is dropped in our android product due to this wrong stack_base.
I would like to request to apply to geekbench after fixing this TBB bug.
Geekbench also is cross-platform benchmark. It does not seem easy to set stack size for various environments through global_config.
The current status does not match both stack_base and stack_size.
First of all, I would like to correct the stack_base for solving the performance drop problem.

@pavelkumbrasev
Copy link
Contributor

I don't think geekbench needs to apply global_control for all of the platforms. Perhaps only for platforms with unique stack sizes.
Right now it is not a bug but expected behavior moreover I think we can add the runtime assertion that stack size is equal to np_stack_size otherwise it is UB.
@akukanov what do you think?

@bongkyu7-kim
Copy link
Contributor Author

Because it can be changed at runtime and can have various stack sizes for each thread on the same platform,
we can't determine the stack size by platform like "#ifdef android".
As my understand, get_stack_base() can be called not only in the worker of TBB but also in the main thread generated in OS.
So, this is not android only problem. Other linux also can be problem at thread with pthread_set_stacksize(1MB).
Many linux's default ulimit stack size is 8MB. -> This case also stack base is miscalculated and stack size is mismatched with default 4mb.

@bongkyu7-kim
Copy link
Contributor Author

@pavelkumbrasev
As you said, it would be better to set global_control in unique stack size such as 1mb. I will try this.
However, the current get_stack_base is miscalculating the stack_base in other stack sizes such as 8mb.
So, please consider fix it like this PR. (stack_base = stack_limit + np_stack_size)

@pavelkumbrasev
Copy link
Contributor

However, the current get_stack_base is miscalculating the stack_base in other stack sizes such as 8mb. So, please consider fix it like this PR. (stack_base = stack_limit + np_stack_size)

As I said it happens because default value is different from systems default this should be addressed there. We will discuss it and try to come up with a solution.

@bongkyu7-kim
Copy link
Contributor Author

In my opinion, there're two ways. How about these solutions?

  1. update default stack size by pthread_attr_getstacksize()
    -> If global_control::thread_stack_size is not defined, the stack sizes of all threads are same as the starting thread(OS generated).
    The example code is as below.
--- a/src/tbb/global_control.cpp
+++ b/src/tbb/global_control.cpp
@@ -104,8 +104,20 @@ class alignas(max_nfs_size) stack_size_control : public control_storage {
             return hi - lo;
         }();
         return ThreadStackSizeDefault;
+#else
+#if __linux__ && !__bg__
+        pthread_attr_t np_attr_stack;
+        size_t np_stack_size = 0;
+        if (0 == pthread_getattr_np(pthread_self(), &np_attr_stack)) {
+            if (0 == pthread_attr_getstacksize(&np_attr_stack, &np_stack_size)) {
+                __TBB_ASSERT( np_stack_size > 0, "stack size must be positive" );
+            }
+            pthread_attr_destroy(&np_attr_stack);
+        }
+        return np_stack_size > 0 ? np_stack_size : ThreadStackSize;
 #else
         return ThreadStackSize;
+#endif /* __linux__ */
 #endif
     }
     void apply_active(std::size_t new_active) override {
  1. update stack_size when get_stack_base()
    -> If global_control::thread_stack_size is not defined, correct stealing_threshold for the starting thread(OS generated).
    The stack sizes of sub-threads are the default value in misc.h (4mb).
    The example code is as below.
--- a/src/tbb/governor.cpp
+++ b/src/tbb/governor.cpp
@@ -137,7 +137,7 @@ bool governor::does_client_join_workers(const rml::tbb_client &client) {
     3) If the user app strives to conserve the memory by cutting stack size, it
     should do this for TBB workers too (as in the #1).
 */
-static std::uintptr_t get_stack_base(std::size_t stack_size) {
+static std::uintptr_t get_stack_base(std::size_t &stack_size) {
     // Stacks are growing top-down. Highest address is called "stack base",
     // and the lowest is "stack limit".
 #if __TBB_USE_WINAPI
@@ -165,7 +165,8 @@ static std::uintptr_t get_stack_base(std::size_t stack_size) {
 #endif /* __linux__ */
     std::uintptr_t stack_base{};
     if (stack_limit) {
-        stack_base = reinterpret_cast<std::uintptr_t>(stack_limit) + stack_size;
+        stack_base = reinterpret_cast<std::uintptr_t>(stack_limit) + np_stack_size;
+        stack_size = np_stack_size;
     } else {
         // Use an anchor as a base stack address.
         int anchor{};

@bongkyu7-kim
Copy link
Contributor Author

Is there any update? I'm still waiting for this issue fix.

@pavelkumbrasev
Copy link
Contributor

Is there any update? I'm still waiting for this issue fix.

As I said the work around is there with global_control. We will discuss appropriate solution within a team.

@bongkyu7-kim
Copy link
Contributor Author

Thanks for reply. I will wait for your solution.

@pavelkumbrasev
Copy link
Contributor

Hi @bongkyu7-kim, we have discussed possible solutions.
Basically, the solution you have proposed:

update default stack size by pthread_attr_getstacksize() -> If global_control::thread_stack_size is not defined, the stack sizes of all threads are same as the starting thread(OS generated).

Should be good enough to fix the problem. The only difference is instead of checking stack size of a particular thread system value should be checked. As far as I know it can be done using rlimit and specifically RLIMIT_STACK.

What do you think?

@bongkyu7-kim
Copy link
Contributor Author

@pavelkumbrasev,
RLIMIT_STACK is different from pthread stack size. (in my case, RLIMIT_STACK is 8mb, but pthread stack size 1mb)
So, it don't solve stack size mismatch problem.

But, I think that default stack size by RLIMIT_STACK and correct get_stack_base() like this PR patch can be one solution.
Because pthread stack API is non-portable way, just checking stack usage by RLIMIT_STACK.

@pavelkumbrasev
Copy link
Contributor

@bongkyu7-kim, could you please clarify:

in my case, RLIMIT_STACK is 8mb, but pthread stack size 1mb

As I understand the main thread will be created with RLIMIT_STACK size by default. Is the stack somehow changed?

@bongkyu7-kim
Copy link
Contributor Author

@pavelkumbrasev,
main thread(8mb stack) -> create pthread(1mb stack) -> oneTBB API called from this pthread.

@pavelkumbrasev
Copy link
Contributor

@pavelkumbrasev, main thread(8mb stack) -> create pthread(1mb stack) -> oneTBB API called from this pthread.

Should then oneTBB workers incapsulate this stack property? (Should they also have 1mb stack or default 8mb or 4 as for now)

@bongkyu7-kim
Copy link
Contributor Author

@pavelkumbrasev, main thread(8mb stack) -> create pthread(1mb stack) -> oneTBB API called from this pthread.

Should then oneTBB workers incapsulate this stack property? (Should they also have 1mb stack or default 8mb or 4 as for now)

main thread(8mb) -> create pthread(1mb) - oneTBB API called : oneTBB main thread working here -> other oneTBB workers (4mb by misc.h)

@pavelkumbrasev
Copy link
Contributor

main thread(8mb) -> create pthread(1mb) - oneTBB API called : oneTBB main thread working here -> other oneTBB workers (4mb by misc.h)

What I am asking: is it the desired behavior or not? Should oneTBB workers incapsulate initializing thread property and also use 1 mb stack size?

@bongkyu7-kim
Copy link
Contributor Author

main thread(8mb) -> create pthread(1mb) - oneTBB API called : oneTBB main thread working here -> other oneTBB workers (4mb by misc.h)

What I am asking: is it the desired behavior or not? Should oneTBB workers incapsulate initializing thread property and also use 1 mb stack size?

Because oneTBB workers stack size are mixed (1mb/4mb), it may cause other problem.
I think that oneTBB workers incapsulate this stack property is better.

@bongkyu7-kim
Copy link
Contributor Author

@pavelkumbrasev,
This PR have been open for a long time. Now, I'd like to come to a conclusion after discusstion.

@pavelkumbrasev
Copy link
Contributor

@bongkyu7-kim, sorry for the long response.
We agreed that it is make sense to save appropriate stack size into probably thread's TLS.
We also agreed that stack size should not be propagated to worker threads since there is explicit API to control it.

Could you please work on this PR?

@bongkyu7-kim
Copy link
Contributor Author

@pavelkumbrasev,
For confirmation, do you want like the below patch? (my suggested)

--- a/src/tbb/governor.cpp
+++ b/src/tbb/governor.cpp
@@ -137,7 +137,7 @@ bool governor::does_client_join_workers(const rml::tbb_client &client) {
     3) If the user app strives to conserve the memory by cutting stack size, it
     should do this for TBB workers too (as in the #1).
 */
-static std::uintptr_t get_stack_base(std::size_t stack_size) {
+static std::uintptr_t get_stack_base(std::size_t &stack_size) {
     // Stacks are growing top-down. Highest address is called "stack base",
     // and the lowest is "stack limit".
 #if __TBB_USE_WINAPI
@@ -165,7 +165,8 @@ static std::uintptr_t get_stack_base(std::size_t stack_size) {
 #endif /* __linux__ */
     std::uintptr_t stack_base{};
     if (stack_limit) {
-        stack_base = reinterpret_cast<std::uintptr_t>(stack_limit) + stack_size;
+        stack_base = reinterpret_cast<std::uintptr_t>(stack_limit) + np_stack_size;
+        stack_size = np_stack_size;
     } else {
         // Use an anchor as a base stack address.
         int anchor{};

@pavelkumbrasev
Copy link
Contributor

I'm not 100% sure. Perhaps, on external thread initialization it should check if the default stack size is equal to the current thread's stack size and if not it should safe this value into thread_data.
Perhaps, here:
https://github.com/oneapi-src/oneTBB/blob/2d516c871b79d81032f0ca44d76e5072bc62d479/src/tbb/governor.cpp#L222
And later check it during get_stack_base calculation.

@bongkyu7-kim
Copy link
Contributor Author

@pavelkumbrasev,
get_stack_base() called by the first thread only. The other external threads are created with explicit stack size - worker_stack_size() and calculate stealing threshold by another arena::calculate_stealing_threshold() function.
Because stack_size is used only for calculate_stealing_threshold, I think we don't need save stack into thread_data.
So, the above suggested code is enough to fixing the problem. How about this?

@pavelkumbrasev
Copy link
Contributor

governor::init_external_thread() will be called for each external thread that joins TBB. Therefore, stack size calculation and calculate_stealing_threshold() will be computed for each joining thread.
So for external threads instead of calling stack_size = a.my_threading_control->worker_stack_size(); we can check actual thread stack size.

@bongkyu7-kim
Copy link
Contributor Author

With my patch(#1389 (comment)), the actual thread stack size checked inside get_stack_base() and pass to stack_size. So, I think it's enough to fix the problem.

stack_size = a.my_market->worker_stack_size();
std::uintptr_t stack_base = get_stack_base(stack_size);  --> stack_size updated
task_dispatcher& task_disp = td.my_arena_slot->default_task_dispatcher();
td.enter_task_dispatcher(task_disp, calculate_stealing_threshold(stack_base, stack_size));

@pavelkumbrasev
Copy link
Contributor

How will stack_size be updated if it is passed by a copy?

@bongkyu7-kim
Copy link
Contributor Author

I changed to call by reference. It can be updated.

-static std::uintptr_t get_stack_base(std::size_t stack_size) {
+static std::uintptr_t get_stack_base(std::size_t &stack_size) {

  •    stack_size = np_stack_size;
    

@pavelkumbrasev
Copy link
Contributor

pavelkumbrasev commented Aug 22, 2024

I'm not found of such an implicit changes.
Why not to replace:
stack_size = a.my_market->worker_stack_size();
With explicit:
stack_size = get_stack_size();

So this will be applicable for the rest of the OS.

@bongkyu7-kim
Copy link
Contributor Author

Sure, I will change like your suggestion.

@bongkyu7-kim
Copy link
Contributor Author

@pavelkumbrasev,
I opened new PR. Please review this.
#1485

@bongkyu7-kim bongkyu7-kim mentioned this pull request Sep 28, 2024
13 tasks
@bongkyu7-kim
Copy link
Contributor Author

Move to PR #1485. Close this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants