Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix rare termination logic failures that could result in early shutdown #4556

Merged
merged 6 commits into from
Dec 4, 2024

Conversation

dipinhora
Copy link
Contributor

Prior to this commit, there was a very rare edge case in the termination logic that could result in early shutdown resulting in a segfault.

This commit simplifies and reworks the shutdown/termination logic in order to make it more robust with less edge cases.

The logic now:

  • does not un-noisy an actor from the ASIO thread until the relevant ASIO event is destroyed instead of when it is unsubscribed. This is important because the ASIO subsystem still has a reference to the actor and can send a message to it until the ASIO event is destroyed even if it has been unsubscribed
  • always runs the CNF/ACK protocol to all schedulers instead of only the active ones
  • disables scheduler scaling to ensure all schedulers are active for the duration of the termination CNF/ACK protocol to avoid / minimize complexity from schedulers suspending during the termination process
  • ensures the local scheduler tracking of ASIO noisiness is more accurate and robust to messages being received out of order

@ponylang-main ponylang-main added the discuss during sync Should be discussed during an upcoming sync label Dec 1, 2024
@dipinhora
Copy link
Contributor Author

musl builds are unhappy.. the backtrace in the logs isn't helpful...

i'll try and set up a VM to reproduce locally at some point (probably not for a day or two) but not sure what the issue is since everything else is happy..

@dipinhora
Copy link
Contributor Author

i'm able to reproduce the crash..

some info:

  • it happens randomly.. sometimes not at all.. sometimes multiple failures..
  • it happens on main also but..
  • the changes in this PR make it happen more often..
  • it doesn't seem to happen at all if any individual failed binary is run in a loop (with or without lldb)..
  • it doesn't seem to happen if usedebugger=lldb is not used..
  • it happens even if --ponymaxthreads=1 is passed to full-program-runner..
  • it doesn't seem to happen if "--max_parallel=1" is passed to full-program-runner instead of "--max_parallel=nproc --all"

based on the above, my current guess is there's some sort of interaction between the full-program-runner and the binary being run when lldb is involved that sometimes causes the segfault and that this issue has existed prior to this PR and is unrelated to the changes in this PR.. possibly some edge case/race condition in relation to how child processes are spawned/managed by full-program-runner..

@ponylang/core any thoughts on how to proceed?

@SeanTAllen
Copy link
Member

Without any evidence to support my statements, this feels like a memory clobbering. And it also feels like a musl bug.

@SeanTAllen
Copy link
Member

SeanTAllen commented Dec 3, 2024

@dipinhora can you use the 20241203 musl image i just pushed and see if there is any difference? it should have a new musl that fixed "multiple race conditions". Worth a try!

NEVERMIND. LLVM isn't building.

I'm working on getting us able to use that newer musl.

@SeanTAllen
Copy link
Member

@dipinhora I'll let you know when the new builder is ready to go and try with this.

@SeanTAllen
Copy link
Member

@dipinhora you can rebase against main and get the new builder using the latest musl release.

@SeanTAllen SeanTAllen added the changelog - fixed Automatically add "Fixed" CHANGELOG entry on merge label Dec 4, 2024
@ponylang-main
Copy link
Contributor

Hi @dipinhora,

The changelog - fixed label was added to this pull request; all PRs with a changelog label need to have release notes included as part of the PR. If you haven't added release notes already, please do.

Release notes are added by creating a uniquely named file in the .release-notes directory. We suggest you call the file 4556.md to match the number of this pull request.

The basic format of the release notes (using markdown) should be:

## Title

End user description of changes, why it's important,
problems it solves etc.

If a breaking change, make sure to include 1 or more
examples what code would look like prior to this change
and how to update it to work after this change.

Thanks.

@SeanTAllen SeanTAllen changed the title Simplify shutdown/termination logic Fix rare termination logic failures that could result in early shutdown Dec 4, 2024
Prior to this commit, there was a very rare edge case in the
termination logic that could result in early shutdown resulting
in a segfault.

This commit simplifies and reworks the shutdown/termination logic
in order to make it more robust with less edge cases.

The logic now:

* does not un-noisy an actor from the ASIO thread until the
  relevant ASIO event is destroyed instead of when it is
  unsubscribed. This is important because the ASIO subsystem
  still has a reference to the actor and can send a message to it
  until the ASIO event is destroyed even if it has been
  unsubscribed
* always runs the CNF/ACK protocol to all schedulers instead of
  only the active ones
* disables scheduler scaling to ensure all schedulers are active
  for the duration of the termination CNF/ACK protocol to avoid /
  minimize complexity from schedulers suspending during the
  termination process
* ensures the local scheduler tracking of ASIO noisiness is more
  accurate and robust to messages being received out of order
@SeanTAllen
Copy link
Member

@dipinhora am i correct that the new builder with new musl didn't address?

@dipinhora
Copy link
Contributor Author

@dipinhora am i correct that the new builder with new musl didn't address?

no, but it gave more info in the backtrace.. it's a race condition on program startup between pthread_create and pthread_kill when a program almost immediately goes into the shutdown CNF/ACK logic and tries signalling sleeping threads before phread_create has a chance to write the thread_id to scheduler[i].tid...

i'm working on a fix..

@SeanTAllen
Copy link
Member

@dipinhora am i correct that the new builder with new musl didn't address?

no, but it gave more info in the backtrace.. it's a race condition on program startup between pthread_create and pthread_kill when a program almost immediately goes into the shutdown CNF/ACK logic and tries signalling sleeping threads before phread_create has a chance to write the thread_id to scheduler[i].tid...

i'm working on a fix..

Awesome.

@dipinhora
Copy link
Contributor Author

@dipinhora am i correct that the new builder with new musl didn't address?

no, but it gave more info in the backtrace.. it's a race condition on program startup between pthread_create and pthread_kill when a program almost immediately goes into the shutdown CNF/ACK logic and tries signalling sleeping threads before phread_create has a chance to write the thread_id to scheduler[i].tid...
i'm working on a fix..

Awesome.

@SeanTAllen fix pushed.. release notes added.. i believe the musl related issue is addressed now and everything is ready to merge (assuming no PR review related changes)..

@SeanTAllen
Copy link
Member

Awesome work @dipinhora. Thanks.

@SeanTAllen SeanTAllen merged commit 60722ad into ponylang:main Dec 4, 2024
21 checks passed
@ponylang-main ponylang-main removed the discuss during sync Should be discussed during an upcoming sync label Dec 4, 2024
github-actions bot pushed a commit that referenced this pull request Dec 4, 2024
github-actions bot pushed a commit that referenced this pull request Dec 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
changelog - fixed Automatically add "Fixed" CHANGELOG entry on merge
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants