Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

app_chanspy+cel: Release channel iterator before chanspying #114

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

wdoekes
Copy link
Contributor

@wdoekes wdoekes commented May 24, 2023

(This PR contains 4 commits, where the first three are really just refactoring. Only the last should change behaviour.)

Refactor channel spying so it never holds on to a channel iterator. Instead, we recreate the iterator when needed and skip channels that we've seen already, creating the illusion of using an iterator.

This change was needed because the iterator caused the yet-unseen channels in the iterator to be referenced by the iterator. This reference ensured that the channel does not get destroyed. (Which is good, because the iterator needs valid channels to work on.)

But, hanging on to a channel reference for longer than a short while conflicts with CEL logging. The CEL hangup logging is activated by the destruction of the channel. During chanspy activity, a bunch of channels would stay in limbo. First when the chanspying was done would those channels get their CEL hangup event logged.

The fix here is to hang on to channel iterators for only a short while. An alternative fix that makes CEL hangup logging independent of channel destruction was deemed more invasive.

This patch makes chanspy channel selection slightly more resource intensive. But that's a small price to pay for correct CEL hangup logging.

Fixes: #68

Because of upcoming planned changes/refactoring to app_chanspy, it is
convenient to pass some of the many arguments as a struct. This
changeset adds a channel_spy_context struct to pass around.

Related: asterisk#68
…tion

This moves the guts of common_exec into channel_spy_consume_iterator.
This makes refactoring/changing the code easier because there are fewer
function local variables to consider.

Related: asterisk#68
Refactor channel spying so it never holds on to a channel iterator.
Instead, we recreate the iterator when needed and skip channels that
we've seen already, creating the illusion of using an iterator.

This change was needed because the iterator caused the yet-unseen
channels in the iterator to be referenced by the iterator. This
reference ensured that the channel does not get destroyed. (Which is
good, because the iterator needs valid channels to work on.)

But, hanging on to a channel reference for longer than a short while
conflicts with CEL logging. The CEL hangup logging is activated by the
destruction of the channel. During chanspy activity, a bunch of channels
would stay in limbo. First when the chanspying was done would those
channels get their CEL hangup event logged.

The fix here is to hang on to channel iterators for only a short while.
An alternative fix that makes CEL hangup logging independent of channel
destruction was deemed more invasive.

This patch makes chanspy channel selection slightly more resource
intensive. But that's a small price to pay for correct CEL hangup
logging.

Fixes: asterisk#68
@wdoekes wdoekes changed the title app_chanspy: Refactor arguments to allow splitting large functions app_chanspy+cel: Release channel iterator before chanspying May 24, 2023
@wdoekes wdoekes marked this pull request as draft May 24, 2023 14:00
@wdoekes wdoekes marked this pull request as ready for review May 24, 2023 14:15
@wdoekes
Copy link
Contributor Author

wdoekes commented May 24, 2023

cherry-pick-to: 22
cherry-pick-to: 21
cherry-pick-to: 20

@gtjoseph gtjoseph added the cherry-pick-test Trigger dry run of cherry-picks label May 24, 2023
@github-actions github-actions bot added cherry-pick-testing-in-progress Cherry-Pick tests in progress cherry-pick-checks-failed Cherry-Pick checks failed and removed cherry-pick-test Trigger dry run of cherry-picks cherry-pick-testing-in-progress Cherry-Pick tests in progress labels May 24, 2023
@gtjoseph gtjoseph added the cherry-pick-test Trigger dry run of cherry-picks label May 24, 2023
@github-actions github-actions bot added cherry-pick-testing-in-progress Cherry-Pick tests in progress and removed cherry-pick-test Trigger dry run of cherry-picks cherry-pick-checks-failed Cherry-Pick checks failed labels May 24, 2023
@github-actions github-actions bot added cherry-pick-checks-passed Cherry-Pick checks passed cherry-pick-gates-failed Cherry-Pick gates failed and removed cherry-pick-testing-in-progress Cherry-Pick tests in progress labels May 24, 2023
@wdoekes
Copy link
Contributor Author

wdoekes commented May 25, 2023

  • As discussed on IRC, I want the commits applied without squashing (they all compile and run fine if merged in sequence);
  • AsteriskGateTestMatrix (18, pjs2) failed, but that looks unrelated.
  • I did not add any UserNote: stuff yet. I could add something like:
    CEL hangup logging is not longer delayed by concurrent ChanSpy activity.
    
  • As for the fix itself (the last commit): I'm open to alternative solutions. I mention memory hoarding. This could be fixed by removing items from the vector that are not in the iterator. But then we'd want to switch to a linkedlist/hashmap instead.

@jcolp
Copy link
Member

jcolp commented May 25, 2023

So I was pondering this some. ChanSpy is from a time when the only way to know about a channel was to get the channel, or list of channels, and go through them. Since then we've added channel snapshots and have a cache of them accessible using ast_channel_cache_by_name() which doesn't require accessing the channels or channel list at all. I think this should be examined as an alternative instead, so that holding references to channels is kept to a minimum - specifically when a channel is being spied on.

Anyone else have any thoughts on that idea?

@gtjoseph gtjoseph force-pushed the master branch 4 times, most recently from 1b894e6 to 32fd0fb Compare June 27, 2023 16:21
@wdoekes
Copy link
Contributor Author

wdoekes commented Jun 30, 2023

@jcolp: Do you have a quick example of how to use those snapshots? Some bigger refactoring might be worth it.

We've been running this on prod now, but we did run into 2 similar deadlocks. I cannot prove that this patch caused it, but cannot disprove it either.

The two deadlocks occurred with two threads in:
got_optimized_out -> try_swap_optimize_out -> bridge_do_move -> bridge_complete_join -> bridge_channel_complete_join -> simple_bridge_join -> ast_channel_request_stream_topology_change -> simple_bridge_join -> ast_channel_request_stream_topology_change -> unreal_colp_stream_topology_request_change -> ast_unreal_lock_all
According to the core dumps we collected, they were stuck at ast_channel_lock_both(p->chan, p->owner). Both threads were supposedly already owner of one of the locks. Either the core dump misinformed, or that shouldn't happen. (Especially the thread that was holding on to p->owner but not p->chan, because it causes a locking inversion.)

If this changeset is to blame, I would put my money on some ast_autochan behaviour (swapping (locked?) channels?), but I don't really see how I changed anything in this respect... especially since there was only one ChanSpy invocation in one of these Asterisk runs, and it was 13 hours prior to the deadlock.

@jcolp
Copy link
Member

jcolp commented Jun 30, 2023

I'm not sure what example would be applicable for this exactly, but "core show channels" uses the channel snapshot cache and snapshots. It doesn't pull from the channels container.

@gtjoseph gtjoseph force-pushed the master branch 7 times, most recently from b15287c to 1862a36 Compare September 5, 2023 19:35
Copy link

sangoma-oss-cla bot commented Feb 22, 2024

CLA assistant check
All committers have signed the CLA.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[bug]: Long running channel iterator (in ChanSpy) blocks CEL HANGUP events of unrelated channels
3 participants