app_chanspy+cel: Release channel iterator before chanspying #114

wdoekes · 2023-05-24T12:03:19Z

(This PR contains 4 commits, where the first three are really just refactoring. Only the last should change behaviour.)

Refactor channel spying so it never holds on to a channel iterator. Instead, we recreate the iterator when needed and skip channels that we've seen already, creating the illusion of using an iterator.
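To illustrate the "recreate and skip" idea, here is a minimal sketch in plain C. It is not the code from this PR: `chan_list`, `seen_set` and `next_unseen` are hypothetical names, and the live channel list is modelled as a plain array of names rather than an Asterisk container.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_SEEN 64
#define NAME_LEN 80

/* Hypothetical stand-in for the live channel list; not an Asterisk type. */
struct chan_list {
    const char **names;
    size_t count;
};

/* Names of channels already offered to the spy during this pass. */
struct seen_set {
    char names[MAX_SEEN][NAME_LEN];
    size_t count;
};

static int seen_contains(const struct seen_set *seen, const char *name)
{
    for (size_t i = 0; i < seen->count; i++) {
        if (strcmp(seen->names[i], name) == 0) {
            return 1;
        }
    }
    return 0;
}

static int seen_add(struct seen_set *seen, const char *name)
{
    assert(name != NULL);
    if (seen->count >= MAX_SEEN) {
        return -1; /* set is full */
    }
    strncpy(seen->names[seen->count], name, NAME_LEN - 1);
    seen->names[seen->count][NAME_LEN - 1] = '\0';
    seen->count++;
    return 0;
}

/*
 * Return the next channel name not yet seen, or NULL when none is left.
 * Every call scans the current list from the start, so no iterator (and
 * no reference to any channel) survives between calls; only names do.
 */
static const char *next_unseen(const struct chan_list *live, struct seen_set *seen)
{
    for (size_t i = 0; i < live->count; i++) {
        if (!seen_contains(seen, live->names[i])) {
            if (seen_add(seen, live->names[i]) != 0) {
                return NULL;
            }
            return live->names[i];
        }
    }
    return NULL;
}
```

Because only names are remembered, a channel that hangs up between two calls simply does not show up in the next scan. The cost is the repeated scan plus the lookup in the seen set.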

This change was needed because the iterator holds a reference to every channel it has not yet visited. That reference keeps the channel from being destroyed. (Which is good, because the iterator needs valid channels to work on.)

But hanging on to a channel reference for longer than a short while conflicts with CEL logging. The CEL hangup logging is triggered by the destruction of the channel. During chanspy activity, a bunch of channels would stay in limbo. Only once the chanspying was done would those channels get their CEL hangup event logged.
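The effect can be reproduced with a toy refcount. This is only a sketch: `fake_chan` and its helpers are hypothetical and do not use Asterisk's real reference-counting API. The "destructor" here just sets a flag where the real code would emit the CEL hangup event.

```c
#include <assert.h>

/* Hypothetical refcounted object modelling a channel. */
struct fake_chan {
    int refs;          /* outstanding references */
    int hangup_logged; /* set when the object is "destroyed" */
};

/* Take an extra reference, as a long-lived iterator would. */
static void fake_chan_ref(struct fake_chan *c)
{
    c->refs++;
}

/* Drop one reference; the hangup event fires only when the last one goes. */
static void fake_chan_unref(struct fake_chan *c)
{
    assert(c->refs > 0);
    if (--c->refs == 0) {
        c->hangup_logged = 1; /* stands in for emitting the CEL hangup event */
    }
}
```

If the call itself holds one reference and the iterator a second, the hangup drops the count to one and nothing is logged until the iterator lets go as well.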

The fix here is to hang on to channel iterators for only a short while. An alternative fix that makes CEL hangup logging independent of channel destruction was deemed more invasive.

This patch makes chanspy channel selection slightly more resource intensive. But that's a small price to pay for correct CEL hangup logging.

Fixes: #68

wdoekes · 2023-05-24T14:16:01Z

cherry-pick-to: 22
cherry-pick-to: 21
cherry-pick-to: 20

wdoekes · 2023-05-25T07:21:24Z

As discussed on IRC, I want the commits applied without squashing (they all compile and run fine if merged in sequence);
AsteriskGateTestMatrix (18, pjs2) failed, but that looks unrelated.

I did not add any UserNote: stuff yet. I could add something like:

CEL hangup logging is no longer delayed by concurrent ChanSpy activity.

As for the fix itself (the last commit): I'm open to alternative solutions. I mention memory hoarding. This could be fixed by removing items from the vector that are not in the iterator. But then we'd want to switch to a linkedlist/hashmap instead.
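For reference, pruning the list of seen names against the current channel list could look roughly like the sketch below. The names `is_live` and `prune_seen` are hypothetical, and both lists are plain arrays of strings rather than the vector used in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return 1 if name occurs in the live list, 0 otherwise. */
static int is_live(const char *name, const char *const *live, size_t n_live)
{
    assert(name != NULL);
    for (size_t i = 0; i < n_live; i++) {
        if (strcmp(name, live[i]) == 0) {
            return 1;
        }
    }
    return 0;
}

/*
 * Compact `seen` in place so it keeps only names that are still live.
 * Returns the new number of entries; relative order is preserved.
 */
static size_t prune_seen(const char **seen, size_t n_seen,
                         const char *const *live, size_t n_live)
{
    size_t kept = 0;
    for (size_t i = 0; i < n_seen; i++) {
        if (is_live(seen[i], live, n_live)) {
            seen[kept++] = seen[i];
        }
    }
    return kept;
}
```

With flat arrays this costs one linear scan of the live list per seen entry, which is the reason a hashmap would be the better fit if pruning ran often.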

jcolp · 2023-05-25T15:52:40Z

So I was pondering this some. ChanSpy is from a time when the only way to know about a channel was to get the channel, or list of channels, and go through them. Since then we've added channel snapshots and have a cache of them accessible using ast_channel_cache_by_name() which doesn't require accessing the channels or channel list at all. I think this should be examined as an alternative instead, so that holding references to channels is kept to a minimum - specifically when a channel is being spied on.

Anyone else have any thoughts on that idea?

wdoekes · 2023-06-30T15:19:54Z

@jcolp: Do you have a quick example of how to use those snapshots? Some bigger refactoring might be worth it.

We've been running this on prod now, but we did run into 2 similar deadlocks. I cannot prove that this patch caused it, but cannot disprove it either.

The two deadlocks occurred with two threads in:
got_optimized_out -> try_swap_optimize_out -> bridge_do_move -> bridge_complete_join -> bridge_channel_complete_join -> simple_bridge_join -> ast_channel_request_stream_topology_change -> simple_bridge_join -> ast_channel_request_stream_topology_change -> unreal_colp_stream_topology_request_change -> ast_unreal_lock_all
According to the core dumps we collected, they were stuck at ast_channel_lock_both(p->chan, p->owner). Each thread was supposedly already the owner of one of the two locks. Either the core dump is misleading, or that shouldn't happen. (Especially for the thread that was holding p->owner but not p->chan, because that causes a locking inversion.)
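For readers unfamiliar with the pattern, here is a generic pthread sketch of taking two locks with back-off. It is not copied from Asterisk, and whether it matches ast_channel_lock_both in detail is an assumption. `lock_both_backoff` and its `max_tries` bound are hypothetical, added so the sketch cannot spin forever.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>

/*
 * Take `first`, then try `second`. If `second` is busy, release `first`,
 * yield, and start over. Returns 0 with both locks held, or -1 with
 * neither held once max_tries attempts have failed.
 */
static int lock_both_backoff(pthread_mutex_t *first, pthread_mutex_t *second,
                             int max_tries)
{
    assert(first != second);
    for (int attempt = 0; attempt < max_tries; attempt++) {
        pthread_mutex_lock(first);
        if (pthread_mutex_trylock(second) == 0) {
            return 0; /* both locks are now held */
        }
        /* Back off so the holder of `second` can make progress. */
        pthread_mutex_unlock(first);
        sched_yield();
    }
    return -1;
}
```

Backing off only releases the lock taken inside the helper. If a caller already holds one of the two locks from an outer scope, that hold stays in place while the helper retries, so two such callers can still block each other.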

If this changeset is to blame, I would put my money on some ast_autochan behaviour (swapping (locked?) channels?), but I don't really see how I changed anything in this respect... especially since there was only one ChanSpy invocation in one of these Asterisk runs, and it was 13 hours prior to the deadlock.

jcolp · 2023-06-30T15:31:15Z

I'm not sure what example would be applicable for this exactly, but "core show channels" uses the channel snapshot cache and snapshots. It doesn't pull from the channels container.

sangoma-oss-cla · 2024-02-22T18:49:03Z

All committers have signed the CLA.

app_chanspy: Refactor arguments to allow splitting large functions

f7da07b

Because of upcoming planned changes/refactoring to app_chanspy, it is convenient to pass some of the many arguments as a struct. This changeset adds a channel_spy_context struct to pass around. Related: asterisk#68

github-actions bot added the testing-in-progress label May 24, 2023

asterisk-org-access-app bot requested a review from a team May 24, 2023 12:13

asterisk-org-access-app bot added the test-checks-passed label May 24, 2023

app_chanspy: Refactor common_exec, moving channel iteration into function

62e7404

This moves the guts of common_exec into channel_spy_consume_iterator. This makes refactoring/changing the code easier because there are fewer function local variables to consider. Related: asterisk#68

github-actions bot added test-gates-failed and removed test-checks-passed testing-in-progress labels May 24, 2023

asterisk-org-access-app bot added the test-checks-passed label May 24, 2023

wdoekes added 2 commits May 24, 2023 14:55

app_chanspy: Skip the rest of the group iteration after first match

9ef7767

github-actions bot added testing-in-progress and removed test-checks-passed test-gates-failed labels May 24, 2023

asterisk-org-access-app bot added the test-checks-passed label May 24, 2023

wdoekes changed the title ~~app_chanspy: Refactor arguments to allow splitting large functions~~ app_chanspy+cel: Release channel iterator before chanspying May 24, 2023

wdoekes marked this pull request as draft May 24, 2023 14:00

github-actions bot added test-gates-failed and removed testing-in-progress labels May 24, 2023

wdoekes marked this pull request as ready for review May 24, 2023 14:15

gtjoseph added the cherry-pick-test Trigger dry run of cherry-picks label May 24, 2023

github-actions bot added cherry-pick-testing-in-progress Cherry-Pick tests in progress cherry-pick-checks-failed Cherry-Pick checks failed and removed cherry-pick-test Trigger dry run of cherry-picks cherry-pick-testing-in-progress Cherry-Pick tests in progress labels May 24, 2023

gtjoseph added the cherry-pick-test Trigger dry run of cherry-picks label May 24, 2023

github-actions bot added cherry-pick-testing-in-progress Cherry-Pick tests in progress and removed cherry-pick-test Trigger dry run of cherry-picks cherry-pick-checks-failed Cherry-Pick checks failed labels May 24, 2023

github-actions bot added cherry-pick-checks-passed Cherry-Pick checks passed cherry-pick-gates-failed Cherry-Pick gates failed and removed cherry-pick-testing-in-progress Cherry-Pick tests in progress labels May 24, 2023

gtjoseph force-pushed the master branch 4 times, most recently from 1b894e6 to 32fd0fb Compare June 27, 2023 16:21

gtjoseph force-pushed the master branch 7 times, most recently from b15287c to 1862a36 Compare September 5, 2023 19:35
