Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix network stack stability issues #10080

Merged
merged 11 commits into from
Nov 14, 2024
Merged

Fix network stack stability issues #10080

merged 11 commits into from
Nov 14, 2024

Conversation

lippserd and others added 8 commits June 10, 2024 13:19
While analyzing a possible memory leak, we encountered several coroutine
exception messages, which unfortunately do not provide any information
about what exactly went wrong, as exception diagnostics were previously
only logged at the notice level.
Although there is locking involved here, it shoudln't take too long for
the thread to actually acquire it, since there aren't that many threads
dealing with endpoint clients concurrently. It's just wasting pointless
time trying to obtain a CPU slot.
Closing and re-opening that very same log file shouldn't reset the
counter, otherwise some log files may exceed the max limit per file as
their offset indicator is reset each time they are re-opened.
…ging

On incoming connection timeout we log the remote endpoint which isn't
available if it was already disconnected - an exception is thrown.  Get it
as long as we're still connected not to lose it, nor to get an exception.
When a HTTP connection dies prematurely while the response is sent,
`http::async_write()` sets the error code to something like broken pipe for
example. When calling `async_flush()` afterwards, it sometimes happens that
this never returns. This results in a resource leak as the coroutine isn't
cleaned up. This commit makes the individual functions throw exceptions instead
of silently ignoring the errors, resulting in the function terminating early
and also resulting in an error being logged as well.
@cla-bot cla-bot bot added the cla/signed label Jun 10, 2024
@icinga-probot icinga-probot bot added this to the 2.14.3 milestone Jun 10, 2024
@icinga-probot icinga-probot bot added area/api REST API area/distributed Distributed monitoring (master, satellites, clients) bug Something isn't working core/quality Improve code, libraries, algorithms, inline docs ref/IP labels Jun 10, 2024
@Al2Klimov Al2Klimov marked this pull request as draft June 10, 2024 12:38
@Al2Klimov Al2Klimov self-assigned this Jun 21, 2024
@Al2Klimov Al2Klimov marked this pull request as ready for review November 5, 2024 10:35
Copy link
Member

@yhabteab yhabteab left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please rebase!

When the `Desconnect()` method is called, clients are not disconnected
immediately. Instead, a new coroutine is spawned using the same strand
as the other coroutines. This coroutine calls `async_shutdown` on the
TCP socket, which might be blocking. However, in order not to block
indefintely, the `Timeout` class cancels all operations on the socket
after `10` seconds. Though, the timeout does not trigger the handler
immediately; it creates spawns another coroutine using the same strand
as in the `JsonRpcConnection` class. This can cause unexpected delays if
e.g. `HandleIncomingMessages` gets resumed before the coroutine from the
timeout class. Apart from that, the coroutine for writing messages uses
the same condition, making the two symmetrical.
@yhabteab yhabteab merged commit d5cd5af into support/2.14 Nov 14, 2024
27 checks passed
@yhabteab yhabteab deleted the net-stack-2.14.3 branch November 14, 2024 10:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/api REST API area/distributed Distributed monitoring (master, satellites, clients) bug Something isn't working cla/signed core/quality Improve code, libraries, algorithms, inline docs ref/IP
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants