Add a dedicated method for disconnecting TLS connections #10005

julianbrost · 2024-02-16T16:40:28Z

Properly closing a TLS connection involves sending some shutdown messages so that both ends know that the connection wasn't truncated maliciously. Exchanging those messages can stall for a long time if the underlying TCP connection is broken. The HTTP connection handling was missing any kind of timeout for the TLS shutdown so that dead connections could hang around for a long time.

This PR introduces two new methods on AsioTlsStream: ForceDisconnect(), which just wraps the call for closing the TCP connection, and GracefulShutdown(), which performs the TLS shutdown with a timeout, similar to how it was done in JsonRpcConnection::Disconnect() before.

As the lambda passed to Timeout has to keep the connection object alive, AsioTlsStream is changed to inherit from SharedObject, adding reference counting to it directly. Previously, it was already created as Shared<AsioTlsStream> everywhere. Thus, a good part of the first commit is changing that type across multiple files.
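The keep-alive idea can be sketched without Asio. This is only an illustration under assumptions, not the PR's code: `Stream`, `MakeTimeoutHandler()` and `HandlerKeepsStreamAlive()` are hypothetical names, and `std::enable_shared_from_this` stands in for inheriting from SharedObject.

```cpp
#include <functional>
#include <memory>

// Hypothetical stand-in for AsioTlsStream. std::enable_shared_from_this plays
// the role that inheriting from SharedObject plays in the PR (assumption).
class Stream : public std::enable_shared_from_this<Stream> {
public:
	explicit Stream(bool& destroyed) : m_Destroyed(destroyed) {}
	~Stream() { m_Destroyed = true; }

	void ForceDisconnect() { m_Open = false; }
	bool IsOpen() const { return m_Open; }

	// Returns a callback that owns a reference to this object, so the object
	// cannot be destroyed while the callback is still pending.
	std::function<void()> MakeTimeoutHandler() {
		return [keepAlive = shared_from_this()]() { keepAlive->ForceDisconnect(); };
	}

private:
	bool& m_Destroyed;
	bool m_Open = true;
};

// Drops every external reference while the handler is pending, then runs it.
bool HandlerKeepsStreamAlive() {
	bool destroyed = false;
	std::function<void()> handler;

	{
		auto stream = std::make_shared<Stream>(destroyed);
		handler = stream->MakeTimeoutHandler();
	} // the last external reference is gone here

	bool aliveWhilePending = !destroyed;

	handler();         // safe: the closure still owns the object
	handler = nullptr; // releasing the closure releases the last reference

	return aliveWhilePending && destroyed;
}
```

With reference counting built into the class, any method can hand out such an owning reference to itself, which is what the timeout lambda needs.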

fixes #9986

Edit (@Al2Klimov)

closes #10216
closes #10219
closes #10220

julianbrost · 2024-11-07T13:19:27Z

Open questions:

Do we want to try to call ForceDisconnect() directly in case a connection is shut down due to a timeout like "no messages received" on JSON-RPC connections?

By now, I'm pretty sure the answer is yes.

Evidence: Take two connected Icinga 2 nodes and break individual connections by dropping the packets specific to that connection in a firewall. Both nodes will detect that no messages were received and reconnect. However, the old connection remains in an established state:

root@satellite-b-2:/# netstat -tpn | grep 172.18.0.33
tcp6       0      0 172.18.0.31:5665        172.18.0.33:49584       ESTABLISHED 60/icinga2          
root@satellite-b-2:/# iptables -A INPUT -p tcp -s 172.18.0.33 --sport 49584 -j DROP
root@satellite-b-2:/# netstat -tpn | grep 172.18.0.33
tcp6       0      0 172.18.0.31:5665        172.18.0.33:49098       ESTABLISHED 60/icinga2          
tcp6       0  82558 172.18.0.31:5665        172.18.0.33:49584       ESTABLISHED 60/icinga2          
root@satellite-b-2:/# iptables -A INPUT -p tcp -s 172.18.0.33 --sport 49098 -j DROP
root@satellite-b-2:/# netstat -tpn | grep 172.18.0.33
tcp6       0  94323 172.18.0.31:5665        172.18.0.33:49098       ESTABLISHED 60/icinga2          
tcp6       0  82558 172.18.0.31:5665        172.18.0.33:49584       ESTABLISHED 60/icinga2          
tcp6       0      0 172.18.0.31:5665        172.18.0.33:39946       ESTABLISHED 60/icinga2

That implies that there's probably a resource leak in that scenario (until the kernel decides that the connection is actually dead and returns an error for the socket operations).

Unverified theory of what might happen: JsonRpcConnection::Disconnect() could block here:

icinga2/lib/remote/jsonrpcconnection.cpp

Line 226 in 9a8620d

m_WriterDone.Wait(yc);

This waits for JsonRpcConnection::WriteOutgoingMessages() to complete, which might hang indefinitely in these send operations:

icinga2/lib/remote/jsonrpcconnection.cpp

Lines 112 to 120 in 9a8620d

    
	for (auto& message : queue) {
		size_t bytesSent = JsonRpc::SendRawMessage(m_Stream, message, yc);
		if (m_Endpoint) {
			m_Endpoint->AddMessageSent(bytesSent);
		}
	}
	m_Stream->async_flush(yc);

Al2Klimov · 2024-11-07T14:08:56Z

Thing to consider

JsonRpcConnection: Don't drop client from cache prematurely #10210 (comment)
- TL;DR: just close dead connections, probably via destructor (shouldn't need much code)

julianbrost · 2024-11-07T16:34:16Z

I resolved conflicts and started implementing JsonRpcConnection::ForceDisconnect(), still far from finished, but if anyone is eager to have a look, feel free.

All usages of `AsioTlsStream` were already using `Shared<AsioTlsStream>` to keep a reference-counted instance. This commit moves the reference counting to `AsioTlsStream` itself by inheriting from `SharedObject`. This allows implementing methods that make use of the fact that these objects are reference-counted. The changes outside of `lib/base/tlsstream.hpp` merely replace `Shared<AsioTlsStream>::Ptr` with `AsioTlsStream::Ptr` everywhere.

julianbrost · 2024-11-13T16:47:16Z

> I resolved conflicts and started implementing JsonRpcConnection::ForceDisconnect(), still far from finished

While continuing that work, I figured that this might become a bigger rework of JsonRpcConnection than I anticipated. So I decided that this is better done in a separate PR that I'll create tomorrow.

This PR should already be enough of an improvement on its own; after all, it even fixes a problem in the HTTP connection handling.

Open questions:

I'm not yet sure about the last two commits (d6953cc, e90acc5): they basically replace two calls of async_shutdown() with calls to the new GracefulShutdown(). In both instances, those calls are already guarded by higher-level timeouts (apilistener.cpp, apilistener.cpp, ifwapichecktask.cpp). The idea behind replacing them was to have just a single, properly guarded call to async_shutdown() in our code base.

In that regard, I figured that adding comments to these calls explaining why they are fine is good enough, especially compared to what was needed just to add a redundant timeout and spawn a pointless coroutine (e90acc5).

I'm leaving the PR in a draft state for the moment because I still want to answer a few detail questions regarding the two new disconnect methods: I removed a cancel() because I didn't see a good reason for it to exist, and I wonder if shutdown() is even necessary after a successful async_shutdown(). Once I'm sure about all this, some doc comments explaining the exact behavior certainly won't hurt.

julianbrost · 2024-11-14T17:11:21Z

> I removed a cancel() because I didn't see a good reason for it to exist

I still couldn't find a good reason to call lowest_layer().cancel() before next_layer().async_shutdown(), which is a bit of a problem if I want to add comments explaining what the code does. Like what would we want to cancel on the lowest, i.e. TCP layer? And if something was actually cancelled, how much sense would it make to continue using the TCP connection? Would it be possible to actually cancel something in the middle of writing a TLS record and what would continuing with a new TLS record during the shutdown do?

> I wonder if shutdown() is even necessary after a successful async_shutdown()

That, on the other hand, turned out to be necessary: as confirmed using strace, without calling lowest_layer().shutdown(), there would be no shutdown() syscall.
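For reference, this is roughly what the explicit syscall does at the socket level. It is a minimal POSIX sketch under assumptions: a Unix socket pair stands in for the TCP connection, no TLS is involved, and `PeerSeesEofAfterShutdown()` is a hypothetical helper name.

```cpp
#include <sys/socket.h>
#include <unistd.h>

// Returns true if the peer observes end-of-stream after an explicit shutdown().
bool PeerSeesEofAfterShutdown() {
	int fds[2];

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) != 0) {
		return false;
	}

	// The explicit syscall, comparable to lowest_layer().shutdown(shutdown_both).
	bool ok = shutdown(fds[0], SHUT_RDWR) == 0;

	// read() returning 0 means the peer saw an orderly end of the stream.
	char buf;
	ok = ok && read(fds[1], &buf, 1) == 0;

	close(fds[0]);
	close(fds[1]);

	return ok;
}
```

Running such a program under strace should show the shutdown() call as a separate syscall before the close() calls.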

yhabteab · 2024-11-15T09:12:26Z

lib/base/tlsstream.cpp

+	Timeout::Ptr shutdownTimeout(new Timeout(strand.context(), strand, boost::posix_time::seconds(10),
+		[this, keepAlive = AsioTlsStream::Ptr(this)](boost::asio::yield_context yc) {
+			// Forcefully terminate the connection if async_shutdown() blocked more than 10 seconds.
+			ForceDisconnect();


Why not use lowest_layer().cancel() here as before? Shouldn't this call force async_shutdown to return? Right now, if you force disconnect here, which basically means you literally destroy the socket file descriptor, the call to shutdown() on the underlying TCP layer becomes pointless since the socket no longer exists at that point. I know we suppress the errors, but shutdown() would return a bad file descriptor error in this case, which I don't think is right.

julianbrost · 2024-11-18

My basic idea here was that ForceDisconnect() should be safe to call at any time, including when this timeout happens. Hence the call here: it does exactly what is needed.

> Why not use lowest_layer().cancel() here as before? Shouldn't this call force async_shutdown to return?

For two reasons. First, the close() used in ForceDisconnect() cancels these operations as well (see docs):

> Any asynchronous send, receive or connect operations will be cancelled immediately, and will complete with the boost::asio::error::operation_aborted error.

And second, if the socket is closed, it prevents accidentally starting new operations.

> Right now, if you force disconnect here, which basically means you literally destroy the socket file descriptor, the call to shutdown() on the underlying TCP layer becomes pointless since the socket no longer exists at that point. I know we suppress the errors, but shutdown() would return a bad file descriptor error in this case, which I don't think is right.

Yes, that error can be returned. However, it should come from Asio, not from the underlying syscall, since Asio internally checks the socket state before issuing the syscall:

https://github.com/boostorg/asio/blob/boost-1.86.0/include/boost/asio/detail/impl/socket_ops.ipp#L521-L525

Adding a few calls to is_open() should not hurt though, that would hopefully make this a bit more obvious.
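A toy model of that check, to make the behavior concrete. This is not Asio's implementation: `ToySocket`, `invalidSocket` and the syscall counter are hypothetical and only mirror the idea that close() invalidates the stored descriptor and that later operations fail before any syscall is issued.

```cpp
#include <system_error>

// Hypothetical sentinel for "no descriptor"; the discussion mentions -1.
constexpr int invalidSocket = -1;

// Toy model of a socket wrapper that validates its descriptor first.
class ToySocket {
public:
	explicit ToySocket(int fd) : m_Fd(fd) {}

	bool IsOpen() const { return m_Fd != invalidSocket; }

	// Closing invalidates the stored descriptor.
	void Close() { m_Fd = invalidSocket; }

	std::error_code Shutdown() {
		if (!IsOpen()) {
			// The error is produced by the wrapper, no syscall is made.
			return std::make_error_code(std::errc::bad_file_descriptor);
		}

		++m_SyscallCount; // stands in for the real shutdown() syscall
		return {};
	}

	int SyscallCount() const { return m_SyscallCount; }

private:
	int m_Fd;
	int m_SyscallCount = 0;
};
```

In this model, calling Shutdown() after Close() reports a bad file descriptor while the syscall counter stays at zero.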

yhabteab · 2024-11-18

> And second, if the socket is closed, it prevents accidentally starting new operations.

Even then, shouldn't the last ForceDisconnect() call be enough?

> Adding a few calls to is_open() should not hurt though, that would hopefully make this a bit more obvious.

Instead of adding such conditions, I would simply use cancel() here as before, as you are going to destroy the socket at the end of the coroutine anyway.

julianbrost · 2024-11-18

Ah, the discussion you had in #10005 (comment) is also relevant here. The check in the code I linked works because close() sets the internal file descriptor to -1 (which is the value of invalid_socket).

Al2Klimov · 2024-11-15T09:37:42Z

lib/remote/httpserverconnection.cpp

-
-			m_Stream->next_layer().async_shutdown(yc[ec]);
-
-			m_Stream->lowest_layer().shutdown(m_Stream->lowest_layer().shutdown_both, ec);


So ec isn't needed anymore, right?

Al2Klimov · 2024-11-15T09:40:48Z

lib/base/tlsstream.hpp

+	}
+
+	void ForceDisconnect();
+	void GracefulDisconnect(boost::asio::io_context::strand strand, boost::asio::yield_context yc);


Suggested change

-	void GracefulDisconnect(boost::asio::io_context::strand strand, boost::asio::yield_context yc);
+	void GracefulDisconnect(boost::asio::io_context::strand& strand, boost::asio::yield_context yc);

yhabteab · 2024-11-15

The copy constructor of strand says this :):

Meaning, it's just like a shared pointer. So, your suggestion is not wrong, but neither is the current implementation.

yhabteab · 2024-11-15

Actually, using a reference here wouldn't work because that method spawns a coroutine, and nothing keeps the m_IoStrand object of the Http or Rpc class alive until that coroutine finishes.
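The by-value argument can be illustrated with a small model. Everything here is hypothetical (`Handle`, `SpawnLater()`, `TaskOutlivesCaller()`); the only assumption taken from the discussion is that copies of a strand behave like copies of a shared pointer.

```cpp
#include <functional>
#include <memory>

// Hypothetical handle type: copies share one implementation object.
class Handle {
public:
	Handle() : m_Impl(std::make_shared<int>(0)) {}

	void Post() const { ++*m_Impl; }
	int Posted() const { return *m_Impl; }

private:
	std::shared_ptr<int> m_Impl;
};

// Takes the handle by value, so the deferred task owns its own copy.
std::function<void()> SpawnLater(Handle handle) {
	return [handle]() { handle.Post(); };
}

bool TaskOutlivesCaller() {
	Handle observer;
	std::function<void()> task;

	{
		Handle local = observer; // the copy shares the implementation
		task = SpawnLater(local);
	} // `local` is destroyed here, as a member would be with its owner

	task(); // still valid: the task holds its own copy

	return observer.Posted() == 1;
}
```

Had SpawnLater() captured a reference to `local` instead, the task would have used a destroyed object once it ran.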

lib/base/tlsstream.cpp

Calling `AsioTlsStream::async_shutdown()` performs a TLS shutdown which exchanges messages (that's why it takes a `yield_context`) and thus has the potential to block the coroutine. Therefore, it should be protected with a timeout. As `async_shutdown()` doesn't simply take a timeout, this has to be implemented using a timer. So far, these timers are scattered throughout the codebase, with some places missing them entirely. This commit adds helper functions to properly shut down a TLS connection with a single function call.
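The control flow of such a helper can be sketched without Asio. This is a simplified model, not the actual implementation: `ToyStream` and the `tryShutdown` callback are hypothetical, and a polling loop with a deadline replaces the coroutine plus timer that the real code would use.

```cpp
#include <chrono>
#include <functional>

// Simplified model of a stream with a forced and a graceful disconnect.
struct ToyStream {
	bool open = true;

	void ForceDisconnect() { open = false; }

	// tryShutdown is a hypothetical non-blocking step that returns true once
	// the shutdown exchange with the peer has completed.
	// Returns true if the shutdown completed before the timeout.
	bool GracefulDisconnect(const std::function<bool()>& tryShutdown, std::chrono::milliseconds timeout) {
		auto deadline = std::chrono::steady_clock::now() + timeout;
		bool graceful = false;

		// Real code would suspend the coroutine instead of polling.
		do {
			if (tryShutdown()) {
				graceful = true;
				break;
			}
		} while (std::chrono::steady_clock::now() < deadline);

		// The connection is closed in both cases, so it never lingers.
		ForceDisconnect();

		return graceful;
	}
};
```

The point of the single helper is that callers get the bounded wait and the final close together, instead of each call site wiring up its own timer.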

This new helper function allows deduplicating the timeout handling for `async_shutdown()`.

This new helper function has proper timeout handling which was missing here.

The reason for introducing AsioTlsStream::GracefulDisconnect() was to handle the TLS shutdown properly with a timeout, since it exchanges messages and can block. However, the implementation of this timeout involves spawning coroutines, which are redundant in some cases. This commit adds comments to the remaining calls of async_shutdown() stating why calling it is safe in these places.

cla-bot bot added the cla/signed label Feb 16, 2024

icinga-probot bot added the area/api, bug, and ref/IP labels Feb 16, 2024

julianbrost mentioned this pull request Nov 6, 2024

JsonRpcConnection: Don't drop client from cache prematurely #10210

Open

Al2Klimov mentioned this pull request Nov 7, 2024

JsonRpcConnection#Disconnect(): wait for #WriteOutgoingMessages() after cancelling I/O #10219

Open

julianbrost force-pushed the graceful-tls-disconnect branch from e90acc5 to 3a72a6f on November 7, 2024 16:32

julianbrost force-pushed the graceful-tls-disconnect branch from 3a72a6f to 9d67c26 on November 13, 2024 16:35

julianbrost force-pushed the graceful-tls-disconnect branch from 9d67c26 to e5b9ff4 on November 14, 2024 16:20

julianbrost force-pushed the graceful-tls-disconnect branch from e5b9ff4 to 7430618 on November 14, 2024 17:19

julianbrost marked this pull request as ready for review November 14, 2024 17:19

yhabteab reviewed Nov 15, 2024

Al2Klimov reviewed Nov 15, 2024

Al2Klimov assigned julianbrost Nov 15, 2024

julianbrost added 4 commits November 18, 2024 13:45

JsonRpcConnection: use AsioTlsStream::GracefulDisconnect()

a5bdf23

This new helper function allows deduplicating the timeout handling for `async_shutdown()`.

HttpServerConnection: use AsioTlsStream::GracefulDisconnect()

937b4bc

This new helper function has proper timeout handling which was missing here.

julianbrost force-pushed the graceful-tls-disconnect branch from 7430618 to a6445c7 on November 18, 2024 13:15

julianbrost requested review from yhabteab and Al2Klimov November 18, 2024 13:16




