
When the network is abnormally disconnected, the connection is not restored and subsequent access produces a timeout error #1782

Closed
Adam-Jin opened this issue Jun 26, 2021 · 6 comments


@Adam-Jin

I have some microservices hosted in a three-node Docker Swarm cluster, and they rely on a single-node Redis instance on the same custom network.

Under normal circumstances everything is fine. When I test the Swarm cluster by shutting down the Docker daemon on the node where the Redis instance is running, or by simply shutting down that machine, a new Redis instance is quickly created on another node, and the services reach the new instance through its domain name without any problem.

But when I change the test method and use "systemctl stop network" to shut down the network service of the node where the Redis instance is running, the services never reconnect to Redis, even after a new Redis instance is created on another machine; every access produces a timeout error. In that situation I tried connecting to Redis with the redis-cli client from inside the service container, and everything worked: I could set and get key values correctly. This suggests the timeout is not caused by the Swarm network itself.

I also tried setting keepAlive in the connection string and upgrading the NuGet package to the latest version, but neither attempt helped.
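For reference, the keepAlive and abortConnect settings can also be expressed through ConfigurationOptions. The following is a minimal sketch based on the connection string reported below; the endpoint name and password are just the values from this report, not recommendations:

```csharp
using StackExchange.Redis;

// Sketch only: parse the connection string from this report, then show the
// equivalent programmatic settings (values here are assumptions/examples).
var options = ConfigurationOptions.Parse(
    "console-net_redis:6379,password=123456,ssl=False,abortConnect=False,keepAlive=30");

options.KeepAlive = 30;             // seconds between keep-alive pings
options.AbortOnConnectFail = false; // keep retrying instead of failing the first connect
options.ConnectTimeout = 5000;      // milliseconds; example value, not a recommendation

IConnectionMultiplexer muxer = ConnectionMultiplexer.Connect(options);
IDatabase db = muxer.GetDatabase();
```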

I have also seen this problem reported before, and that report seems to describe a similar situation to mine.
Do you have any suggestions?

Redis: 6.0.5
StackExchange.Redis: 2.2.50
RedisConnectionString: console-net_redis:6379,password=123456,ssl=False,abortConnect=False,keepAlive=30
Exception:

[[bis]] [2021-06-26 06:28:55.409] [err] [production] [apigateway] [491241383e034d3ab3ea70a5fc724e4d] Connection id "0HM9ODBMBJ9PU", Request id "0HM9ODBMBJ9PU:00000014": An unhandled exception was thrown by the application. StackExchange.Redis.RedisTimeoutException: Timeout performing GET (5000ms), next: PING, inst: 0, qu: 0, qs: 19, aw: False, rs: ReadAsync, ws: Idle, in: 0, in-pipe: 0, out-pipe: 0, serverEndpoint: console-net_redis:6379, mc: 1/1/0, mgr: 10 of 10 available, clientName: 2c3f5c0383c1, IOCP: (Busy=0,Free=1000,Min=200,Max=1000), WORKER: (Busy=1,Free=32766,Min=200,Max=32767), v: 2.1.30.38891 (Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts)
at StackExchange.Redis.ConnectionMultiplexer.ExecuteSyncImpl[T](Message message, ResultProcessor`1 processor, ServerEndPoint server) in /_/src/StackExchange.Redis/ConnectionMultiplexer.cs:line 2624
at StackExchange.Redis.RedisBase.ExecuteSync[T](Message message, ResultProcessor`1 processor, ServerEndPoint server) in /_/src/StackExchange.Redis/RedisBase.cs:line 54
at StackExchange.Redis.RedisDatabase.StringGet(RedisKey key, CommandFlags flags) in /_/src/StackExchange.Redis/RedisDatabase.cs:line 2374
at Encoo.Console.ClusterServices.ApiGateway.Caching.CacheManager`1.Get(String key, String region) in /src/ClusterServices/Gateways/ApiGateway/Caching/CacheManager.cs:line 59
at Encoo.Console.ClusterServices.ApiGateway.IdentityUserProvider.GetUserAsync(HttpContext context, CancellationToken token) in /src/ClusterServices/Gateways/ApiGateway/Services/IdentityUserProvider.cs:line 35
at Encoo.Console.ClusterServices.ApiGateway.AuthorizeHeadersMiddleware.InvokeAsync(HttpContext context) in /src/ClusterServices/Gateways/ApiGateway/Middleware/AuthorizeHeadersMiddleware.cs:line 29
at Microsoft.AspNetCore.MiddlewareAnalysis.AnalysisMiddleware.Invoke(HttpContext httpContext)
at Microsoft.AspNetCore.Authentication.AuthenticationMiddleware.Invoke(HttpContext context)
at Microsoft.AspNetCore.MiddlewareAnalysis.AnalysisMiddleware.Invoke(HttpContext httpContext)
at Microsoft.AspNetCore.MiddlewareAnalysis.AnalysisMiddleware.Invoke(HttpContext httpContext)
at Microsoft.AspNetCore.MiddlewareAnalysis.AnalysisMiddleware.Invoke(HttpContext httpContext)
at Encoo.Console.Framework.Middleware.TraceMiddleware.InvokeAsync(HttpContext context) in /src/Common/Framework/ServiceCore/Middleware/TraceMiddleware.cs:line 22
at Microsoft.AspNetCore.MiddlewareAnalysis.AnalysisMiddleware.Invoke(HttpContext httpContext)
at Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Http.HttpProtocol.ProcessRequests[TContext](IHttpApplication`1 application)

@NickCraver
Collaborator

Could you please try the 2.2.50 release? Your report lists version 2.2.50, but the error message shows that the running version is 2.1.30. We shipped several connection fixes in 2.2.50 for exactly this sort of situation, to reconnect faster and more reliably.
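If it helps, here is a hypothetical diagnostic snippet (not from this issue) to confirm which StackExchange.Redis assembly is actually loaded at runtime, for comparison with the "v:" value in the exception text:

```csharp
using System;
using System.Reflection;
using StackExchange.Redis;

// Print where the StackExchange.Redis assembly was loaded from and its version,
// to verify the deployed app really picked up the upgraded package.
var asm = typeof(ConnectionMultiplexer).Assembly;
var info = asm.GetCustomAttribute<AssemblyInformationalVersionAttribute>()?.InformationalVersion;

Console.WriteLine($"Loaded from: {asm.Location}");
Console.WriteLine($"Version: {info ?? asm.GetName().Version?.ToString()}");
```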

@Adam-Jin
Author

Sorry, I had actually already tried the latest version of the NuGet package and it still doesn't work. I will post the logs shortly.

@Adam-Jin
Author

[[bis]] [2021-06-28 02:54:12.038] [err] [production] [apigateway] [314e94a273614bc8a0b899a6c7361e64] Connection id "0HM9PRUQIPGJN", Request id "0HM9PRUQIPGJN:0000000D": An unhandled exception was thrown by the application. StackExchange.Redis.RedisTimeoutException: Timeout performing GET (5000ms), next: PING, inst: 0, qu: 0, qs: 3, aw: False, rs: ReadAsync, ws: Idle, in: 0, in-pipe: 0, out-pipe: 0, serverEndpoint: console-net_redis:6379, mc: 1/1/0, mgr: 10 of 10 available, clientName: c229e37f06c7, IOCP: (Busy=0,Free=1000,Min=200,Max=1000), WORKER: (Busy=1,Free=32766,Min=200,Max=32767), v: 2.2.50.36290 (Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts)
at StackExchange.Redis.ConnectionMultiplexer.ExecuteSyncImpl[T](Message message, ResultProcessor`1 processor, ServerEndPoint server) in /_/src/StackExchange.Redis/ConnectionMultiplexer.cs:line 2848
at StackExchange.Redis.RedisBase.ExecuteSync[T](Message message, ResultProcessor`1 processor, ServerEndPoint server) in /_/src/StackExchange.Redis/RedisBase.cs:line 54
at StackExchange.Redis.RedisDatabase.StringGet(RedisKey key, CommandFlags flags) in /_/src/StackExchange.Redis/RedisDatabase.cs:line 2409
at Encoo.Console.ClusterServices.ApiGateway.Caching.CacheManager`1.Get(String key, String region) in /src/ClusterServices/Gateways/ApiGateway/Caching/CacheManager.cs:line 55
at Encoo.Console.ClusterServices.ApiGateway.AuthorizeMiddleware.InvokeAsync(HttpContext context) in /src/ClusterServices/Gateways/ApiGateway/Middleware/AuthorizeMiddleware.cs:line 72
at Microsoft.AspNetCore.MiddlewareAnalysis.AnalysisMiddleware.Invoke(HttpContext httpContext)
at Encoo.Console.ClusterServices.ApiGateway.AuthorizeHeadersMiddleware.InvokeAsync(HttpContext context) in /src/ClusterServices/Gateways/ApiGateway/Middleware/AuthorizeHeadersMiddleware.cs:line 97
at Microsoft.AspNetCore.MiddlewareAnalysis.AnalysisMiddleware.Invoke(HttpContext httpContext)
at Microsoft.AspNetCore.Authentication.AuthenticationMiddleware.Invoke(HttpContext context)
at Microsoft.AspNetCore.MiddlewareAnalysis.AnalysisMiddleware.Invoke(HttpContext httpContext)
at Microsoft.AspNetCore.MiddlewareAnalysis.AnalysisMiddleware.Invoke(HttpContext httpContext)
at Microsoft.AspNetCore.MiddlewareAnalysis.AnalysisMiddleware.Invoke(HttpContext httpContext)
at Encoo.Console.Framework.Middleware.TraceMiddleware.InvokeAsync(HttpContext context) in /src/Common/Framework/ServiceCore/Middleware/TraceMiddleware.cs:line 21
at Microsoft.AspNetCore.MiddlewareAnalysis.AnalysisMiddleware.Invoke(HttpContext httpContext)
at Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Http.HttpProtocol.ProcessRequests[TContext](IHttpApplication`1 application)

@SimonPapworth

We had to code around this because some disconnects cause the client to never connect again. A colleague changed our code so that it now forces a new connection to be created after a certain number of a particular client exception within a short time window (sketched below).

Of course, that only works when there is a decent amount of traffic 24/7.
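A minimal sketch of that kind of force-reconnect workaround, assuming the trigger is repeated RedisTimeoutException / RedisConnectionException; the class name, thresholds, and connection string are illustrative assumptions, not the actual code referenced above:

```csharp
using System;
using StackExchange.Redis;

// Sketch: recreate the ConnectionMultiplexer after too many Redis errors in a
// short window, since a stalled multiplexer may otherwise never recover.
public static class RedisConnection
{
    private const string ConnectionString = "console-net_redis:6379,abortConnect=False"; // example
    private const int ErrorThreshold = 5;                                 // errors before reconnecting
    private static readonly TimeSpan ErrorWindow = TimeSpan.FromSeconds(30);

    private static Lazy<ConnectionMultiplexer> _muxer = CreateMultiplexer();
    private static int _errorCount;
    private static DateTime _windowStart = DateTime.UtcNow;
    private static readonly object _lock = new object();

    public static IDatabase GetDatabase() => _muxer.Value.GetDatabase();

    // Call this from catch blocks for RedisTimeoutException / RedisConnectionException.
    public static void OnRedisError()
    {
        lock (_lock)
        {
            var now = DateTime.UtcNow;
            if (now - _windowStart > ErrorWindow)
            {
                _windowStart = now;
                _errorCount = 0;
            }

            if (++_errorCount < ErrorThreshold) return;

            // Too many errors in a short time: swap in a fresh multiplexer.
            var old = _muxer;
            _muxer = CreateMultiplexer();
            _errorCount = 0;
            if (old.IsValueCreated)
            {
                try { old.Value.Dispose(); } catch { /* best effort */ }
            }
        }
    }

    private static Lazy<ConnectionMultiplexer> CreateMultiplexer() =>
        new Lazy<ConnectionMultiplexer>(() => ConnectionMultiplexer.Connect(ConnectionString));
}
```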

@NickCraver
Collaborator

Forgot to follow up on this - please see #1848 for the platform-level issue and how to resolve it if you encounter this!

@philon-msft
Collaborator

Update: a new version, 2.7.10, has been released, including #2610 to detect and recover from stalled sockets. This should help prevent the situation where connections can stall for ~15 minutes on Linux clients.
