This repository has been archived by the owner on Jun 1, 2024. It is now read-only.

Periodic batches dropped because of Maximum timeout #383

Open

BramVader opened this issue Feb 12, 2021 · 4 comments

@BramVader

Does this issue relate to a new feature or an existing bug?

  • Bug
  • New Feature

What version of Serilog.Sinks.Elasticsearch is affected? Please list the related NuGet package.
Serilog.Sinks.Elasticsearch 8.4.1

What is the target framework and operating system? See target frameworks & net standard matrix.

  • netStandard 2.0
  • netCore 2.0
  • netCore 1.0
  • 4.7
  • 4.6.x
  • 4.5.x

Please describe the current behavior?
I'm testing the reliability of the Elasticsearch sink by emitting 1,000,000 items using durable buffering. Shipping takes about 18 minutes (about 950 items/second). I notice that during this process the following exception sometimes occurs (in the SelfLog):

2021-02-12T12:55:56.5332286Z Exception while emitting periodic batch from Serilog.Sinks.Elasticsearch.Durable.ElasticsearchLogClient: Elasticsearch.Net.ElasticsearchClientException: Maximum timeout reached while retrying request. Call: Status code unknown from: POST /_bulk
 ---> System.Threading.Tasks.TaskCanceledException: The operation was canceled.
   at System.Net.Http.ConnectHelper.ConnectAsync(String host, Int32 port, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.ConnectAsync(HttpRequestMessage request, Boolean allowHttp2, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.CreateHttp11ConnectionAsync(HttpRequestMessage request, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.GetHttpConnectionAsync(HttpRequestMessage request, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.SendWithRetryAsync(HttpRequestMessage request, Boolean doRequestAuth, CancellationToken cancellationToken)
   at System.Net.Http.RedirectHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
   at System.Net.Http.HttpClient.FinishSendAsyncUnbuffered(Task`1 sendTask, HttpRequestMessage request, CancellationTokenSource cts, Boolean disposeCts)
   at Elasticsearch.Net.HttpConnection.RequestAsync[TResponse](RequestData requestData, CancellationToken cancellationToken)
   --- End of inner exception stack trace ---
   at Elasticsearch.Net.Transport`1.HandleElasticsearchClientException(RequestData data, Exception clientException, IElasticsearchResponse response)
   at Elasticsearch.Net.Transport`1.FinalizeResponse[TResponse](RequestData requestData, IRequestPipeline pipeline, List`1 seenExceptions, TResponse response)
   at Elasticsearch.Net.Transport`1.RequestAsync[TResponse](HttpMethod method, String path, CancellationToken cancellationToken, PostData data, IRequestParameters requestParameters)
   at Serilog.Sinks.Elasticsearch.Durable.ElasticsearchLogClient.SendPayloadAsync(List`1 payload, Boolean first)

I already changed sinkOptions.ConnectionTimeout to a very long time (one day) instead of the default 5 seconds, but that does not seem to make a difference.

I'm wondering why this happens. Maybe Elasticsearch can't keep up with the indexing rate and starts throttling the connection. But what seems really strange to me is that this batch is then considered "lost" and durable buffering just moves on to the next batch. In my opinion, it should never skip data because of a failure; the only exception would be when the file buffer reaches its sinkOptions.BufferFileSizeLimitBytes (configured to 100 MB).
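
For reference, a minimal sketch of roughly the setup I'm testing with (the node URI and buffer file path are placeholders; the property names are from the 8.x ElasticsearchSinkOptions):

```csharp
using System;
using Serilog;
using Serilog.Debugging;
using Serilog.Sinks.Elasticsearch;

class Program
{
    static void Main()
    {
        // Surface sink errors (like the "Maximum timeout reached" exception above) on stderr.
        SelfLog.Enable(msg => Console.Error.WriteLine(msg));

        var options = new ElasticsearchSinkOptions(new Uri("http://localhost:9200")) // placeholder node
        {
            // Durable mode: events go to a local buffer file first and are shipped in periodic batches.
            BufferBaseFilename = "./logs/es-buffer",        // placeholder path
            BufferFileSizeLimitBytes = 100L * 1024 * 1024,  // 100 MB, as mentioned above
            ConnectionTimeout = TimeSpan.FromDays(1)        // raised from the 5-second default
        };

        Log.Logger = new LoggerConfiguration()
            .WriteTo.Elasticsearch(options)
            .CreateLogger();

        for (var i = 0; i < 1_000_000; i++)
            Log.Information("Test event {Counter}", i);

        Log.CloseAndFlush();
    }
}
```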

Please describe the expected behavior?
It should keep retrying to send the current batch before starting with the next batch.
Also, is there a way to configure this behavior? It seems that sinkOptions.ConnectionTimeout has no effect.
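
If the drop happens because the underlying Elasticsearch.Net client gives up once its retry timeout elapses, I would expect something like the following to be configurable. This is only a sketch of what I have in mind; it assumes the client's ConnectionConfiguration fluent API (RequestTimeout, MaxRetryTimeout, MaximumRetries) is reachable through ModifyConnectionSettings, and I have not verified that it prevents the batch from being skipped:

```csharp
using System;
using Serilog;
using Serilog.Sinks.Elasticsearch;

var options = new ElasticsearchSinkOptions(new Uri("http://localhost:9200")) // placeholder node
{
    BufferBaseFilename = "./logs/es-buffer", // placeholder path
    // ModifyConnectionSettings hands the Elasticsearch.Net connection configuration to the caller;
    // that configuration owns the retry budget behind "Maximum timeout reached while retrying request".
    ModifyConnectionSettings = c => c
        .RequestTimeout(TimeSpan.FromMinutes(2))    // per-attempt timeout (illustrative value)
        .MaxRetryTimeout(TimeSpan.FromMinutes(10))  // overall budget across retries (illustrative value)
        .MaximumRetries(10)                         // illustrative value
};

Log.Logger = new LoggerConfiguration()
    .WriteTo.Elasticsearch(options)
    .CreateLogger();
```

Even then, from the stack trace it looks like the batch is treated as permanently failed once the client throws, so a larger retry budget may only make drops rarer rather than impossible.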

@mivano
Contributor

mivano commented Feb 13, 2021

Thanks for reporting this. The durable part is, however, a difficult beast in this sink, and I wish it were a bit more reliable. I do not have an answer for why this happens, but it should not lead to data loss. For these kinds of scenarios, I'd advise that you are better off with a real queue implementation.
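
To illustrate what I mean by a real queue (just a rough sketch with hypothetical types, not something this sink provides): let the application only enqueue events, and have a separate consumer own delivery, retrying the same batch until it succeeds rather than dropping it:

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Channels;
using System.Threading.Tasks;

// Hypothetical shipper abstraction; SendBatchAsync would do the actual _bulk POST.
public interface ILogShipper
{
    Task SendBatchAsync(IReadOnlyList<string> batch);
}

public class QueuedLogPump
{
    private readonly Channel<string> _queue = Channel.CreateBounded<string>(100_000);
    private readonly ILogShipper _shipper;

    public QueuedLogPump(ILogShipper shipper) => _shipper = shipper;

    // The application only enqueues; it never blocks on (or fails because of) Elasticsearch.
    public bool TryEnqueue(string logEvent) => _queue.Writer.TryWrite(logEvent);

    // A background consumer owns delivery and keeps retrying the *same* batch until it succeeds.
    public async Task PumpAsync()
    {
        var batch = new List<string>();
        await foreach (var item in _queue.Reader.ReadAllAsync())
        {
            batch.Add(item);
            if (batch.Count < 500) continue; // batch size is illustrative

            while (true)
            {
                try
                {
                    await _shipper.SendBatchAsync(batch);
                    batch.Clear();
                    break;
                }
                catch (Exception ex)
                {
                    Console.Error.WriteLine($"Shipping failed, will retry the batch: {ex.Message}");
                    await Task.Delay(TimeSpan.FromSeconds(5));
                }
            }
        }
    }
}
```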

@BramVader
Author

Thanks Michael for your quick (but rather unsatisfying) answer! I'm just wondering: am I the only one trying to use durable buffering, or are others experiencing the same problems? Is it because I'm running both Elasticsearch and my test code in Kubernetes?
In the meantime, I'm going to try to reproduce this problem locally, running my test code in Visual Studio with the source code of both Elasticsearch.NET and Serilog.Sinks.Elasticsearch checked out (using project references). Maybe I can fix it...

@mivano
Contributor

mivano commented Feb 15, 2021

The durable implementation is a copy of the Seq one. The code bases have since diverged, and the test coverage is not good enough to keep it up to date. It was supposed to be a nice feature for handling occasionally disconnected scenarios. However, I do see a lot of issues reported about this durable part.

I would rather remove this part from the sink, as it contains duplicated code as well and makes maintenance harder. But there is no better alternative at the moment. (#254 (comment))

This sink is great for getting started quickly with ingesting data into ES. It ships with a nice set of defaults and has extension points. However, it is not 100% reliable: high load, network issues, etc. can cause it to go down. This is in line with the Serilog principle that the logger should not impact the application (https://github.com/serilog/serilog/wiki/Reliability). It is also not an audit logger (documentation is sparse, but see serilog/serilog#1170).

I'm not a current user of the sink myself anymore, so I'm more than happy if someone can help out. If you get some ideas, please share!

@zhangzw218

This is a very old problem and it doesn't seem to be solved; I'm in the same situation right now.
