
metrics-generator excessive err-mimir-sample-duplicate-timestamp #4201

Open
msvechla opened this issue Oct 17, 2024 · 7 comments

@msvechla

Describe the bug

Hello,

We are using the Tempo metrics-generator to generate span metrics from traces.

In general this works, however our metrics-generator is throwing lots of err-mimir-sample-duplicate-timestamp errors in the logs.

The error is thrown on average about 250 times per minute.

Some sample log lines:

ts=2024-10-17T10:42:13.405615049Z caller=dedupe.go:112 tenant=single-tenant component=remote level=error remote_name=45e95a url=http://mimir-gateway.default.svc.cluster.local/api/v1/push msg="non-recoverable error" failedSampleCount=2000 failedHistogramCount=0 failedExemplarCount=0 err="server returned HTTP status 400 Bad Request: send data to ingesters: failed pushing to ingester mimir-ingester-2: user=anonymous: the sample has been rejected because another sample with the same timestamp, but a different value, has already been ingested (err-mimir-sample-duplicate-timestamp). The affected sample has timestamp 2024-10-17T10:41:57.946Z and is from series traces_spanmetrics_latency_bucket{__metrics_gen_instance=\"tempo-metrics-generator-dcf9fd768-z29fq\", http_request_method=\"POST\", http_response_status_code=\"200\", le=\"1.02\", platform_aws_organizational_unit=\"fleetassets\", platform_aws_stage=\"prod\", service=\"vit-image-detect-labels\", span_kind=\"SPAN_KIND_CLIENT\", span_name=\"POST\", status_code=\"STATUS_CODE_UNSET\"} (sampled 1/10)\n"
ts=2024-10-17T10:42:11.722394528Z caller=dedupe.go:112 tenant=single-tenant component=remote level=error remote_name=45e95a url=http://mimir-gateway.default.svc.cluster.local/api/v1/push msg="non-recoverable error" failedSampleCount=2000 failedHistogramCount=0 failedExemplarCount=0 err="server returned HTTP status 400 Bad Request: send data to ingesters: failed pushing to ingester mimir-ingester-2: user=anonymous: the sample has been rejected because another sample with the same timestamp, but a different value, has already been ingested (err-mimir-sample-duplicate-timestamp). The affected sample has timestamp 2024-10-17T10:41:57.946Z and is from series traces_spanmetrics_latency_bucket{__metrics_gen_instance=\"tempo-metrics-generator-dcf9fd768-z29fq\", http_request_method=\"GET\", http_response_status_code=\"200\", le=\"1.02\", platform_aws_organizational_unit=\"fleetops\", platform_aws_stage=\"prod\", service=\"fmm-graphql-gateway\", span_kind=\"SPAN_KIND_CLIENT\", span_name=\"GET\", status_code=\"STATUS_CODE_UNSET\"} (sampled 1/10)\n"

In our infrastructure this seems to come mostly from metrics generated for auto-instrumented Node.js services, but it might affect other services as well.

Expected behavior

err-mimir-sample-duplicate-timestamp should not be thrown regularly

Environment:

  • Infrastructure: Kubernetes, OTEL Collector and auto-instrumentation, Mimir, Tempo
  • Deployment tool: Helm

Additional Context

metrics-generator config:

    metrics_generator:
      metrics_ingestion_time_range_slack: 30s
      processor:
        service_graphs:
          dimensions:
          - platform.aws.organizational_unit
          - platform.aws.stage
          histogram_buckets:
          - 0.1
          - 0.2
          - 0.4
          - 0.8
          - 1.6
          - 3.2
          - 6.4
          - 12.8
          max_items: 10000
          wait: 10s
          workers: 10
        span_metrics:
          dimension_mappings:
          - name: http_request_method
            source_labels:
            - http.method
          - name: http_response_status_code
            source_labels:
            - http.status_code
          dimensions:
          - platform.aws.organizational_unit
          - platform.aws.stage
          - messaging.operation
          - http.request.method
          - http.response.status_code
          - http.route
          - http.method
          - http.status_code
          histogram_buckets:
          - 0.002
          - 0.004
          - 0.008
          - 0.016
          - 0.032
          - 0.064
          - 0.128
          - 0.256
          - 0.512
          - 1.02
          - 2.05
          - 4.1
      registry:
        collection_interval: 15s
        external_labels: {}
        stale_duration: 15m
      ring:
        kvstore:
          store: memberlist
      storage:
        path: /var/tempo/wal
        remote_write:
        - send_exemplars: true
          url: ...
          write_relabel_configs:
          - regex: ^(.+)$
            source_labels:
            - http_method
            target_label: http_request_method
          - regex: ^(.+)$
            source_labels:
            - http_status_code
            target_label: http_response_status_code
          - action: labeldrop
            regex: ^http_method|http_status_code$

Thanks a lot for your help!

@mapno self-assigned this Oct 17, 2024
@mapno added type/bug and component/metrics-generator labels Oct 17, 2024
@mapno assigned javiermolinar and unassigned javiermolinar and mapno Oct 18, 2024
@yvrhdn
Member

yvrhdn commented Oct 18, 2024

Hi, I agree this shouldn't be happening. We add a unique label __metrics_gen_instance to the series generated by each metrics-generator to ensure the series are unique between them. Since you are running on Kubernetes, the hostnames for each pod should be unique.

Can you check some things?

  • Does this only happen to traces_spanmetrics_latency_bucket or to all metrics types? (so also counters)
  • Do the log lines have mostly the same attributes or are they very distinct? As in, is it just one series being repeated over and over, or do you see a mix of series?
  • Do you happen to have out-of-order sample ingestion enabled in Mimir? We shouldn't need it, but it might be interesting to see if this changes things.

Thanks!


Our current theory is that this might be due to how we are injecting a 0 sample when adding new histogram series. Whenever we remote write a series for the first time, we inject a 0 sample right before appending the actual value to clearly mark it as going from 0 to the new value:

// If we are about to call Append for the first time on a series,
// we need to first insert a 0 value to allow Prometheus to start from a non-null value.
if s.isNew() {
    lb.Set(labels.MetricName, h.nameCount)
    _, err = appender.Append(0, lb.Labels(), t-1, 0) // t-1 to ensure that the next value is not at the same time
    if err != nil {
        return
    }
    s.registerSeenSeries()
}

This code adds a sample with value 0, 1 ms before the normal sample.

We have similar code in the counter, but instead of injecting the sample 1 ms earlier, we delay the next sample by 1 s:

// If we are about to call Append for the first time on a series, we need
// to first insert a 0 value to allow Prometheus to start from a non-null
// value.
if s.isNew() {
    _, err = appender.Append(0, lb.Labels(), timeMs, 0)
    if err != nil {
        return
    }
    // Increment timeMs to ensure that the next value is not at the same time.
    t = t.Add(insertOffsetDuration)
    s.registerSeenSeries()
}
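
To make the difference concrete, schematically (illustrative timestamps only, for a new series whose first real sample falls at time t):

    histogram: (t-1ms, 0), (t, value)   // 0-sample injected just before the real sample
    counter:   (t, 0), (t+1s, value)    // real sample delayed by insertOffsetDuration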

So if you only see histograms causing the duplicate sample errors, that is a clear indicator something in that implementation is not right.

@msvechla
Author

Hi and thanks for your detailed analysis!


Does this only happen to traces_spanmetrics_latency_bucket or to all metrics types? (so also counters)

  • It appears to happen with all kinds of series; however, the largest number of errors comes from the histogram metrics

Do the log lines have mostly the same attributes or are they very distinct? As in, is it just one series being repeated over and over, or do you see a mix of series?

  • Most errors have very similar attributes; all of them have a very generic span_name like POST or GET

Do you happen to have out-of-order sample ingestion enabled in Mimir? We shouldn't need it, but it might be interesting to see if this changes things.

  • I will give this a try and report back

@msvechla
Author

@kvrhdn I also tried enabling out-of-order sample ingestion, but this had no effect at all. We are still getting the same amount of err-mimir-sample-duplicate-timestamp errors.

@yvrhdn
Member

yvrhdn commented Oct 25, 2024

Do you maybe have any relabel configs on your remote write? Maybe you are dropping labels that would make series unique.
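
For example (purely illustrative series, using the label names from the write_relabel_configs in the issue description): if one series carries the value in http_method and another already carries it in http_request_method,

    traces_spanmetrics_calls_total{http_method="GET", span_name="GET", ...}
    traces_spanmetrics_calls_total{http_request_method="GET", span_name="GET", ...}

a rule that copies http_method into http_request_method and then drops http_method would collapse both into

    traces_spanmetrics_calls_total{http_request_method="GET", span_name="GET", ...}

so two different values would end up being written for the same series and timestamp.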

Or do you maybe have multiple sources of these metrics? (e.g. you are sending from both Tempo and Alloy)

Our current theory is that this might be due to how we are injecting a 0 sample when adding new histogram series.

We did some more investigation into this and it shouldn't be the issue. The errors we were seeing elsewhere were related to aggregation.

@msvechla
Author

Do you maybe have any relabel configs on your remote write? Maybe you are dropping labels that would make series unique.

The only remote-write config we have is the one I posted in our config above. To my knowledge this should just rename the labels and not cause any loss of uniqueness:

    storage:
      path: /var/tempo/wal
      remote_write:
      - send_exemplars: true
        url: ...
        write_relabel_configs:
        - regex: ^(.+)$
          source_labels:
          - http_method
          target_label: http_request_method
        - regex: ^(.+)$
          source_labels:
          - http_status_code
          target_label: http_response_status_code
        - action: labeldrop
          regex: ^http_method|http_status_code$

@msvechla
Author

msvechla commented Oct 28, 2024

@kvrhdn it looks like removing the above-mentioned write_relabel_configs fixes the issue.

Any idea what is causing issues with this relabel config? As I mentioned, I just want to rename http_method to http_request_method and http_status_code to http_response_status_code.

The ingested metrics look exactly like we would expect:

With relabel config:

traces_spanmetrics_calls_total{__metrics_gen_instance="tempo-metrics-generator-85ccf9bcdc-lw8sh",http_request_method="GET",http_response_status_code="200",platform_aws_organizational_unit="fleetops",platform_aws_stage="prod",service="fmm-graphql-gateway",span_kind="SPAN_KIND_CLIENT",span_name="GET",status_code="STATUS_CODE_UNSET"}

Without relabel config:

traces_spanmetrics_calls_total{__metrics_gen_instance="tempo-metrics-generator-85ccf9bcdc-lw8sh",http_method="GET",http_status_code="200",platform_aws_organizational_unit="fleetops",platform_aws_stage="prod",service="fmm-graphql-gateway",span_kind="SPAN_KIND_CLIENT",span_name="GET",status_code="STATUS_CODE_UNSET"}

@yvrhdn
Member

yvrhdn commented Oct 29, 2024

Nice find! Yeah, I'm not sure why this relabel config is causing issues (but I'm also not an expert on this). Maybe ^(.+)$ is too restrictive? You could try (.*) instead (the default) and see if that changes things.

Metrics-generator also has built-in support to remap dimensions: it's the dimension_mappings setting
https://grafana.com/docs/tempo/latest/configuration/#metrics-generator

I think this should work:

metrics_generator:
  processor:
    span_metrics:
      dimension_mappings:
        - name: http_request_method
          source_labels:
            - http.method
          join: ''
        - name: http_response_status_code
          source_labels:
            - http.status_code
          join: ''

Note that source_labels is the attribute name on the span, while name is the Prometheus label. This can be a bit confusing.
