
metrics-generator excessive err-mimir-sample-duplicate-timestamp #4201

Open
msvechla opened this issue Oct 17, 2024 · 7 comments

@msvechla

Describe the bug

Hello,

We are using the Tempo metrics-generator to generate span metrics from traces.

In general this works, however our metrics-generator is throwing lots of err-mimir-sample-duplicate-timestamp errors in the logs.

The error is thrown on average about 250 times per minute.

Some sample log lines:

ts=2024-10-17T10:42:13.405615049Z caller=dedupe.go:112 tenant=single-tenant component=remote level=error remote_name=45e95a url=http://mimir-gateway.default.svc.cluster.local/api/v1/push msg="non-recoverable error" failedSampleCount=2000 failedHistogramCount=0 failedExemplarCount=0 err="server returned HTTP status 400 Bad Request: send data to ingesters: failed pushing to ingester mimir-ingester-2: user=anonymous: the sample has been rejected because another sample with the same timestamp, but a different value, has already been ingested (err-mimir-sample-duplicate-timestamp). The affected sample has timestamp 2024-10-17T10:41:57.946Z and is from series traces_spanmetrics_latency_bucket{__metrics_gen_instance=\"tempo-metrics-generator-dcf9fd768-z29fq\", http_request_method=\"POST\", http_response_status_code=\"200\", le=\"1.02\", platform_aws_organizational_unit=\"fleetassets\", platform_aws_stage=\"prod\", service=\"vit-image-detect-labels\", span_kind=\"SPAN_KIND_CLIENT\", span_name=\"POST\", status_code=\"STATUS_CODE_UNSET\"} (sampled 1/10)\n"
ts=2024-10-17T10:42:11.722394528Z caller=dedupe.go:112 tenant=single-tenant component=remote level=error remote_name=45e95a url=http://mimir-gateway.default.svc.cluster.local/api/v1/push msg="non-recoverable error" failedSampleCount=2000 failedHistogramCount=0 failedExemplarCount=0 err="server returned HTTP status 400 Bad Request: send data to ingesters: failed pushing to ingester mimir-ingester-2: user=anonymous: the sample has been rejected because another sample with the same timestamp, but a different value, has already been ingested (err-mimir-sample-duplicate-timestamp). The affected sample has timestamp 2024-10-17T10:41:57.946Z and is from series traces_spanmetrics_latency_bucket{__metrics_gen_instance=\"tempo-metrics-generator-dcf9fd768-z29fq\", http_request_method=\"GET\", http_response_status_code=\"200\", le=\"1.02\", platform_aws_organizational_unit=\"fleetops\", platform_aws_stage=\"prod\", service=\"fmm-graphql-gateway\", span_kind=\"SPAN_KIND_CLIENT\", span_name=\"GET\", status_code=\"STATUS_CODE_UNSET\"} (sampled 1/10)\n"

In our infrastructure this seems to come mostly from metrics generated for auto-instrumented Node.js services, but it might affect other services as well.

Expected behavior

err-mimir-sample-duplicate-timestamp should not be thrown regularly

Environment:

  • Infrastructure: Kubernetes, OTEL Collector and auto-instrumentation, Mimir, Tempo
  • Deployment tool: Helm

Additional Context

metrics-generator config:

    metrics_generator:
      metrics_ingestion_time_range_slack: 30s
      processor:
        service_graphs:
          dimensions:
          - platform.aws.organizational_unit
          - platform.aws.stage
          histogram_buckets:
          - 0.1
          - 0.2
          - 0.4
          - 0.8
          - 1.6
          - 3.2
          - 6.4
          - 12.8
          max_items: 10000
          wait: 10s
          workers: 10
        span_metrics:
          dimension_mappings:
          - name: http_request_method
            source_labels:
            - http.method
          - name: http_response_status_code
            source_labels:
            - http.status_code
          dimensions:
          - platform.aws.organizational_unit
          - platform.aws.stage
          - messaging.operation
          - http.request.method
          - http.response.status_code
          - http.route
          - http.method
          - http.status_code
          histogram_buckets:
          - 0.002
          - 0.004
          - 0.008
          - 0.016
          - 0.032
          - 0.064
          - 0.128
          - 0.256
          - 0.512
          - 1.02
          - 2.05
          - 4.1
      registry:
        collection_interval: 15s
        external_labels: {}
        stale_duration: 15m
      ring:
        kvstore:
          store: memberlist
      storage:
        path: /var/tempo/wal
        remote_write:
        - send_exemplars: true
          url: ...
          write_relabel_configs:
          - regex: ^(.+)$
            source_labels:
            - http_method
            target_label: http_request_method
          - regex: ^(.+)$
            source_labels:
            - http_status_code
            target_label: http_response_status_code
          - action: labeldrop
            regex: ^http_method|http_status_code$

Thanks a lot for your help!

@mapno self-assigned this Oct 17, 2024
@mapno added type/bug and component/metrics-generator labels Oct 17, 2024
@mapno assigned javiermolinar and unassigned javiermolinar and mapno Oct 18, 2024
@yvrhdn
Member

yvrhdn commented Oct 18, 2024

Hi, I agree this shouldn't be happening. We add a unique label __metrics_gen_instance to the series generated by each metrics-generator to ensure the series are unique between them. Since you are running on Kubernetes, the hostnames for each pod should be unique.

Can you check some things?

  • Does this only happen to traces_spanmetrics_latency_bucket or to all metrics types? (so also counters)
  • Do the log lines have mostly the same attributes or are they very distinct? As in, is it just one series being repeated over and over, or do you see a mix of series?
  • Do you happen to have out-of-order sample ingestion enabled in Mimir? We shouldn't need it, but it might be interesting to see if this changes things.

Thanks!


Our current theory is that this might be due to how we are injecting a 0 sample when adding new histogram series. Whenever we remote write a series for the first time, we inject a 0 sample right before appending the actual value to clearly mark it as going from 0 to the new value:

// If we are about to call Append for the first time on a series,
// we need to first insert a 0 value to allow Prometheus to start from a non-null value.
if s.isNew() {
    lb.Set(labels.MetricName, h.nameCount)
    _, err = appender.Append(0, lb.Labels(), t-1, 0) // t-1 to ensure that the next value is not at the same time
    if err != nil {
        return
    }
    s.registerSeenSeries()
}

This code adds a sample with value 0, 1 ms before the normal sample.

We have similar code in the counter, but instead of injecting the sample 1 ms earlier, we delay the next sample by 1 s:

// If we are about to call Append for the first time on a series, we need
// to first insert a 0 value to allow Prometheus to start from a non-null
// value.
if s.isNew() {
    _, err = appender.Append(0, lb.Labels(), timeMs, 0)
    if err != nil {
        return
    }
    // Increment timeMs to ensure that the next value is not at the same time.
    t = t.Add(insertOffsetDuration)
    s.registerSeenSeries()
}
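
To make the difference concrete, schematically (illustrative timestamps only, for a new series whose first real sample falls at time t):

    histogram: (t-1ms, 0), (t, value)   // 0-sample injected just before the real sample
    counter:   (t, 0), (t+1s, value)    // real sample delayed by insertOffsetDuration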

So if you only see histograms causing the duplicate sample errors, that is a clear indicator something in that implementation is not right.

@msvechla
Author

Hi and thanks for your detailed analysis!


Does this only happen to traces_spanmetrics_latency_bucket or to all metrics types? (so also counters)

  • It appears to happen with all kinds of series; however, the largest number of errors comes from the histogram metrics

Do the log lines have mostly the same attributes or are they very distinct? As in, is it just one series being repeated over and over, or do you see a mix of series?

  • Most errors have very similar attributes; all of them have a very generic span_name like POST or GET

Do you happen to have out-of-order sample ingestion enabled in Mimir? We shouldn't need it, but it might be interesting to see if this changes things.

  • I will give this a try and report back

@msvechla
Author

@kvrhdn I also tried enabling out-of-order sample ingestion, but this had no effect at all. We are still getting the same amount of err-mimir-sample-duplicate-timestamp errors.

@yvrhdn
Member

yvrhdn commented Oct 25, 2024

Do you maybe have any relabel configs on your remote write? Maybe you are dropping labels that would make series unique.
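
For example (purely illustrative series, using the label names from the write_relabel_configs in the issue description): if one series carries the value in http_method and another already carries it in http_request_method,

    traces_spanmetrics_calls_total{http_method="GET", span_name="GET", ...}
    traces_spanmetrics_calls_total{http_request_method="GET", span_name="GET", ...}

a rule that copies http_method into http_request_method and then drops http_method would collapse both into

    traces_spanmetrics_calls_total{http_request_method="GET", span_name="GET", ...}

so two different values would end up being written for the same series and timestamp.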

Or do you maybe have multiple sources of these metrics? (e.g. you are sending from both Tempo and Alloy)

Our current theory is that this might be due to how we are injecting a 0 sample when adding new histogram series.

We did some more investigation into this and it shouldn't be the issue. The errors we were seeing elsewhere were related to aggregation.

@msvechla
Author

Do you maybe have any relabel configs on your remote write? Maybe you are dropping labels that would make series unique.

The only remote-write config we have is the one I posted in our config above. To my knowledge this should just rename the labels and not cause any loss of uniqueness:

    storage:
      path: /var/tempo/wal
      remote_write:
      - send_exemplars: true
        url: ...
        write_relabel_configs:
        - regex: ^(.+)$
          source_labels:
          - http_method
          target_label: http_request_method
        - regex: ^(.+)$
          source_labels:
          - http_status_code
          target_label: http_response_status_code
        - action: labeldrop
          regex: ^http_method|http_status_code$

@msvechla
Author

msvechla commented Oct 28, 2024

@kvrhdn it looks like removing the above-mentioned write_relabel_configs fixes the issue.

Any idea what is causing issues with this relabel config? As I mentioned, I just want to rename http_method to http_request_method and http_status_code to http_response_status_code.

The ingested metrics look exactly like we would expect:

With relabel config:

traces_spanmetrics_calls_total{__metrics_gen_instance="tempo-metrics-generator-85ccf9bcdc-lw8sh",http_request_method="GET",http_response_status_code="200",platform_aws_organizational_unit="fleetops",platform_aws_stage="prod",service="fmm-graphql-gateway",span_kind="SPAN_KIND_CLIENT",span_name="GET",status_code="STATUS_CODE_UNSET"}

Without relabel config:

traces_spanmetrics_calls_total{__metrics_gen_instance="tempo-metrics-generator-85ccf9bcdc-lw8sh",http_method="GET",http_status_code="200",platform_aws_organizational_unit="fleetops",platform_aws_stage="prod",service="fmm-graphql-gateway",span_kind="SPAN_KIND_CLIENT",span_name="GET",status_code="STATUS_CODE_UNSET"}

@yvrhdn
Member

yvrhdn commented Oct 29, 2024

Nice find! Yeah, I'm not sure why this relabel config is causing issues (but I'm also not an expert on this). Maybe ^(.+)$ is too restrictive? You could try (.*) instead (the default) and see if that changes things.

Metrics-generator also has built-in support to remap dimensions: it's the dimension_mappings setting
https://grafana.com/docs/tempo/latest/configuration/#metrics-generator

I think this should work:

metrics_generator:
  processor:
    span_metrics:
      dimension_mappings:
        - name: http_request_method
          source_labels:
            - http.method
          join: ''
        - name: http_response_status_code
          source_labels:
            - http.status_code
          join: ''

Note that source_labels is the attribute name on the span, while name is the Prometheus label. This can be a bit confusing.
