
[Problem] How can I improve tempo query performance #4239
Open · g3david405 opened this issue Oct 26, 2024 · 5 comments

@g3david405 commented Oct 26, 2024

I am using the latest version of tempo-distributed (v2.6.1), and my data volume is approximately 1,000 records per second, with a retention period of 21 days totaling around 900 GB. When performing TraceQL queries, I’m encountering significant performance bottlenecks, especially when querying span or resource attributes.

According to this article:
https://grafana.com/docs/tempo/latest/operations/backend_search/
here are the improvements I've implemented so far:

  1. Using the vParquet4 block format and configuring dedicated_columns for specific span and resource attributes.
  2. Enabling stream_over_http_enabled to allow Grafana to perform queries via streaming.
  3. Scaling out the querier by increasing replicas to 6.
  4. Adjusting the querier’s max_concurrent_queries and queryFrontend’s concurrent_jobs.
  5. Adding scope to attribute queries in TraceQL, for example:
    .http.request.method = "GET" → span.http.request.method = "GET"

However, despite these adjustments, the performance is still below acceptable levels. Are there any additional optimizations I could make?

My Helm chart values are as follows:

tempo:
  structuredConfig:
    stream_over_http_enabled: true

metricsGenerator:
  enabled: true
  config:
    storage:
      remote_write:
        - url: http://prometheus-server.prometheus.svc.cluster.local/api/v1/write
          send_exemplars: true

ingester:
  resources:
    limits:
      memory: 8Gi

queryFrontend:
  replicas: 2
  config:
    max_outstanding_per_tenant: 2000
    search:
      concurrent_jobs: 100
      target_bytes_per_job: 52428800 # 50 MiB

querier:
  replicas: 6
  resources:
    limits:
      memory: 10Gi
  config:
    search:
      query_timeout: 60s
    max_concurrent_queries: 30

compactor:
  replicas: 3
  config:
    compaction:
      block_retention: 504h # 21 days

distributor:
  replicas: 3

traces:
  otlp:
    http:
      enabled: true
    grpc:
      enabled: true

storage:
  trace:
    block:
      version: vParquet4
      dedicated_columns:
        - name: service.name
          type: string
          scope: resource
        - name: k8s.namespace.name
          type: string
          scope: resource
        - name: url.path
          type: string
          scope: span
        - name: http.route
          type: string
          scope: span
        - name: http.target
          type: string
          scope: span
        - name: http.request.method
          type: string
          scope: span
        - name: http.response.status_code
          type: string
          scope: span
        - name: db.name
          type: string
          scope: span
        - name: db.system
          type: string
          scope: span
        - name: peer.service
          type: string
          scope: span
    backend: s3
    s3:
      access_key: 'xxx'
      secret_key: 'xxx'
      bucket: 'tempo-bucket'
      endpoint: 'minio.tenant.svc.cluster.local'
      insecure: true

global_overrides:
  defaults:
    metrics_generator:
      processors:
        - service-graphs
        - span-metrics

server:
  http_server_read_timeout: 2m
  http_server_write_timeout: 2m
  grpc_server_max_recv_msg_size: 16777216 # 16 MiB
  grpc_server_max_send_msg_size: 16777216 # 16 MiB
@joe-elliott (Member) commented

1000 spans/second is quite low and I'm surprised you're seeing performance issues. Query performance can be quite difficult to debug remotely, but we will do our best.

Can you provide an example of a traceql query that is having issues and the corresponding query frontend logs? We can start there and try to figure out what is causing issues.

@g3david405 (Author) commented

Hi @joe-elliott,
Here is my TraceQL query: it searches for backend API traces with status 5xx, limit 100, over a 12-hour range.

{ resource.service.name =~ "$Job" && (span.url.path=~"$Url" || span.http.route=~"$Url" || span.http.target=~"$Url") && span.http.request.method=~"$Method" && span.http.response.status_code>499 && (resource.k8s.namespace.name=~"$Namespace" || "$Namespace" = ".*") }

Here are my query-frontend logs:

level=info ts=2024-11-02T16:05:11.870021575Z caller=search_handlers.go:186 msg="search request" tenant=single-tenant query="{ resource.service.name =~ \"{MicroService Name}\" && (span.url.path=~\"\" || span.http.route=~\"\" || span.http.target=~\"\") && span.http.request.method=~\".+\" && span.http.response.status_code>499 && (resource.k8s.namespace.name=~\".*\" || \".*\" = \".*\") }" range_seconds=43200 limit=100 spans_per_spanset=3
level=info ts=2024-11-02T16:05:45.5573293Z caller=reporter.go:257 msg="reporting cluster stats" date=2024-11-02T16:05:45.557324874Z
level=info ts=2024-11-02T16:05:45.904376455Z caller=poller.go:256 msg="successfully pulled tenant index" tenant=single-tenant createdAt=2024-11-02T16:00:37.074394709Z metas=547 compactedMetas=28
level=info ts=2024-11-02T16:05:45.904500382Z caller=poller.go:142 msg="blocklist poll complete" seconds=0.371037565
level=info ts=2024-11-02T16:06:38.069118075Z caller=search_handlers.go:167 msg="search results" tenant=single-tenant query="{ resource.service.name =~ \"{MicroService Name}\" && (span.url.path=~\"\" || span.http.route=~\"\" || span.http.target=~\"\") && span.http.request.method=~\".+\" && span.http.response.status_code>499 && (resource.k8s.namespace.name=~\".*\" || \".*\" = \".*\") }" range_seconds=43200 duration_seconds=86.199097056 request_throughput=4.140089054159755e+07 total_requests=409 total_blockBytes=29200227564 total_blocks=31 completed_requests=409 inspected_bytes=3568719382 inspected_traces=0 inspected_spans=0 status_code=-1 error=null

In this example I only filter by microservice name. If I also filter by API path, the query takes even longer.
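
For illustration (a simplified sketch with hypothetical service and path values, not taken from the report above), a path-filtered variant of the query looks like this:

{ resource.service.name =~ "checkout.*" && span.url.path =~ "/api/v1/orders.*" && span.http.response.status_code > 499 }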

And here is what my Grafana frontend shows for the streaming result:
[screenshot]

I can give more information if you need it.
Thanks!

@g3david405 (Author) commented

Any update?

@g3david405 (Author) commented

Still waiting for a reply, thanks.

@joe-elliott (Member) commented

Howdy, apologies for letting this sit. Been overwhelmed a bit by repo activity lately.

{ resource.service.name =~ "$Job" && (span.url.path=~"$Url" || span.http.route=~"$Url" || span.http.target=~"$Url") && span.http.request.method=~"$Method" && span.http.response.status_code>499 && (resource.k8s.namespace.name=~"$Namespace" || "$Namespace" = ".*") }

This is a brutal query. The obvious issue is the number of regexes. The less obvious issue is that Tempo is really fast at evaluating a set of purely &&'ed conditions and much slower at mixed sets of && and || conditions.
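
For illustration, here is a hypothetical rewrite (assuming the dashboard variables resolve to single concrete values; the service and route shown are placeholders) that expresses the same intent with exact matchers and drops the always-true clauses, leaving a pure set of &&'ed conditions:

{ resource.service.name = "checkout-service" && span.http.route = "/api/v1/orders" && span.http.request.method = "GET" && span.http.response.status_code > 499 }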

The good news:

  • We are looking at regex performance right now and have plans to improve it.
  • Two PRs in main are already focused on improving the performance of mixed conditions.

Expect this query to improve in 2.7.
