Integrate OpenTelemetry and Prometheus #106

sergiimk · 2024-07-23T04:06:11Z

Two new util crates are introduced:

observability: common setup for logs, tracing, and metrics
graceful-shutdown: a mini-crate for signal handling

Graceful shutdown was implemented for api-server. It currently waits only for pending HTTP requests to finish, not for other protocols.
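To illustrate what "waiting for pending requests" means, here is a minimal std-only sketch of the draining step. It is not the graceful-shutdown crate's code: signal handling is left out, worker threads stand in for HTTP requests, and `InFlight` and `drain_three_requests` are hypothetical names.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

/// Counts in-flight requests so that shutdown can block until they finish.
#[derive(Default)]
struct InFlight {
    count: Mutex<usize>,
    drained: Condvar,
}

impl InFlight {
    /// Called when a request starts.
    fn begin(&self) {
        *self.count.lock().unwrap() += 1;
    }

    /// Called when a request finishes; wakes waiters once nothing is pending.
    fn end(&self) {
        let mut count = self.count.lock().unwrap();
        *count -= 1;
        if *count == 0 {
            self.drained.notify_all();
        }
    }

    /// Blocks the caller until every started request has ended.
    fn wait_until_drained(&self) {
        let mut count = self.count.lock().unwrap();
        while *count > 0 {
            count = self.drained.wait(count).unwrap();
        }
    }

    fn pending(&self) -> usize {
        *self.count.lock().unwrap()
    }
}

/// Starts three simulated requests, then "shuts down" by waiting for them.
/// Returns the number of requests still pending afterwards.
fn drain_three_requests() -> usize {
    let tracker = Arc::new(InFlight::default());
    let handles: Vec<_> = (0..3u64)
        .map(|i| {
            tracker.begin();
            let tracker = Arc::clone(&tracker);
            thread::spawn(move || {
                thread::sleep(Duration::from_millis(10 * (i + 1)));
                tracker.end();
            })
        })
        .collect();

    // In a real server this point would be reached after a shutdown signal.
    tracker.wait_until_drained();
    for handle in handles {
        handle.join().unwrap();
    }
    tracker.pending()
}
```

Connections using other protocols would need to register with the same kind of tracker to be covered, which matches the limitation noted above.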

The Bunyan log format was removed. Instead, the apps auto-detect the environment:

if stderr points to a TTY (developer mode), they use a pretty text format (see example below)
otherwise they use tracing's built-in JSON format
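The detection rule can be sketched with the standard library alone. This is an illustration of the rule as described, not the observability crate's code; `LogFormat`, `choose_format`, and `detect_format` are hypothetical names.

```rust
use std::io::IsTerminal;

/// The two output formats described above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum LogFormat {
    /// Human-readable multi-line text for a developer's terminal.
    Pretty,
    /// One JSON object per event, for log collectors.
    Json,
}

/// Pure decision rule, separated out so it is easy to test.
fn choose_format(stderr_is_tty: bool) -> LogFormat {
    if stderr_is_tty {
        LogFormat::Pretty
    } else {
        LogFormat::Json
    }
}

/// Inspects the real stderr handle of the current process.
fn detect_format() -> LogFormat {
    choose_format(std::io::stderr().is_terminal())
}
```

Keeping the decision in a pure function means the format can still be forced in tests or overridden by configuration without touching the process's real stderr.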

Better defaults for HTTP tracing were configured. The example below shows span start and end (note the method and route span fields) along with HTTP request and HTTP response events that include the full URI, headers, and latency:

  2024-07-23T03:54:18.107150Z  INFO observability::axum: new
    at src/utils/observability/src/axum.rs:86 on tokio-runtime-worker
    in observability::axum::http_request with method: GET, route: /:account/:dataset

  2024-07-23T03:54:18.107294Z  INFO observability::axum: HTTP request, uri: /foo/bar?some=param, version: HTTP/1.1, headers: {"accept-encoding": "gzip, deflate, br", "user-agent": "xh/0.22.2", "connection": "keep-alive", "accept": "*/*", "host": "localhost:3003"}
    at src/utils/observability/src/axum.rs:33 on tokio-runtime-worker
    in observability::axum::http_request with method: GET, route: /:account/:dataset

  2024-07-23T03:54:18.107595Z  INFO observability::axum: HTTP response, status: 405, headers: {"content-length": "0", "access-control-allow-origin": "*", "vary": "origin", "vary": "access-control-request-method", "vary": "access-control-request-headers"}, latency: 0 ms
    at src/utils/observability/src/axum.rs:54 on tokio-runtime-worker
    in observability::axum::http_request with method: GET, route: /:account/:dataset

  2024-07-23T03:54:18.107806Z  INFO observability::axum: close, time.busy: 389µs, time.idle: 272µs
    at src/utils/observability/src/axum.rs:86 on tokio-runtime-worker
    in observability::axum::http_request with method: GET, route: /:account/:dataset

When the OTLP_ENDPOINT env var is provided, an OpenTelemetry layer is configured:

trace_id appears in the root span, allowing us to link logs to traces in Grafana
Traces are sent via gRPC to the OTEL collector
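The opt-in behaviour can be sketched as follows. Only the OTLP_ENDPOINT variable name comes from this PR; `TraceExport`, `export_from`, and `export_from_env` are hypothetical, and treating an empty or whitespace-only value as "not provided" is an assumption of this sketch.

```rust
use std::env;

/// Whether traces are exported, and where to.
#[derive(Debug, PartialEq, Eq)]
enum TraceExport {
    /// No endpoint configured: only local logging is set up.
    Disabled,
    /// Export traces to the collector at `endpoint`.
    Otlp { endpoint: String },
}

/// Decides the export mode from an optional raw value.
/// Assumption: blank values count as "not provided".
fn export_from(value: Option<&str>) -> TraceExport {
    match value.map(str::trim) {
        Some(v) if !v.is_empty() => TraceExport::Otlp {
            endpoint: v.to_string(),
        },
        _ => TraceExport::Disabled,
    }
}

/// Reads the OTLP_ENDPOINT environment variable of the current process.
fn export_from_env() -> TraceExport {
    export_from(env::var("OTLP_ENDPOINT").ok().as_deref())
}
```

Because the export is decided purely from the environment, a new build can be deployed first and tracing switched on later by setting the variable.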

New system endpoints are proposed:

/system/health?type={liveness,readiness,startup} - using k8s semantics
/system/metrics for Prometheus metrics
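As a sketch of how the `type` query parameter might be parsed, assuming the three values listed above are the only valid ones: `ProbeType` and `probe_from_query` are hypothetical names, and rejecting a missing parameter (rather than falling back to a default) is an assumption of this example.

```rust
use std::str::FromStr;

/// The three Kubernetes probe kinds accepted by /system/health.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ProbeType {
    Liveness,
    Readiness,
    Startup,
}

impl FromStr for ProbeType {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "liveness" => Ok(Self::Liveness),
            "readiness" => Ok(Self::Readiness),
            "startup" => Ok(Self::Startup),
            other => Err(format!("unknown probe type: {other}")),
        }
    }
}

/// Extracts `type` from a raw query string such as "type=readiness".
/// Assumption: a missing `type` parameter is reported as an error.
fn probe_from_query(query: &str) -> Result<ProbeType, String> {
    query
        .split('&')
        .filter_map(|pair| pair.split_once('='))
        .find(|(key, _)| *key == "type")
        .ok_or_else(|| "missing `type` parameter".to_string())
        .and_then(|(_, value)| value.parse())
}
```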

The order of middlewares was changed so that /system/health sits outside the tracing middleware, to avoid flooding the logs with health-check spam.
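The effect of that ordering can be modelled without any web framework. The sketch below is a simplified stand-in for the real axum middleware stack: handlers are plain closures, and `with_tracing` and `build_app` are hypothetical names.

```rust
use std::cell::RefCell;
use std::rc::Rc;

/// A handler maps a request path to an HTTP status code.
type Handler = Box<dyn Fn(&str) -> u16>;
/// Shared buffer standing in for the log output.
type Log = Rc<RefCell<Vec<String>>>;

/// Wraps a handler so every request passing through it is logged.
fn with_tracing(inner: Handler, log: Log) -> Handler {
    Box::new(move |path: &str| {
        log.borrow_mut().push(format!("HTTP request, uri: {path}"));
        let status = inner(path);
        log.borrow_mut().push(format!("HTTP response, status: {status}"));
        status
    })
}

/// Builds the app with the health endpoint outside the tracing wrapper.
fn build_app(log: Log) -> Handler {
    let api: Handler = Box::new(|_path: &str| 200u16);
    let traced = with_tracing(api, log);
    Box::new(move |path: &str| {
        if path.starts_with("/system/health") {
            // Answered before the tracing wrapper is reached: nothing is logged.
            200
        } else {
            traced(path)
        }
    })
}
```

With this arrangement, frequent probe requests produce no log events, while all other routes still get the request and response events.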

No-op health endpoints were added to the api-server and oracle-provider apps.

Prometheus metrics were added to the oracle-provider app as a test of the integration.
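For readers unfamiliar with what /system/metrics returns, here is a hand-rolled sketch of a counter rendered in the Prometheus text exposition format. It is illustrative only: a real app would more likely use a metrics library, and the `Counter` type and the metric name in the usage note are hypothetical, not taken from oracle-provider.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// A monotonically increasing counter that is safe to share across threads.
struct Counter {
    name: &'static str,
    help: &'static str,
    value: AtomicU64,
}

impl Counter {
    const fn new(name: &'static str, help: &'static str) -> Self {
        Self {
            name,
            help,
            value: AtomicU64::new(0),
        }
    }

    /// Increments the counter by one.
    fn inc(&self) {
        self.value.fetch_add(1, Ordering::Relaxed);
    }

    /// Renders the HELP line, the TYPE line, and the current sample.
    fn render(&self) -> String {
        format!(
            "# HELP {name} {help}\n# TYPE {name} counter\n{name} {value}\n",
            name = self.name,
            help = self.help,
            value = self.value.load(Ordering::Relaxed),
        )
    }
}
```

A counter created as `Counter::new("example_requests_total", "Requests handled")` and incremented twice renders three lines: the HELP line, the TYPE line, and `example_requests_total 2`.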

s373r

Added several minor questions

src/app/oracle-provider/src/provider.rs

src/utils/observability/src/axum.rs

src/utils/observability/src/config.rs

zaychenko-sergei · 2024-07-23T12:35:47Z

As I understood, this PR will have to wait for Hauki to configure the endpoint?

sergiimk · 2024-07-23T17:06:32Z

As I understood, this PR will have to wait for Hauki to configure the endpoint?

@zaychenko-sergei, not really - we can deploy the new version and turn on OTLP_ENDPOINT later.

sergiimk requested review from zaychenko-sergei and s373r July 23, 2024 04:06

sergiimk force-pushed the feature/observability branch from 9453a60 to e6da15e Compare July 23, 2024 04:12

s373r approved these changes Jul 23, 2024

sergiimk force-pushed the feature/observability branch 2 times, most recently from 9f25f7f to c73aaa5 Compare July 23, 2024 17:39

Integrate OpenTelemetry and Prometheus

efd12d2

sergiimk force-pushed the feature/observability branch from c73aaa5 to efd12d2 Compare July 23, 2024 17:53

sergiimk merged commit efd12d2 into master Jul 23, 2024
3 checks passed

sergiimk deleted the feature/observability branch July 23, 2024 18:07
