🎯 Problem to be solved
To better understand the timing breakdown of the entire proposal duty flow in real clusters (both test and production), we aim to visualize the time spent on tasks such as querying the beacon node (BN), reaching consensus, and other steps for specific duties or slots. This data will help us investigate missed proposals more quickly and, if necessary, adjust consensus timings or shift focus to other significant contributors to delays.
Charon currently has initial support for Jaeger, which has not been widely used for debugging or production purposes due to the high telemetry traffic it generates. The proposed solution is to start using Grafana Tempo (alongside Prometheus and Loki) as the server for collecting telemetry events in production, while narrowing the tracing scope to the Propose duty only, to minimize traffic.
Under this ticket, we need to revisit the existing Jaeger-specific code and CLI flags and make them backend-agnostic by adopting the OpenTelemetry library. This library is flexible enough to support tracing with Jaeger, Tempo, and other backends. Additionally, we will need to remove or disable most of the existing tracing for other duties and HTTP calls, or make it conditional.
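As a rough illustration of "making tracing conditional", the gate could be a small per-duty allow-list that is consulted before a span is started. The sketch below uses only the Go standard library; `DutyType`, its constants, `TraceFilter`, and `ShouldTrace` are hypothetical stand-ins for illustration, not Charon's actual types or the OpenTelemetry API.

```go
package main

import "fmt"

// DutyType is a hypothetical stand-in for a duty type enum.
type DutyType int

const (
	DutyProposer DutyType = iota
	DutyAttester
	DutySyncMessage
)

// TraceFilter holds the set of duty types that are allowed to emit spans.
type TraceFilter struct {
	enabled map[DutyType]bool
}

// NewTraceFilter returns a filter that enables tracing only for the given duties.
func NewTraceFilter(duties ...DutyType) TraceFilter {
	enabled := make(map[DutyType]bool, len(duties))
	for _, d := range duties {
		enabled[d] = true
	}
	return TraceFilter{enabled: enabled}
}

// ShouldTrace reports whether spans should be recorded for the duty type.
// Duties that were never enabled return false.
func (f TraceFilter) ShouldTrace(d DutyType) bool {
	return f.enabled[d]
}

func main() {
	// Assumption: only the Propose duty is traced, to keep telemetry traffic low.
	filter := NewTraceFilter(DutyProposer)
	for _, d := range []DutyType{DutyProposer, DutyAttester, DutySyncMessage} {
		fmt.Printf("duty=%d trace=%t\n", d, filter.ShouldTrace(d))
	}
}
```

In a real implementation the same decision would more likely live in the tracer's sampler or in the code that starts the root span, so that child spans for skipped duties are never created.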
🛠️ Proposed solution
Change the existing Jaeger support to work with Tempo (or, better, make it backend-agnostic).
Ensure the existing tracing calls are disabled or removed.
Ensure the entire Propose flow is fully covered with sufficient spans to give us the full picture on timings.
Work with the Platform team to set up a Tempo server instance for our clients.
Change *CDVN to include a Tempo instance and the Charon CLI flags needed to use it.
Test with Kurtosis and Canary clusters.
🧪 Tests
Tested by new automated unit/integration/smoke tests
Manually tested on core team/canary/test clusters
Manually tested on local compose simnet