rpc-alt: metrics #20728
Curious why we have to move the metrics to a separate crate? It's still part of the framework, right?
In my mind the indexing framework is about writing to the database, so it didn't make sense that the read side would depend on the indexing framework, especially not to use its metrics service, which is a shared component to both (put another way, I wouldn't think to look inside the indexing framework for a metrics service). This PR does not get us there, but I think ideally (when prepared for external folks to use) the indexing framework would just deal with the prometheus
@amnn would it be possible to update jsonrpsee for the whole project vs using a separate version just for the indexer? I understand there may be additional work to do that, but it would be nice if we limited the duplicate deps that we bring into the repository. Any metric improvements you're making here may also benefit the existing on-node JSON-RPC service.
Also, FWIW, as part of the new gRPC service I'm trying to take a ground-up approach to how we capture metrics for HTTP services, so that we can handle some of the edge cases we've been missing. It may be worth syncing to see if there's some overlap we could share.
Re: metrics, that would be great -- I would love to have a plug-in metrics service that we could use in any of the services we build. That should generally raise the bar for our observability, and while I originally put the metrics service in the framework, my feeling now is that this should be a separate concern. Let's discuss more in the new year.

Re: upgrading jsonrpsee everywhere -- I did anticipate this question and went back and forth on it. Ultimately for me it was a backlog item for three reasons:
For me it's really a toss-up between there coming a time when the benefits outweigh the costs of upgrading jsonrpsee for our existing crates and there coming a time when we can just delete the on-node JSONRPC. Even though deletion is realistically a while away, I'm trying hard not to get bogged down in these additional complexities, because that removes the advantages of starting from a fresh page.
// SPDX-License-Identifier: Apache-2.0

/// A JSONRPC module implementation coupled with a description of its schema.
pub trait RpcModule: Sized {
Looks like this is a wrapper of `jsonrpsee`'s `RpcModule`, with the only difference being the added `schema` method. And right now this schema method is used to add a module's schema to `RpcService`'s `schema`. How is this schema going to be used? For generating documentation and/or showing users the available schema?
This trait does two things:
- It creates a standard interface for going from a `T` to an `RpcModule<T>`. `jsonrpsee` generates an `into_rpc` function, but it's not abstracted behind a single, common interface, which we need for the reader to have an interface where you can register any module.
- The added `schema`, as you mentioned. This is used to implement the `rpc.discover` method.
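To make the two roles of the trait concrete, here is a minimal, std-only sketch. It is not the PR's implementation: `Module`, `Methods`, `Schema`, `Service`, and `Governance` are hypothetical stand-ins (the real code builds on `jsonrpsee`'s `RpcModule<T>` and the `sui-open-rpc` schema types), and the schema description string is illustrative.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for a module's schema: method name -> description.
#[derive(Clone, Debug, Default, PartialEq)]
pub struct Schema {
    pub methods: BTreeMap<String, String>,
}

/// Hypothetical stand-in for a table of registered methods.
#[derive(Default)]
pub struct Methods {
    names: Vec<String>,
}

/// One common interface: "describe yourself" and "turn into registered methods".
pub trait Module: Sized {
    fn schema(&self) -> Schema;
    fn into_methods(self) -> Methods;
}

/// The reader side: accepts any `Module`, accumulating its methods and schema.
#[derive(Default)]
pub struct Service {
    methods: Methods,
    schema: Schema,
}

impl Service {
    pub fn add_module<M: Module>(&mut self, module: M) {
        // Read the schema first, because `into_methods` consumes the module.
        self.schema.methods.extend(module.schema().methods);
        self.methods.names.extend(module.into_methods().names);
    }

    /// What a discovery method could serve back to clients.
    pub fn discover(&self) -> &Schema {
        &self.schema
    }

    pub fn method_names(&self) -> &[String] {
        &self.methods.names
    }
}

/// Hypothetical module exposing a single method.
pub struct Governance;

impl Module for Governance {
    fn schema(&self) -> Schema {
        let mut methods = BTreeMap::new();
        methods.insert(
            "suix_getReferenceGasPrice".to_owned(),
            "Reference gas price for the latest epoch".to_owned(),
        );
        Schema { methods }
    }

    fn into_methods(self) -> Methods {
        Methods {
            names: vec!["suix_getReferenceGasPrice".to_owned()],
        }
    }
}
```

Because `add_module` is generic over the trait, the service does not need to know about any concrete module type, and the accumulated schema is available as soon as registration finishes.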
impl RpcMetrics {
    pub(crate) fn new(registry: &Registry) -> Arc<Self> {
        Arc::new(Self {
            db_latency: register_histogram_with_registry!(
I wonder if it's possible/makes sense to have these db metrics labeled by methods as well.
I would be hesitant to go down this route because imposing that constraint may make the implementation less efficient -- think of things like the DataLoader pattern where we group together multiple disparate method requests into one DB request and shoot that off to the DB.
In cases where you can associate a DB request with a specific method, you should get a pretty good picture of the cost specific to that method using the `request_latency` metric. This can fall apart when the method implementation is sufficiently complicated that it's making multiple DB requests in sequence. The incidence of this should be rare for JSON-RPC, but we might have some examples (e.g. fetch ObjectIDs and then fetch their contents), but those are also the cases where we might employ things like the DataLoader pattern.
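The batching concern can be illustrated with a small, std-only sketch of the DataLoader idea. Everything here is hypothetical (`Db`, `Pending`, `load_batch`, and the method names `method_a` and `method_b`); the point is only that lookups originating from different methods are merged into one DB request, so that request has no single method to be labelled with.

```rust
use std::collections::{BTreeSet, HashMap};

/// Hypothetical stand-in for a database keyed by object ID.
pub struct Db {
    rows: HashMap<u64, String>,
    /// Number of round trips made so far.
    pub queries_issued: usize,
}

impl Db {
    pub fn new(rows: HashMap<u64, String>) -> Self {
        Self { rows, queries_issued: 0 }
    }

    /// One round trip that fetches many keys at once.
    pub fn multi_get(&mut self, keys: &BTreeSet<u64>) -> HashMap<u64, String> {
        self.queries_issued += 1;
        keys.iter()
            .filter_map(|k| self.rows.get(k).map(|v| (*k, v.clone())))
            .collect()
    }
}

/// A pending lookup, tagged with the RPC method that asked for it.
pub struct Pending {
    pub method: &'static str,
    pub key: u64,
}

/// Resolve all pending lookups with a single DB request, then hand each
/// caller its own result (or `None` if the key was not found).
pub fn load_batch(db: &mut Db, pending: &[Pending]) -> Vec<(&'static str, Option<String>)> {
    // Deduplicate keys across all callers, regardless of method.
    let keys: BTreeSet<u64> = pending.iter().map(|p| p.key).collect();
    let rows = db.multi_get(&keys);
    pending
        .iter()
        .map(|p| (p.method, rows.get(&p.key).cloned()))
        .collect()
}
```

In this sketch three lookups from two different methods cost one query, which is the efficiency a per-method label on DB metrics would get in the way of.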
// Add a method to serve the schema to clients.
modules
    .register_method("rpc.discover", move |_, _, _| json!(schema.clone()))
    .context("Failed to add schema discovery method")?;
@emmazzz, here we are adding the `rpc.discover` method that uses the results from calling `RpcModule::schema`.
lets track ™️
/// Collects information about the database connection pool.
pub struct DbConnectionStatsCollector {
    db: Db,
    desc: Vec<(MetricType, Desc)>,
}
ooh this is great. thoughts on extending this for metrics like dead tuples per table, storage usage per table, sum storage -> so we can create alerts on these?
Interesting idea -- originally, I was thinking we would get these directly from the DB provider, but yes, we could collect them here! There's a task for that, so let me link to this idea over there.
## Description

Track request latencies, success and failure rates, report them through a dedicated metrics endpoint, and report them through tracing as well.

## Test plan

New unit tests for request counts and graceful shutdown, and manual testing for tracing:

```
sui$ RUST_LOG=DEBUG cargo run -p -- sui-indexer-alt-jsonrpc
2024-12-21T00:02:46.720775Z INFO sui_indexer_alt_jsonrpc: Starting JSON-RPC service on 0.0.0.0:6000
2024-12-21T00:02:46.720775Z INFO sui_indexer_alt_jsonrpc::metrics: Starting metrics service on 0.0.0.0:9184
2024-12-21T00:02:49.676408Z DEBUG jsonrpsee-server: Accepting new connection 1/100
2024-12-21T00:02:49.691414Z DEBUG sui_indexer_alt_jsonrpc::api: SELECT "kv_epoch_starts"."reference_gas_price" FROM "kv_epoch_starts" ORDER BY "kv_epoch_starts"."epoch" DESC LIMIT $1 -- binds: [1]
2024-12-21T00:02:49.694647Z INFO sui_indexer_alt_jsonrpc::metrics::middleware: Request succeeded method="suix_getReferenceGasPrice" elapsed_ms=1.7868458000000002e-5
2024-12-21T00:03:02.630356Z DEBUG jsonrpsee-server: Accepting new connection 1/100
2024-12-21T00:03:02.630585Z INFO sui_indexer_alt_jsonrpc::metrics::middleware: Request failed method="suix_getReferenceGasPrice_non_existent" code=-32601 elapsed_ms=3.9333e-8
```
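As a rough illustration of what such a middleware records, the following std-only sketch wraps a handler call, times it, and bumps a success or failure count for the method. `RequestMetrics` and `track` are hypothetical names; the actual change hooks into `jsonrpsee`'s middleware support and reports through Prometheus and tracing rather than in-memory maps.

```rust
use std::collections::BTreeMap;
use std::time::{Duration, Instant};

/// Hypothetical in-memory stand-in for per-method request metrics.
#[derive(Default)]
pub struct RequestMetrics {
    pub succeeded: BTreeMap<String, u64>,
    pub failed: BTreeMap<String, u64>,
    pub latencies: BTreeMap<String, Vec<Duration>>,
}

impl RequestMetrics {
    /// Run `handler`, recording how long it took and whether it succeeded.
    /// The handler's result is passed through unchanged.
    pub fn track<T, E>(
        &mut self,
        method: &str,
        handler: impl FnOnce() -> Result<T, E>,
    ) -> Result<T, E> {
        let start = Instant::now();
        let result = handler();
        let elapsed = start.elapsed();

        // Latency is recorded for successes and failures alike.
        self.latencies
            .entry(method.to_owned())
            .or_default()
            .push(elapsed);

        let bucket = if result.is_ok() {
            &mut self.succeeded
        } else {
            &mut self.failed
        };
        *bucket.entry(method.to_owned()).or_insert(0) += 1;

        result
    }
}
```

Keeping the wrapper transparent (it returns exactly what the handler returned) means it can sit in front of any method without changing behaviour.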
## Description

Request metrics are labelled with their method, which is a user input. Users can pollute our metrics by sending methods that the RPC does not support. This change mitigates that by detecting unrecognized methods and normalizing them to an "<UNKNOWN>" label.

## Test plan

Updated unit tests:

```
sui$ cargo nextest run -p sui-indexer-alt-jsonrpc
```
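The normalization step might look roughly like the sketch below. It is std-only and uses hypothetical names (`RequestCounts`, `observe`, `label_count`); only the `"<UNKNOWN>"` label and the idea of checking against the set of supported methods come from the description above.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Label used for any method the service does not recognise.
pub const UNKNOWN: &str = "<UNKNOWN>";

/// Hypothetical request counter keyed by a normalized method label.
pub struct RequestCounts {
    known: BTreeSet<String>,
    counts: BTreeMap<String, u64>,
}

impl RequestCounts {
    /// `methods` is the set of supported method names, e.g. taken from the schema.
    pub fn new<I, S>(methods: I) -> Self
    where
        I: IntoIterator<Item = S>,
        S: Into<String>,
    {
        Self {
            known: methods.into_iter().map(Into::into).collect(),
            counts: BTreeMap::new(),
        }
    }

    /// Map user-supplied method names onto a bounded set of labels.
    fn label<'a>(&self, method: &'a str) -> &'a str {
        if self.known.contains(method) {
            method
        } else {
            UNKNOWN
        }
    }

    pub fn observe(&mut self, method: &str) {
        let label = self.label(method).to_owned();
        *self.counts.entry(label).or_insert(0) += 1;
    }

    pub fn get(&self, label: &str) -> u64 {
        self.counts.get(label).copied().unwrap_or(0)
    }

    /// Number of distinct labels seen so far.
    pub fn label_count(&self) -> usize {
        self.counts.len()
    }
}
```

However many distinct bogus method names arrive, the number of labels is bounded by the number of supported methods plus one.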
## Description

Track how long each Database request is taking, how many requests we've issued, and how many have succeeded/failed.

## Test plan

Run server, make request, check metrics.
## Description

Add an automatic function to advertise the RPC's schema (version, methods, types) to clients.

## Test plan

New unit tests:

```
sui$ cargo nextest run -p sui-indexer-alt-jsonrpc
```
## Description

The RPC and Indexer implementations use essentially the same metrics service, and when they are combined in a single process they need to use the same instance of that metrics service as well. This change starts that process, by pulling out the implementation from `sui-indexer-alt`. In a later change, the metrics service in the RPC implementation will switch over to this common crate as well.

This unblocks running both services together in one process, which is something we will need to do as part of E2E tests.

## Test plan

Run the indexer and make sure it can be cancelled (with Ctrl-C), and that it also shuts down cleanly when run to a specific last checkpoint.
## Description

...and use it in the RPC implementation as well. A parameter has been added to add a prefix to all metric names. This ensures that if an Indexer and RPC service are sharing a metrics service, they won't stomp on each other's metrics.

## Test plan

Run the RPC service and check its metrics.
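A small sketch of why the prefix matters when two services share one registry. This is std-only and entirely hypothetical (`SharedRegistry`, `Prefixed`, and the metric name `latency` are made up for illustration); it does not reflect how the shared metrics crate or the `prometheus` registry is actually wired up.

```rust
use std::collections::BTreeMap;

/// Hypothetical registry shared by several services in one process.
#[derive(Default)]
pub struct SharedRegistry {
    counters: BTreeMap<String, u64>,
}

/// A view onto the shared registry that prefixes every metric name.
pub struct Prefixed<'r> {
    prefix: String,
    registry: &'r mut SharedRegistry,
}

impl SharedRegistry {
    pub fn with_prefix(&mut self, prefix: &str) -> Prefixed<'_> {
        Prefixed {
            prefix: prefix.to_owned(),
            registry: self,
        }
    }

    /// All registered metric names, in sorted order.
    pub fn names(&self) -> Vec<&str> {
        self.counters.keys().map(String::as_str).collect()
    }
}

impl Prefixed<'_> {
    /// Register `<prefix>_<name>`, failing if that full name is already taken.
    pub fn register(&mut self, name: &str) -> Result<(), String> {
        let full = format!("{}_{}", self.prefix, name);
        if self.registry.counters.contains_key(&full) {
            return Err(format!("duplicate metric: {}", full));
        }
        self.registry.counters.insert(full, 0);
        Ok(())
    }
}
```

With distinct prefixes, both services can register a metric with the same short name; without them, the second registration would collide with the first.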
## Description

Make various configs public, mainly so they can be used from the test config.

## Test plan

CI
## Description

This change adds a new custom command to the transactional test runner for calling a JSON-RPC query. This is supported by a new function on the `OffchainStateReader` trait to execute that JSON-RPC query, which is implemented trivially in the GraphQL transactional tests.

## Test plan

CI
## Description

Set-up transactional tests for Indexer-Alt and RPC-Alt

## Test plan

Run the new transactional tests:

```
sui$ cargo nextest run -p sui-indexer-alt-e2e-tests
```
## Description

This used to be part of the interface needed by the transactional test runner to talk to the off chain state of the test cluster, but it is no longer in use (and isn't necessary for `sui-indexer-alt`), so it's easiest to get rid of it.

## Test plan

```
sui$ cargo nextest run \
  -p sui-indexer-alt-e2e-tests \
  -p sui-graphql-e2e-tests
```
Description

Add metrics tracking to the RPC service (tracking latencies as well as success/failure/total counts for requests, broken down by method, and DB queries). This required some additional/surrounding changes:

- Upgrading `jsonrpsee` (just for `sui-indexer-alt-jsonrpc`), which has better support for adding middleware that is aware of a JSON-RPC request's structure. This is used to add a middleware to track request latencies.
- Pulling out `MetricsService` from the indexing framework, to be used in the reader as well. This was mostly motivated by the fact that in a test cluster or local network, we would need to serve metrics for both the reader and the indexer from the same endpoint -- so we will want to run one instance of this service and register metrics for both the reader and the indexer (this change also simplifies factoring out the ingestion service because we can now factor out its metrics).

The changes to introduce RPC discovery were also coupled with this change -- the `rpc.discover` endpoint was added using the existing `sui-open-rpc` crate -- the coupling happened because it was useful for the metrics service to know which methods were supported to avoid polluting metrics with unrecognised methods.

Test plan

Unit tests were added for RPC discovery, graceful shutdown and metrics.