
CosmosChainProcessor - dynamic block time targeting #847

Draft
agouin wants to merge 7 commits into main from andrew/cosmoschainprocessor_big_blocks

Conversation

@agouin agouin commented Jul 14, 2022

Adds a block query targeting mechanism that accounts for clock drift and the variable time an RPC node takes, after the block timestamp, before blocks are ready to be queried. This removes the fixed 1 second minQueryLoopDuration: instead, a rolling average of the delta time between blocks is maintained, which reduces queries and allows targeting block query times on chains with different consensus timeouts (block times). In addition, the block timestamp is compared against the timestamp when the queries were initiated, and a clock drift parameter is held for fine-tuning when the next block queries will be initiated. This reduces the query load on the nodes, cleans up the logs, and captures blocks as soon as they are ready to be queried.
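A minimal sketch of the targeting idea (type, field, and method names are illustrative, not this PR's actual code; the window size of 10 and the 1s fallback are assumptions):

```go
package blocktargeting

import "time"

// blockTimeTracker keeps a rolling window of block-to-block deltas plus a
// tunable trim that absorbs clock drift and RPC readiness lag.
type blockTimeTracker struct {
	deltas   []time.Duration // recent block-to-block time deltas
	timeTrim time.Duration   // fine-tuning offset (clock drift + node readiness lag)
}

// observe records the delta between two consecutive block timestamps,
// keeping only a small rolling window.
func (t *blockTimeTracker) observe(delta time.Duration) {
	t.deltas = append(t.deltas, delta)
	if len(t.deltas) > 10 {
		t.deltas = t.deltas[1:]
	}
}

// averageBlockTime returns the rolling average of observed deltas, falling
// back to the old fixed 1s loop duration before any observations exist.
func (t *blockTimeTracker) averageBlockTime() time.Duration {
	if len(t.deltas) == 0 {
		return time.Second
	}
	var sum time.Duration
	for _, d := range t.deltas {
		sum += d
	}
	return sum / time.Duration(len(t.deltas))
}

// nextQueryTime targets the instant the next block should be queryable: the
// last block's consensus timestamp plus the average block time plus the trim.
func (t *blockTimeTracker) nextQueryTime(lastBlockTimestamp time.Time) time.Time {
	return lastBlockTimestamp.Add(t.averageBlockTime() + t.timeTrim)
}
```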

@agouin agouin changed the title CosmosChainProcessor CosmosChainProcessor - query timeout backoff Jul 14, 2022
@agouin agouin changed the title CosmosChainProcessor - query timeout backoff CosmosChainProcessor - query timeout backoff retry Jul 14, 2022
@agouin agouin marked this pull request as ready for review July 15, 2022 00:02
@agouin agouin changed the title CosmosChainProcessor - query timeout backoff retry CosmosChainProcessor - query timeout backoff and block time targeting Jul 15, 2022
@agouin agouin changed the title CosmosChainProcessor - query timeout backoff and block time targeting CosmosChainProcessor - query timeout and block time targeting Jul 15, 2022
@mark-rushakoff (Member) left a comment:


There is a lot of subtlety around the timing here. Do you think there is a way to capture any of this behavior in tests?

I also don't think "clock drift" is the correct description of what is being modified here, because we are not trying to synchronize a local wall clock with a remote one, IIUC -- rather we are trying to match an irregular timing to an unpredictable remote state. Maybe "backoff" would be slightly more accurate?

@DavidNix (Contributor) left a comment:


+1, tests would help. I think this is a nice feature, especially to help with log spam. I would entertain giving it its own package, maybe internal somewhere. I bet there are other consumers (penumbra?) that would appreciate it.

```go
// succeeds, clock drift should be removed, but not as much as is added for the error
// case. Clock drift should also be added when checking the latest height and it has
// not yet incremented.
func (p *queryCyclePersistence) addClockDriftMs(ms int64) {
```
A Contributor commented:

In Go, I always pause if a duration is not type time.Duration. Other scalar types are often misinterpreted. You can easily get ms from (time.Duration).Milliseconds()
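For example (a standalone sketch; the value is arbitrary, but `Duration.Milliseconds()` and the reverse conversion are standard library):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	avgBlockTime := 6500 * time.Millisecond // keep the typed duration internally
	ms := avgBlockTime.Milliseconds()       // int64 ms only where a scalar is needed
	back := time.Duration(ms) * time.Millisecond
	fmt.Println(avgBlockTime, ms, back) // 6.5s 6500 6.5s
}
```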

@agouin (Member, Author) replied:

Using int64 for averageBlockTimeMs and timeTrimMs resulted in the fewest conversions to/from time.Duration, but I don't have a strong opinion; using time.Duration with more conversions is fine too. Will update.

relayer/chains/cosmos/cosmos_chain_processor.go (outdated review thread, resolved)
@DavidNix (Contributor) commented:

One additional question: is there ever a risk of missing a block, e.g. by waiting too long?

@agouin (Member, Author) commented Jul 15, 2022

> One additional question: is there ever a risk of missing a block, e.g. by waiting too long?

If we wait too long, it will see that multiple blocks need to be queried and query them in sequence. It always starts at persistence.latestQueriedBlock + 1, and persistence.latestQueriedBlock is only updated when a block is successfully processed.
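A standalone sketch of that catch-up behavior (identifiers other than `latestQueriedBlock` are illustrative stand-ins, not the relayer's actual code):

```go
package main

import (
	"context"
	"fmt"
)

type queryCyclePersistence struct {
	latestQueriedBlock int64
}

// queryBlock is a stand-in for the real block query and processing.
func queryBlock(ctx context.Context, height int64) error {
	fmt.Println("querying block", height)
	return nil
}

func main() {
	ctx := context.Background()
	persistence := queryCyclePersistence{latestQueriedBlock: 100}
	latestHeight := int64(104) // e.g. we waited too long and fell 4 blocks behind

	// Every missed height is queried in sequence, so waiting too long cannot
	// skip blocks; latestQueriedBlock only advances on success.
	for height := persistence.latestQueriedBlock + 1; height <= latestHeight; height++ {
		if err := queryBlock(ctx, height); err != nil {
			break // this height will be retried on the next cycle
		}
		persistence.latestQueriedBlock = height
	}
}
```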

@agouin (Member, Author) commented Jul 15, 2022

> I also don't think "clock drift" is the correct description of what is being modified here, because we are not trying to synchronize a local wall clock with a remote one, IIUC -- rather we are trying to match an irregular timing to an unpredictable remote state. Maybe "backoff" would be slightly more accurate?

It's a trim value that is the sum of both: clock drift (comparing the block's consensus timestamp against local machine time) and the variable amount of time from then until the block is ready to be queried. So yes, maybe backoff or timeTrim would be a better name.

@agouin agouin marked this pull request as draft July 15, 2022 23:05
@agouin agouin force-pushed the andrew/cosmoschainprocessor_big_blocks branch from ff458d0 to 6e36e8b on July 18, 2022 17:41
@agouin agouin changed the title CosmosChainProcessor - query timeout and block time targeting CosmosChainProcessor - dynamic block time targeting Jul 18, 2022
```go
// target ideal block query window.

// Time trim addition when a block query fails
queryFailureTimeTrimAdditionMs = 73
```
A Member commented:

I would love to hear how you arrived at these.

@agouin (Member, Author) replied:

I used prime numbers (to avoid fixed multiples) that allow a steady backoff in the case of errors (up to a limit), and a smaller number for successes to fine-tune the window in which blocks are expected to become available. This lets it find the ideal window within ~15 blocks and then remain there. Outlier scenarios, like one-off blocks that take an unexpectedly long time to become available, will not derail the targeting.
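Roughly, the asymmetric adjustment could look like this (the failure constant is from the diff above; the success-side constant, the cap, and the method name are assumptions for illustration):

```go
package targeting

const (
	queryFailureTimeTrimAdditionMs  = 73   // from the diff: prime backoff step on error
	querySuccessTimeTrimReductionMs = 23   // hypothetical smaller prime for fine tuning
	maxTimeTrimMs                   = 5000 // hypothetical cap so errors cannot back off forever
)

type queryCyclePersistence struct {
	timeTrimMs int64
}

// adjustTimeTrim backs the query window off quickly on failures and walks it
// back in slowly on successes, clamped to [0, maxTimeTrimMs]. Because the two
// steps are unequal primes, the trim settles near the ideal window instead of
// oscillating in lockstep with the block time.
func (p *queryCyclePersistence) adjustTimeTrim(queryFailed bool) {
	if queryFailed {
		p.timeTrimMs += queryFailureTimeTrimAdditionMs
		if p.timeTrimMs > maxTimeTrimMs {
			p.timeTrimMs = maxTimeTrimMs
		}
		return
	}
	p.timeTrimMs -= querySuccessTimeTrimReductionMs
	if p.timeTrimMs < 0 {
		p.timeTrimMs = 0
	}
}
```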

@jackzampolin (Member) commented:

Was just talking to @boojamya today about updating clients on channels with very little traffic (i.e. making sure the client gets updated at least once per trusting period), and we had a need for estimated block time. Are we persisting this anywhere? Also, this is a cool feature.
