Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dynamic QBFT Timeout #37

Open
wants to merge 5 commits into
base: main
Choose a base branch
from
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
90 changes: 90 additions & 0 deletions sips/dynamic_qbft_timeout.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,90 @@
| Author | Title | Category | Status |
| -------------- | --------------------- | -------- | -------- |
| Gal Rogozinski | QBFT timeout | Core | approved |


**Summary**
Changes to the constant timeouts we defined in [SIP#6](https://github.com/bloxapp/SIPs/blob/main/sips/constant_qbft_timeout.md) that should improve performance.

**Rational**
Currently for all duties, we have a `quickTimeout` of 2 seconds before sending a `RoundChange` message and a `slowTimeout` of 2 minutes starting from the 8th round.The timer starts when the QBFT instance starts. This approach has 2 issues:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you clarify btw why there is a need for slowTimeout ? Is it just purely for qbft or do we have anything in SSV node that would make use of such long timeout (I think most/all duties must finish execution much faster or just will be missed)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@iurii-ssv It was introduced in https://github.com/ssvlabs/SIPs/pull/22/files, attester duties may be submitted within 32 slots/6.4m

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea is simply wait for a very long time if after 8 rounds you didn't reach consensus. This is because we are probably in a very high latency environment

Copy link

@iurii-ssv iurii-ssv Oct 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea is simply wait for a very long time if after 8 rounds you didn't reach consensus. This is because we are probably in a very high latency environment

right, I understand the intent, just trying to figure out how exactly it applies/fits to duties,

in particular I've also left a related comment here

@iurii-ssv It was introduced in https://github.com/ssvlabs/SIPs/pull/22/files, attester duties may be submitted within 32 slots/6.4m

yup, true

at any rate, I always thought the spec should define an interface (not an implementation) everybody wants to conform to and implement it as they see fit for their use-case/application, meaning:

  • QBFT shouldn't have to know anything about duty(s), it should be the other way around - duty configures QBFT Instance to work as it needs to; concretely it means, for example, that qbft/timeout.go file shouldn't import (know about) any ethereum/ssv-related stuff - currently it does import it to get the access to func(s) that tell it how long a slot lasts etc. (in fact, ideally it doesn't even need to know what a slot is - such parameters should bear names from QBFT domain and duty processor would fill these in as needed)
  • different duty types should have different total round count, round 1 timeout, round 2 timeout, ... (seems like these values should not be constants, but rather vars - so it doesn't send the wrong message that "these values are uniform for everyone")

It's not really an issue to worry about, just trying to get some consistency (and knowledge) if possible.


1. Duties that require a pre-consensus stage may start their instances at different times, causing some nodes to timeout sooner than others. This will result in a greater probability of a failed first round.
2. Different duties have different timing assumptions, which may cause some duties to execute earlier than others.



**Beacon Duty Timing Assumptions**
The Beacon chain has some timing assumptions regarding duty execution for maximizing rewards.

| Duty | Timing Assumption | Dependency | Description |
| -------------- | ----------------- | -------------- | ------------------------------------------------------------------------------------------------------------------- |
| Proposal | < 4 Sec | --- | Attestations wait 4 seconds for a block |
| Attester | < 4 Sec | Proposal | Attestations wait 4 seconds for a block then need to be propogated within 4 seconds for aggregators to pick them up |
| Aggregator | < 4 Sec | Attester | Aggregators start aggregating within 8 seconds from slot start and need to propogate within 4 seconds |
| Sync Committee | < 4 sec | Proposal | Sync Committee wait 4 seconds for a block |
| Contribution | < 4 sec | Sync Committee | Contributions wait 8 seconds from the slot start |



**Specification**

From the slot start time after `baseDuration` delay wait 2 seconds before emitting a `RoundChange` message until the 6th round (inclusive). Afterwards, wait 2 minutes.

```go
var (
// quickTimeoutThreshold is the round after which the timeout duration increases to 2 minutes
quickTimeoutThreshold = Round(6)
// quickTimeout is the timeout in seconds for the first 6 rounds
quickTimeout int64 = 2 // 2 seconds

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO time.Duration is more readable

Suggested change
quickTimeout int64 = 2 // 2 seconds
quickTimeout time.Duration = 2 * time.Second

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was trying to move away from using go libs in SIPs
The idea is that the spec shouldn't be for helping implement a node in any language

Also for future SIPs I want to start using python like pseudo-code

Copy link

@iurii-ssv iurii-ssv Oct 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just my 2 cents,

I would say while we still use Golang might as well do it as Golang prefers to do it (as Nikita mentioned), but the idea to use pseudo-code is the right one ofc (using Python is probably not ideal, but probably better than golang for this kind of thing, as long as we keep away from using advanced Python-specific syntax),

the even more important thing though is that spec "provides interface and not implementation" - I elaborate a bit on it here

Copy link
Contributor Author

@GalRogozinski GalRogozinski Oct 31, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as long as we keep away from using advanced Python-specific syntax

That's why I said "python like pseudo code" :-)

spec "provides interface and not implementation"

Currently spec has too many implementation details including full fledged consensus impl.
This is due to legacy reasons and how the project was inherited. We are planning a Markdown spec.
All implementations that we will have will just be used for creating tests.

// slowTimeout is the timeout in seconds for rounds after the first 6
slowTimeout int64 = 120 // 2 minutes
)

// RoundTimeout returns the unix epoch time (seconds) in which we should send a RC message
// Called for all beacon duties other than proposals
func RoundTimeout(round Round, height Height, baseDuration int64, network types.BeaconNetwork) int64 {
// Calculate additional timeout in seconds based on round
var additionalTimeout int64
additionalTimeout = int64(min(round, quickTimeoutThreshold)) * quickTimeout
if round > quickTimeoutThreshold {
slowPortion := int64(round-quickTimeoutThreshold) * slowTimeout
additionalTimeout += slowPortion
}

// Combine base duration and additional timeout
timeoutDuration := baseDuration + additionalTimeout

// Get the unix epoch start time of the duty seconds
dutyStartTime := network.EstimatedTimeAtSlot(phase0.Slot(height))

// Calculate the time until the duty should start plus the timeout duration
return dutyStartTime + timeoutDuration
}
```

*Proposals*

No changes from [SIP#6](https://github.com/bloxapp/SIPs/blob/main/sips/constant_qbft_timeout.md). Doesn't rely on the slot/duty start time.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we want to change this?
The only issue I see is that we have baseDuration=0 andf preconsensus stage might be slow, but not sure if it is so bad...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess the issue is that if precon finished at a late time the first RC message should be for a later round?

Same issue with Aggregator duty (but very unlikely)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds to me like we do need to do something with proposals. , a dynamic timeout will solve an issue where some nodes might receive the block at different times (different beacon nodes) and then start and end the rounds on different times.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@moshe-blox

  1. We start the timer at the duty start time
  2. We need to configure the timeout parameters to something reasonable
  3. If you are not the leader of the first round then prepare a full block for the next rounds

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@moshe-blox
The issue is for selecting reasonable timeout parameters, if the leader of the first round is bad and timeout is too slow we risk proposing the block late

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

proposals won't be considered in this SIP


*Attestations/Sync Committees*

Once the slot starts, wait 4 seconds for a block to be produced then start the timer according to the new rules.

```go
// AttestationOrSyncCommitteeTimeout returns the unix epoch time (seconds) in which we should send a RC message
func AttestationOrSyncCommitteeTimeout(round Round, height Height, network types.BeaconNetwork) int64 {
return RoundTimeout(round, height, 4, network)
}
```

*Aggregator/Contribution*

Once the slot starts, wait 8 seconds for a block to be produced then start the timer according to the new rules

```go
// AggregationOrContributionTimeout returns the unix epoch time (seconds) in which we should send a RC message
func AggregationOrContributionTimeout(r Round, height Height, network types.BeaconNetwork) int64 {
return RoundTimeout(r, height, 8, network)
}
```