proposer-runner: QBFT liveness choked by Beacon node #1825
Another related issue we could solve here (see "low hanging fruit" in the message above): in the current implementation, even when Beacon nodes are reasonably fast, QBFT instances start at slightly different times. This creates an undesirable effect where round timeout(s) kick in at different moments for different operators (also mentioned in ssvlabs/SIPs#37), which adversely affects the QBFT consensus that follows (the larger the differences in instance start time, the worse the effect, I imagine). I think this can be solved quite easily.
There is an open PR with an unfinished effort to address something similar, but I'm not 100% sure it covers all the cases.
I've been looking into QBFT and the proposer duty a bit deeper recently. I might be missing some context, but I think there is a liveness issue in the proposer runner (it has probably been discussed in the past, but it's best to document such findings, if nothing else to serve as docs). Namely, once the proposer runner has reconstructed partial RANDAO signatures, it immediately makes a blocking call to the Beacon node (`GetBeaconBlock`), which can take an arbitrary amount of time to finish (production logs sometimes show this request lasting 1-3s). It doesn't even matter whether a timeout is set on this call: there is no reason to block here unless the proposer runner is going to be the leader of the QBFT consensus that comes next (and even if it is the leader, blocking here is suboptimal, as described below). Liveness suffers because while the proposer runner is blocked it can't participate in QBFT at all, so it prevents itself from executing the next stage of the duty even though it might never need the Beacon block data it is waiting for (that is, if it never needs to lead a round, or if it uses a justified prepared value when leading round changes instead of its own value).

So, instead of blocking on `GetBeaconBlock`, we want to make an asynchronous call (perhaps even with retries, though these would need to be fast) and use the data if/when it comes. I haven't studied the relevant code in detail, but we'll need to start the `QBFTInstance` and block its execution only if the proposer runner turns out to be the leader for round 1 or any of the rounds that follow (plus we'll need to be able to update `Instance.StartValue` once the Beacon data eventually arrives, obviously in a concurrency-safe manner). This way we will be able to participate in the QBFT consensus ASAP, just not as a leader. This is the low-hanging fruit.

We can do better than that. I see 2 potential options for what we can do if the proposer runner wants to be a leader but can't get data from the Beacon node:
1. The leader yields its round: 1) update `Instance.State` to make sure the yielded leader doesn't attempt to propose in the yielded round; 2) add `yieldJustifications` to pass along with the Propose msg so that the leader(s) who took over can justify proposing in a round they weren't meant to lead.
2. Operators send blocks (the results of `GetBeaconBlock`) to the other operators in their cluster (those expected to perform the same proposer duty) as soon as they fetch them. This way an operator doesn't depend 100% on its Beacon node to provide a block; it can take any of the blocks sent by peers in the same cluster and propose one as its own. (There are options here, but the general idea is: an operator can wait a certain amount of time for the Beacon request to execute, say 500ms, and once this time is exceeded it can fall back to a block provided by a peer.) We would also need to verify peer blocks with `ProposedValueCheckF` before proposing one as our own (to ensure a peer isn't suggesting we sign a block for a slot we've already signed another block for, so we can't be slashed; note that we run this check on the block we get from the Beacon node too, as we should). In terms of security nothing really changes, I think, and this should probably work for blinded blocks as well as full data blocks.

We can even try to combine both (although it doesn't seem any better than just doing approach 2: if the leader can't receive any peer messages with data to propose, it's unlikely to do better at sending "yield" messages out; I don't know enough about libp2p to draw a conclusion here). WDYT?
Btw, it seems other duty types suffer from a similar issue (although they probably have more time allocated for execution than the proposer runner).
And to go one step further: