proposer-runner: QBFT liveness choked by Beacon node #1825

Open
iurii-ssv opened this issue Oct 28, 2024 · 1 comment
Labels
enhancement New feature or request

Comments

iurii-ssv (Contributor) commented Oct 28, 2024

I've been looking into QBFT and the proposer duty a bit deeper recently. I might be missing some context, but I think there is a liveness issue in the proposer runner (it has probably been discussed in the past, but it's best to document such findings so they can serve as docs if nothing else). Namely,

once the proposer runner has reconstructed the partial RANDAO signatures, it immediately makes a blocking call to the Beacon node (GetBeaconBlock), which can take an arbitrary amount of time to finish (production logs sometimes show this request lasting 1-3s). It doesn't even matter whether a timeout is set on this call or not: there is no reason to block here unless the proposer runner is going to be the leader of the QBFT consensus that comes next (and even if it is the leader, blocking here is suboptimal, as I describe below). Liveness suffers because while the proposer runner is blocked it can't participate in QBFT at all; it prevents itself from executing the next stage of the duty even though it might never need the Beacon block data it is waiting for (that is, if it never needs to lead a round, or if it uses a justified prepared value when leading round changes instead of its own value).

So, instead of blocking on GetBeaconBlock, we want to make an asynchronous call (perhaps even with retries, but these would need to be fast) and use the data if/when it arrives. I haven't studied the relevant code in detail, but we'll need to start the QBFT Instance and block its execution only if the proposer runner turns out to be the leader for Round 1 or any of the rounds that follow (plus we'll need to be able to update Instance.StartValue once the Beacon data eventually arrives, obviously in a concurrency-safe manner). This way we will be able to participate in QBFT consensus ASAP, just not as a leader; a minimal sketch follows below. This is low-hanging fruit, and we can do better than that.
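To make the "don't block" idea concrete, here's a minimal Go sketch; BeaconBlock, Instance, SetStartValue, and the get callback are illustrative stand-ins for the real ssv types, not the actual code:

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// BeaconBlock is an illustrative stand-in for the data returned by the
// Beacon node's GetBeaconBlock call.
type BeaconBlock struct{ Slot uint64 }

// Instance is a minimal stand-in for a QBFT instance whose start value may
// arrive only after the instance has already begun participating.
type Instance struct {
	mu         sync.Mutex
	startValue *BeaconBlock
}

// SetStartValue installs the proposal data in a concurrency-safe way.
func (i *Instance) SetStartValue(b *BeaconBlock) {
	i.mu.Lock()
	defer i.mu.Unlock()
	i.startValue = b
}

// fetchBeaconBlockAsync runs the (potentially slow) Beacon node call in a
// goroutine and delivers the result on a channel, so the runner can join
// QBFT immediately instead of blocking for 1-3s.
func fetchBeaconBlockAsync(ctx context.Context,
	get func(context.Context) (*BeaconBlock, error)) <-chan *BeaconBlock {

	out := make(chan *BeaconBlock, 1)
	go func() {
		defer close(out)
		if block, err := get(ctx); err == nil {
			out <- block
		} // on error: log and possibly retry in a real implementation
	}()
	return out
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 4*time.Second)
	defer cancel()

	inst := &Instance{}

	// Simulated slow GetBeaconBlock (production logs show 1-3s requests).
	slowGet := func(ctx context.Context) (*BeaconBlock, error) {
		time.Sleep(2 * time.Second)
		return &BeaconBlock{Slot: 123}, nil
	}

	blockCh := fetchBeaconBlockAsync(ctx, slowGet)

	// ... start/join the QBFT Instance here without waiting; only a leader
	// that is about to propose actually needs the start value ...

	if block, ok := <-blockCh; ok {
		inst.SetStartValue(block)
		fmt.Println("start value set for slot", block.Slot)
	}
}
```

The point is just that the runner keeps processing QBFT messages while the fetch is in flight, and only a leader about to propose has to wait for the start value.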

I see 2 potential options for what we can do if the proposer runner wants to be a leader but can't get data from its Beacon node:

  1. Leader yielding its round voluntarily: the leader, instead of starting QBFT with a proposal (since it doesn't have a block from the Beacon node), performs a sort of round-change. The main difference, I think, is that every node can verify the validity of such a round change immediately upon receiving the message, just by checking the initiator's signature plus that it is indeed the leader (for the target slot and current round) - hence changing the leader this way should be pretty fast, and it should also work when the proposer runner is chosen to be the leader of a round change. (Note: we would also need to 1) store the yield with Instance.State to make sure a yielded leader doesn't attempt to propose in the yielded round, and 2) add yieldJustifications to pass along with the Propose msg so that the leader(s) who took over can justify proposing in a round they weren't meant to.)
  2. Operators can broadcast blocks fetched from the Beacon node (GetBeaconBlock) to the other operators in their cluster (those expected to perform the same proposer duty) as soon as they fetch these blocks. This way an operator doesn't depend 100% on its Beacon node to provide a block - it can take any of the blocks sent by peers in the same cluster and propose one of these as its own. (There are options, but the general idea is: the operator waits a certain amount of time for the Beacon request to execute, let's say 500ms, and once this time is exceeded it falls back to a block provided by peer(s); see the sketch after this list.) We would also need to verify peer blocks with ProposedValueCheckF before proposing them as our own, to ensure a peer isn't suggesting we sign a block for a slot we've already signed another block for, and thus we can't be slashed (note: we do this check for the block we get from the Beacon node too - as we should). But in terms of security nothing really changes, I think, and this should probably work both for blinded blocks and full data blocks.
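A rough sketch of the 500ms fallback in option 2, assuming the networking layer feeds peer-shared blocks into a channel; BeaconBlock, the channel wiring, and valueCheck (a stand-in for ProposedValueCheckF) are all illustrative:

```go
package runner

import (
	"context"
	"errors"
	"time"
)

// BeaconBlock is the same illustrative stand-in used in the earlier sketch.
type BeaconBlock struct{ Slot uint64 }

// selectProposal prefers our own Beacon node for up to beaconWait
// (e.g. 500ms), then falls back to blocks broadcast by cluster peers.
// Every candidate, ours or a peer's, must pass the slashing-safety check
// before we propose it as our own.
func selectProposal(
	ctx context.Context,
	ownCh <-chan *BeaconBlock, // response from our own GetBeaconBlock call
	peerCh <-chan *BeaconBlock, // blocks shared by peers in the cluster
	beaconWait time.Duration,
	valueCheck func(*BeaconBlock) error,
) (*BeaconBlock, error) {
	timer := time.NewTimer(beaconWait)
	defer timer.Stop()

	// Grace period: give our own Beacon node the first shot.
	select {
	case b := <-ownCh:
		if err := valueCheck(b); err == nil {
			return b, nil
		}
	case <-timer.C:
		// Beacon node too slow, fall through to peer blocks.
	case <-ctx.Done():
		return nil, ctx.Err()
	}

	// Fallback: take the first peer block that passes the value check.
	for {
		select {
		case b, ok := <-peerCh:
			if !ok {
				return nil, errors.New("no valid block from peers")
			}
			if err := valueCheck(b); err == nil {
				return b, nil
			}
		case <-ctx.Done():
			return nil, ctx.Err()
		}
	}
}
```

Whichever block wins goes through the same slashing-safety check as our own Beacon node's block, which is why the security story stays the same.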

We can even try to combine both (although that doesn't seem any better than just doing approach 2: if the leader can't receive any peer messages with data to propose, it's unlikely to do better at sending "yield" messages out; I don't know enough about libp2p to draw a conclusion on this). WDYT?

Btw, it seems other duty types suffer from a similar issue (although they probably have more time allocated for execution than the proposer runner does).

And to go one step further:

  • (a whole other topic, but somewhat relevant) it seems duties are "fragile" in the sense that a duty terminates once it encounters any error; this means any "temporary" error (such as a loss of network packets between the SSV and Beacon nodes, for example) will terminate the duty forever, rendering the operator unable to participate in the QBFT consensus for this particular duty. I don't see a mechanism for duty restart - which is one way to address that; but addressing each temporary error locally, by retrying the offending action in a certain way (see the retry sketch after this list), is probably just the better way to do it, since we don't want to redo all the work and a restart adds unnecessary latency
  • (a whole other topic, but somewhat relevant) finally, in case of a Beacon node outage (not a temporary blip) the operator goes offline (it keeps restarting until the Beacon node becomes healthy); I'm not sure if we can do better on this front, but as per my current understanding the operator could stay online and keep signing off on some but not other duty types (e.g. it can sign off on Ethereum blocks, while add/remove validator events it just can't confirm or deny happened or not) as long as it can check for itself that it won't be committing a slashable Ethereum offense
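On the "fragile duties" point, a generic retry helper is one way to handle transient errors locally instead of terminating the duty; this is just a sketch (the attempts/backoff numbers, and the idea of wrapping individual Beacon calls in it, are assumptions on my part):

```go
package runner

import (
	"context"
	"fmt"
	"time"
)

// retry re-runs op with a short (and, given duty deadlines, fast) backoff
// so a transient failure - e.g. lost packets between the SSV and Beacon
// nodes - is retried locally instead of terminating the whole duty.
func retry[T any](ctx context.Context, attempts int, backoff time.Duration,
	op func(context.Context) (T, error)) (T, error) {

	var zero T
	var lastErr error
	for i := 0; i < attempts; i++ {
		v, err := op(ctx)
		if err == nil {
			return v, nil
		}
		lastErr = err
		select {
		case <-time.After(backoff):
		case <-ctx.Done():
			return zero, ctx.Err()
		}
		backoff *= 2 // double the wait between attempts
	}
	return zero, fmt.Errorf("gave up after %d attempts: %w", attempts, lastErr)
}
```

Usage would look something like `block, err := retry(ctx, 3, 100*time.Millisecond, fetchBlock)`, where fetchBlock is whatever (hypothetical) Beacon call just failed transiently.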
@iurii-ssv iurii-ssv added the enhancement New feature or request label Oct 28, 2024
@iurii-ssv iurii-ssv changed the title proposer-runner: liveness choked by beacon node proposer-runner: QBFT liveness choked by beacon node Oct 29, 2024
@iurii-ssv iurii-ssv changed the title proposer-runner: QBFT liveness choked by beacon node proposer-runner: QBFT liveness choked by Beacon node Oct 29, 2024
iurii-ssv (Contributor, Author) commented Oct 29, 2024

Another related issue that we can solve here (see the "low hanging fruit" in the message above) is that in the current implementation (even if Beacon nodes are reasonably fast) QBFT instances start at "slightly" different times. This essentially creates an undesirable effect of round timeout(s) kicking in at different moments for different operators (this has also been mentioned in ssvlabs/SIPs#37). This adversely affects the QBFT consensus that follows (the worse the differences in Instance start times, the worse it affects QBFT, I imagine).

I think this can be solved quite easily just by:

  • starting the Instance well ahead of time and anchoring that start time to something deterministic (it could be the reconstructed-partial-RANDAO-signatures event, or maybe it's just better to simply start it at an exact wall-clock time - something like "duty target slot time -4s", for example; see the sketch after this list)
  • plus giving round 1 a larger timeout value, so we can account for fetching data from the Beacon node (or receiving it from peer(s), as per option 2 from the message above) and initialize the Instance start value well before the round change for round 1 happens
  • the rest of the rounds should probably have an equal timeout value (lower than that of round 1); it feels like this value should be equal to the "expected" (typical) value of (round 1 deadline - the moment we got a block to propose). But we probably also need to take into account that, for an Ethereum proposer, round 2 might have to end prematurely (before its designated round 2 QBFT timeout) simply because it's too late to broadcast a signed Ethereum block (the cutoff is probably somewhere in the range of 2.5s-4s) - thus it might not even make sense to try and start round 2
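A minimal sketch of the wall-clock anchoring from the first bullet (the 4s lead, the function names, and deriving slot time from genesis are illustrative assumptions; mainnet slot duration is 12s):

```go
package runner

import (
	"context"
	"time"
)

// instanceStartTime derives a deterministic start time from the duty's
// target slot: genesis + slot*slotDuration - lead. Anchoring to wall-clock
// time this way means every operator starts its Instance (and therefore
// runs its round timers) at the same moment.
func instanceStartTime(genesis time.Time, slot uint64,
	slotDuration, lead time.Duration) time.Time {

	slotStart := genesis.Add(time.Duration(slot) * slotDuration)
	return slotStart.Add(-lead) // e.g. lead = 4 * time.Second
}

// waitUntil blocks until t (or until ctx is cancelled), so the Instance can
// be started at the anchored moment rather than whenever RANDAO
// reconstruction happened to finish on this particular operator.
func waitUntil(ctx context.Context, t time.Time) error {
	d := time.Until(t)
	if d <= 0 {
		return nil // already past the anchor, start immediately
	}
	select {
	case <-time.After(d):
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}
```

Since genesis time and slot duration are the same for every operator, all operators compute an identical start time, so round timers fire in lockstep.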

There is an open PR with an unfinished effort to address something similar, but I'm not 100% sure it covers all the cases.
