Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BFT-465: Twins tests #117

Merged
merged 26 commits into from
Jun 10, 2024
Merged

BFT-465: Twins tests #117

merged 26 commits into from
Jun 10, 2024

Conversation

aakoshh
Copy link
Contributor

@aakoshh aakoshh commented May 24, 2024

What ❔

Add tests that run multiple Twins scenarios with random number of replicas as twins.

  • Add Network::Twins type with a partition schedule
  • Add run_nodes_twins to execute the test with arbitrary network partitions and potentially multiple ports per validator
  • Buffer messages that can't be delivered and deliver them sometime later (potentially unless superseded by a newer message from source to target)
  • Simulate the gossiping of finalized blocks
  • Run test with no twins and no partitions
  • Run test with no twins but random partitions
  • Run test with twins and random partitions

Investigation

cargo nextest run -p zksync_consensus_bft twins_network --no-capture

Initially the test failed when partitions are introduced. I'll try to understand if this is because the leader got isolated and an important message got lost. Would like to understand if eventual delivery is an absolute requirement even if all partitions are at least as large as the quorum size.

🔍 I think the reason for the test failing is because it looks for all nodes having persisted a certain number of blocks, but if a proposal with the payload is missed due to a partition preventing the message from being delivered, then there is no mechanism in the machinery instantiated by the test to procure these later, and those nodes are stuck.

🔧 I implemented a message stashing mechanism in the mock network but it still doesn't finalise blocks 👀

🔍 The problem seems to be that my unstashing mechanism only kicked in when node A tried to send a new message to node B and wasn't blocked any more. However if A didn't try to send, then the stashed messages to B weren't delivered, which causes longer and longer timeouts as nobody is making progress for one reason or another. Meanwhile for example C can be already in a new view, so if we see that, we could conclude that A-to-B should be unstashed as well even if there are no messages from A to B in that round.

🔧 I'll try testing after merging #119 which should trigger unstashes in each round.

🔍 It's still failing after adding the replica-to-replica broadcast. For example one replica A is isolated in a round 1 and doesn't get a LeaderCommit; then in the next round replica B is isolated, and A gets all the missing messages from round 1, plus the new LeaderPrepare, but it doesn't respond with a ReplicaCommit because it doesn't have the block from round 1, and therefore cannot even store proposal 2. The consensus relies on the external gossiping mechanism to eventually propagate the FinalBlock to it; until then the node is stuck. I need to simulate the effect of gossiping in the test.

🔧 I implemented a simulated gossip using the following mechanism: if one node is about to send/unstash a message to another and they are in a gossip relationship, and the message contains a CommitQC, and the sender has the finalized block for the committed height, then it inserts the block directly into the target store.

🔍 The tests now work without twins, but fail with 1 or 2 twins, albeit not on every scenario

🔧 Changed the simulated gossip to push all ancestors of a finalized block into the target blockstore, not just the one in the CommitQC that is in the latest message. This simulates the ability of the target to fetch all missing blocks.

Why ❔

To check that the consensus does not fail as long as the number of twins does not exceed the tolerable number of Byzantine nodes.

Base automatically changed from 464-twins-gen to main May 28, 2024 08:23
@aakoshh aakoshh force-pushed the bft-465-twins-test branch from ec75578 to ae67132 Compare May 28, 2024 08:24
@aakoshh aakoshh force-pushed the bft-465-twins-test branch from 93202ea to 43fb874 Compare May 28, 2024 13:13
@aakoshh aakoshh force-pushed the bft-465-twins-test branch from 5e1db08 to 6360c12 Compare May 28, 2024 20:02
@aakoshh aakoshh force-pushed the bft-465-twins-test branch from 58d42d9 to 9c06ad4 Compare May 29, 2024 10:31
@aakoshh aakoshh changed the title BFT-465: Skeleton for Twins tests BFT-465: Twins tests May 29, 2024
@aakoshh aakoshh force-pushed the bft-465-twins-test branch from d926088 to 76cfc9c Compare May 30, 2024 12:25
@aakoshh aakoshh force-pushed the bft-465-twins-test branch from 76cfc9c to 78b81f2 Compare May 30, 2024 13:40
@aakoshh aakoshh marked this pull request as ready for review May 30, 2024 14:35
@aakoshh aakoshh force-pushed the bft-465-twins-test branch from a818fb8 to ea10b1d Compare May 30, 2024 18:10
Copy link
Member

@brunoffranca brunoffranca left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice work! As a side-note, really appreciate the quality of the documentation in the code and PR.

node/actors/bft/src/testonly/run.rs Show resolved Hide resolved
node/actors/bft/src/testonly/run.rs Outdated Show resolved Hide resolved
@brunoffranca brunoffranca merged commit 90b893c into main Jun 10, 2024
5 checks passed
@brunoffranca brunoffranca deleted the bft-465-twins-test branch June 10, 2024 14:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants