
[EFM] Fix bugs from Benchnet testing #6898

Open
wants to merge 11 commits into feature/efm-recovery

Conversation

jordanschalm (Member) commented Jan 15, 2025

This PR includes some improvements from the branch being used for Benchnet Testing: https://github.com/onflow/flow-go/compare/jord/phcu-sv2.

In particular:

  • fixes a bug with DKG state migration, where the migration was applied to the wrong database (also adds debug logging of migration completion and migrated keys)
  • in the Reactor Engine, handles the case where a node which successfully completed the DKG restarts during the EpochCommit phase (prior logic was safe, but produced a misleading log)
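The second bullet can be illustrated with a minimal sketch of the state check in question. Note the state names mirror the flow-go identifiers discussed below, but the function and messages here are simplified illustrations, not the actual Reactor Engine code:

```go
package main

import "fmt"

// Simplified DKG state enum, mirroring the states discussed in this PR
// (illustrative; not the exact flow-go definitions).
type DKGState int

const (
	DKGStateCompleted DKGState = iota
	RandomBeaconKeyCommitted
	DKGStateFailure
)

// describeEpochCommitCheck sketches the branch this PR adds: a node that
// already committed its beacon key (e.g. it restarted during the EpochCommit
// phase after a successful DKG) gets an accurate debug-level message instead
// of a misleading warning.
func describeEpochCommitCheck(currentState DKGState) string {
	switch currentState {
	case RandomBeaconKeyCommitted:
		// healthy node that succeeded the DKG, then restarted in EpochCommit
		return "debug: beacon key already committed - exiting"
	case DKGStateCompleted:
		return "proceed: commit beacon key"
	default:
		return fmt.Sprintf("warn: dkg did not reach completed state: %d", currentState)
	}
}

func main() {
	fmt.Println(describeEpochCommitCheck(RandomBeaconKeyCommitted))
	fmt.Println(describeEpochCommitCheck(DKGStateCompleted))
}
```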

codecov-commenter commented Jan 16, 2025

Codecov Report

Attention: Patch coverage is 80.00000% with 2 lines in your changes missing coverage. Please review.

Project coverage is 41.13%. Comparing base (db5a8d7) to head (21f9688).

Files with missing lines:
  cmd/consensus/main.go: patch 0.00%, 2 lines missing ⚠️
Additional details and impacted files
@@                   Coverage Diff                    @@
##           feature/efm-recovery    #6898      +/-   ##
========================================================
+ Coverage                 41.12%   41.13%   +0.01%     
========================================================
  Files                      2123     2123              
  Lines                    187163   187163              
========================================================
+ Hits                      76971    76994      +23     
+ Misses                   103739   103723      -16     
+ Partials                   6453     6446       -7     
Flag: unittests, coverage 41.13% <80.00%> (+0.01% ⬆️)


@jordanschalm jordanschalm marked this pull request as ready for review January 16, 2025 22:11
@jordanschalm jordanschalm requested a review from a team as a code owner January 16, 2025 22:11
@@ -110,6 +111,7 @@ func MigrateDKGEndStateFromV1() func(txn *badger.Txn) error {
ops = append(ops,
UpsertDKGStateForEpoch(epochCounter, newState),
remove(makePrefix(codeDKGEndState, epochCounter)))
log.Debug().Msgf("removing %d->%d, writing %d->%d", epochCounter, oldState, epochCounter, newState)
Member:
Suggested change
log.Debug().Msgf("removing %d->%d, writing %d->%d", epochCounter, oldState, epochCounter, newState)
log.Info().Msgf("removing %d->%d, writing %d->%d", epochCounter, oldState, epochCounter, newState)

This doesn't happen often, so maybe Info is better for understanding whether we have migrated or not.

Member:
+1

return fmt.Errorf("could not migrate DKG end state from v1 to v2: %w", err)
}
log.Debug().Msgf("completed migrating DKG end state from v1 to v2")
Member:
How about making it Info, moving it into the method itself, and logging only if we have actually migrated? Additionally, we could add a debug message showing that we have entered and exited PostInit itself.
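The suggestion above can be sketched as follows. This is a toy model, not the actual MigrateDKGEndStateFromV1 implementation: the state values and the v1-to-v2 mapping are placeholders, and the point is only the shape of the logging (Info emitted by the caller only when work was actually done):

```go
package main

import "fmt"

// migrateDKGEndStates is a toy model of a v1->v2 end-state migration: it
// rewrites each v1 entry and reports how many entries it actually migrated,
// so the caller can log at Info level only when a migration really happened.
// The +100 mapping is a placeholder, not the real state translation.
func migrateDKGEndStates(v1 map[uint64]int) (migrated int) {
	for epoch, oldState := range v1 {
		newState := oldState + 100 // placeholder v1->v2 mapping
		v1[epoch] = newState
		migrated++
		fmt.Printf("removing %d->%d, writing %d->%d\n", epoch, oldState, epoch, newState)
	}
	return migrated
}

func main() {
	states := map[uint64]int{42: 3}
	if n := migrateDKGEndStates(states); n > 0 {
		// Info-level message, emitted only if we actually migrated something
		fmt.Printf("completed migrating %d DKG end state(s) from v1 to v2\n", n)
	}
}
```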

Member:
+1

durkmurder (Member) left a comment:

Added a few stylistic suggestions, otherwise looks good.

@@ -284,8 +284,13 @@ func (e *ReactorEngine) handleEpochCommittedPhaseStarted(currentEpochCounter uin

return
}
if currentState == flow.RandomBeaconKeyCommitted {
// this can happen if a healthy node which succeeded the DKG restarts in the EpochCommit phase
log.Debug().Msg("checking beacon key consistency after EpochCommit: observed committed beacon key for most recent dkg - exiting")
Member:
Is this only for better traces? Functionally, it doesn't change anything.

Member Author:
Yes, this is for more accurate logs - there is no functional change.

Member:
I think it does change something, because it shortcuts the sanity check that all (repeated) calls to committing the same epoch don't have inconsistent information. See my comment below for details.

AlexHentschel (Member) left a comment:

Looks good. One minor change request regarding a sanity check that I prefer not to shortcut.

Comment on lines 287 to 295
if currentState != flow.DKGStateCompleted {
log.Warn().Msgf("checking beacon key consistency: exiting because dkg didn't reach completed state: %s", currentState.String())
if currentState == flow.RandomBeaconKeyCommitted {
// this can happen if a healthy node which succeeded the DKG restarts in the EpochCommit phase
log.Debug().Msg("checking beacon key consistency after EpochCommit: observed committed beacon key for most recent dkg - exiting")
} else {
log.Warn().Msgf("checking beacon key consistency after EpochCommit: exiting because dkg didn't reach completed state: %s", currentState.String())
}
return
}
AlexHentschel (Member) commented Jan 17, 2025:
For high-assurance systems I am a big fan of idempotent operations, among others for the reasons that they are a superb way of detecting bugs and inconsistencies. Specifically here, what we ideally want is (i) if I have a key which is consistent with the EpochCommit service event, then commit the key or (ii) if the key is already committed, then we expect it to be consistent with the EpochCommit service event - otherwise we have a severe problem.

With the updated logic, we are entirely shortcutting the sanity check (ii). In an ideal world, where code doesn't have bugs, we wouldn't need (ii) ... but unfortunately that's not the case. If a node signs with a wrong random beacon key, then it will be slashed, which is a severe loss for the node operator.

I am not saying we should always do all the checks everywhere, but I still think we should in cases where it is easy to and doesn't impact code complexity or runtime - which I think is the case here:

  • when the node is up, handleEpochCommittedPhaseStarted is called at most once per epoch
  • a repeated call of handleEpochCommittedPhaseStarted only happens after a restart, which is rare and doing a bit of extra work during a restart is negligible overall
  • CommitMyBeaconPrivateKey is already implemented as an idempotent operation - I asked Yurii to make it idempotent, because it was only marginal extra work, while it provides the opportunity for super simple consistency checks in upstream code (like here)
Suggested change
if currentState != flow.DKGStateCompleted {
log.Warn().Msgf("checking beacon key consistency: exiting because dkg didn't reach completed state: %s", currentState.String())
if currentState == flow.RandomBeaconKeyCommitted {
// this can happen if a healthy node which succeeded the DKG restarts in the EpochCommit phase
log.Debug().Msg("checking beacon key consistency after EpochCommit: observed committed beacon key for most recent dkg - exiting")
} else {
log.Warn().Msgf("checking beacon key consistency after EpochCommit: exiting because dkg didn't reach completed state: %s", currentState.String())
}
return
}
// (i) if I have a key (currentState == flow.DKGStateCompleted) which is consistent with the EpochCommit service event,
// then commit the key or (ii) if the key is already committed (currentState == flow.RandomBeaconKeyCommitted), then we
// expect it to be consistent with the EpochCommit service event. While (ii) is a sanity check, we have a severe problem
// if it is violated, because a node signing with an invalid Random beacon key will be slashed - so we better check!
// Our logic for committing a key is idempotent: it is a no-op when stating that a key `k` should be committed that previously
// has already been committed; while it errors if `k` is different from the previously-committed key. In other words, the
// sanity check (ii) is already included in the happy-path logic for (i). So we just repeat the happy-path logic also for
// currentState == flow.RandomBeaconKeyCommitted, because repeated calls only occur due to node crashes, which are rare.
if !(currentState == flow.DKGStateCompleted || currentState == flow.RandomBeaconKeyCommitted) {
log.Warn().Msgf("checking beacon key consistency after EpochCommit: exiting because dkg didn't reach completed state: %s", currentState.String())
return
}

jordanschalm (Member Author) commented Jan 17, 2025:

> With the updated logic, we are entirely shortcutting the sanity check (ii)

This PR does not change whether the sanity check is run.

The outer conditional is unchanged; within the conditional we log a more informative message in the case where we already completed the final step in the state machine. Regardless, if we enter the conditional at all, we return without proceeding.

Member Author:

Added the sanity check in 04bd4a6.

log.Debug().Msg("checking beacon key consistency after EpochCommit: observed committed beacon key for most recent dkg - exiting")
} else {
log.Warn().Msgf("checking beacon key consistency after EpochCommit: exiting because dkg didn't reach completed state: %s", currentState.String())
}
return
}
snapshot := e.State.AtBlockID(firstBlock.ID())
Member:

I would suggest moving this to right before line 302:

nextDKG, err := snapshot.Epochs().Next().DKG()

In my head, querying the snapshot belongs to the code block around line 302 much more than it belongs to the exit condition ... just personal preference - feel free to ignore.

4 participants