-
Notifications
You must be signed in to change notification settings - Fork 180
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[EFM] Fix bugs from Benchnet testing #6898
base: feature/efm-recovery
Are you sure you want to change the base?
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## feature/efm-recovery #6898 +/- ##
========================================================
+ Coverage 41.12% 41.13% +0.01%
========================================================
Files 2123 2123
Lines 187163 187163
========================================================
+ Hits 76971 76994 +23
+ Misses 103739 103723 -16
+ Partials 6453 6446 -7
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
storage/badger/operation/dkg.go
Outdated
@@ -110,6 +111,7 @@ func MigrateDKGEndStateFromV1() func(txn *badger.Txn) error { | |||
ops = append(ops, | |||
UpsertDKGStateForEpoch(epochCounter, newState), | |||
remove(makePrefix(codeDKGEndState, epochCounter))) | |||
log.Debug().Msgf("removing %d->%d, writing %d->%d", epochCounter, oldState, epochCounter, newState) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
log.Debug().Msgf("removing %d->%d, writing %d->%d", epochCounter, oldState, epochCounter, newState) | |
log.Info().Msgf("removing %d->%d, writing %d->%d", epochCounter, oldState, epochCounter, newState) |
This doesn't happen often so maybe info is better to understand if we have migrated or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
cmd/consensus/main.go
Outdated
return fmt.Errorf("could not migrate DKG end state from v1 to v2: %w", err) | ||
} | ||
log.Debug().Msgf("completed migrating DKG end state from v1 to v2") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about making it Info
and moving it into the method itself and log only if we have actually migrated. Additionally we could have a debug msg which shows that we have entered and exited PostInit
itself
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a few stylistic suggestions , otherwise looks good.
@@ -284,8 +284,13 @@ func (e *ReactorEngine) handleEpochCommittedPhaseStarted(currentEpochCounter uin | |||
|
|||
return | |||
} | |||
if currentState == flow.RandomBeaconKeyCommitted { | |||
// this can happen if a healthy node which succeeded the DKG restarts in the EpochCommit phase | |||
log.Debug().Msg("checking beacon key consistency after EpochCommit: observed committed beacon key for most recent dkg - exiting") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this only for better traces? Functionality it doesn't change anything.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this is for more accurate logs - there is no functional change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it does change something, because it shortcuts the sanity check that all (repeated) calls to committing the same epoch don't have inconsistent information. See my comment below for details.
Co-authored-by: Yurii Oleksyshyn <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. One minor change request regarding a sanity check that I prefer not to shortcut
if currentState != flow.DKGStateCompleted { | ||
log.Warn().Msgf("checking beacon key consistency: exiting because dkg didn't reach completed state: %s", currentState.String()) | ||
if currentState == flow.RandomBeaconKeyCommitted { | ||
// this can happen if a healthy node which succeeded the DKG restarts in the EpochCommit phase | ||
log.Debug().Msg("checking beacon key consistency after EpochCommit: observed committed beacon key for most recent dkg - exiting") | ||
} else { | ||
log.Warn().Msgf("checking beacon key consistency after EpochCommit: exiting because dkg didn't reach completed state: %s", currentState.String()) | ||
} | ||
return | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For high-assurance systems I am a big fan of idempotent operations, among others for the reasons that they are a superb way of detecting bugs and inconsistencies. Specifically here, what we ideally want is (i) if I have a key which is consistent with the EpochCommit service event, then commit the key or (ii) if the key is already committed, then we expect it to be consistent with the EpochCommit service event - otherwise we have a severe problem.
With the updated logic, we are entirely shortcutting the sanity check (ii). In an ideal world, where code doesn't have bugs, we wouldn't need (ii) ... but unfortunately that's not the case. If a nodes signs with a wrong random beacon key, then it will be slashed, which is a severe loss for the node operator.
I am not saying we should always do all the checks everywhere, but I still think we should in cases where it is easy to and doesn't impact code complexity or runtime - which I think is the case here:
- when the node is up,
handleEpochCommittedPhaseStarted
is called at most once per epoch - a repeated call of
handleEpochCommittedPhaseStarted
only happens after a restart, which is rare and doing a bit of extra work during a restart is negligible overall CommitMyBeaconPrivateKey
is already implemented as an idempotent operation - I asked Yurii to make it idempotent, because it was only marginal extra work, while it provides the opportunity for super simple consistency checks in upstream code (like here)
if currentState != flow.DKGStateCompleted { | |
log.Warn().Msgf("checking beacon key consistency: exiting because dkg didn't reach completed state: %s", currentState.String()) | |
if currentState == flow.RandomBeaconKeyCommitted { | |
// this can happen if a healthy node which succeeded the DKG restarts in the EpochCommit phase | |
log.Debug().Msg("checking beacon key consistency after EpochCommit: observed committed beacon key for most recent dkg - exiting") | |
} else { | |
log.Warn().Msgf("checking beacon key consistency after EpochCommit: exiting because dkg didn't reach completed state: %s", currentState.String()) | |
} | |
return | |
} | |
// (i) if I have a key (currentState == flow.DKGStateCompleted) which is consistent with the EpochCommit service event, | |
// then commit the key or (ii) if the key is already committed (currentState == flow.RandomBeaconKeyCommitted), then we | |
// expect it to be consistent with the EpochCommit service event. While (ii) is a sanity check, we have a severe problem | |
// if it is violated, because a node signing with an invalid Random beacon key will be slashed - so we better check! | |
// Our logic for committing a key is idempotent: it is a no-op when stating that a key `k` should be committed that previously | |
// has already been committed; while it errors if `k` is different from the previously-committed key. In other words, the | |
// sanity check (ii) is already included in the happy-path logic for (i). So we just repeat the happy-path logic also for | |
// currentState == flow.RandomBeaconKeyCommitted, because repeated calls only occur due to node crashes, which are rare. | |
if !(currentState == flow.DKGStateCompleted || currentState == flow.RandomBeaconKeyCommitted) { | |
log.Warn().Msgf("checking beacon key consistency after EpochCommit: exiting because dkg didn't reach completed state: %s", currentState.String()) | |
return | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the updated logic, we are entirely shortcutting the sanity check (ii)
This PR does not change whether the sanity check is run.
The outer conditional is unchanged; within the conditional we log a more informative message in the case where we already completed the final step in the state machine. Regardless, if we enter the conditional at all, we return without proceeding.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added the sanity check in 04bd4a6.
log.Debug().Msg("checking beacon key consistency after EpochCommit: observed committed beacon key for most recent dkg - exiting") | ||
} else { | ||
log.Warn().Msgf("checking beacon key consistency after EpochCommit: exiting because dkg didn't reach completed state: %s", currentState.String()) | ||
} | ||
return | ||
} | ||
snapshot := e.State.AtBlockID(firstBlock.ID()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR includes some improvements from the branch being used for Benchnet Testing: https://github.com/onflow/flow-go/compare/jord/phcu-sv2.
In particular: