This repository has been archived by the owner on Sep 21, 2023. It is now read-only.

pkg/operator: don't erroneously "update" (kill) unhealthy active node #345

Open: cpick wants to merge 2 commits into master

@cpick cpick commented Sep 14, 2018

Previously, any node whose health couldn't be queried by
`Vaults.updateLocalVaultCRStatus()` was removed from the standby, sealed,
and updated lists of nodes (so long as at least one other node could be reached
and was healthy, i.e. `changed == true`).

Thus, if the active node could not be reached and confirmed healthy, it was
removed from `VaultServiceStatus.UpdatedNodes` but remained
`VaultServiceStatus.VaultStatus.Active`.

Later, this caused `Vaults.syncUpgrade()` to conclude that the active node
was the only non-updated node and to kill it to "complete" the update it
assumed was in progress.
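
The failure mode described above can be reproduced in a small, self-contained model. This is only a sketch and not the operator's actual code: the `node` and `status` types, and the functions `rebuildStatus` and `shouldKillActive`, are hypothetical stand-ins for `Vaults.updateLocalVaultCRStatus()`, `VaultServiceStatus`, and the check in `Vaults.syncUpgrade()`.

```go
package main

import "fmt"

// node is a hypothetical, simplified view of one Vault pod.
type node struct {
	name      string
	reachable bool // whether the health query succeeded
	active    bool // what the health query would report
	updated   bool // already running the target version
}

// status is a simplified stand-in for VaultServiceStatus.
type status struct {
	Active       string
	UpdatedNodes []string
}

// rebuildStatus models the old behavior: a node whose health query fails is
// skipped, so it drops out of UpdatedNodes, while Active keeps its old value.
func rebuildStatus(prev status, nodes []node) status {
	next := status{Active: prev.Active}
	changed := false
	for _, n := range nodes {
		if !n.reachable {
			continue // unreachable: contributes nothing to the new lists
		}
		changed = true
		if n.active {
			next.Active = n.name
		}
		if n.updated {
			next.UpdatedNodes = append(next.UpdatedNodes, n.name)
		}
	}
	if !changed {
		return prev // no node could be queried: keep the previous status
	}
	return next
}

// shouldKillActive models the upgrade check: the active node looks like the
// only node that still needs updating.
func shouldKillActive(s status, total int) bool {
	if s.Active == "" {
		return false
	}
	for _, name := range s.UpdatedNodes {
		if name == s.Active {
			return false
		}
	}
	return len(s.UpdatedNodes) == total-1
}

func main() {
	prev := status{Active: "vault-0", UpdatedNodes: []string{"vault-0", "vault-1", "vault-2"}}
	nodes := []node{
		{name: "vault-0", reachable: false, active: true, updated: true},
		{name: "vault-1", reachable: true, updated: true},
		{name: "vault-2", reachable: true, updated: true},
	}
	next := rebuildStatus(prev, nodes)
	fmt.Println("updated nodes:", next.UpdatedNodes)
	fmt.Println("kill active:", shouldKillActive(next, len(nodes)))
}
```

In this model every node is already updated, yet a single failed health query against the active node is enough for the check to report that the active node should be killed.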

Keep note of which nodes have actually been updated, irrespective of whether
they're reachable and healthy, to prevent this issue.
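
A minimal sketch of that idea follows. It assumes the version a pod runs can be determined without contacting Vault (modeled here by a hypothetical `version` field); the `pod` type, `updatedNodes`, `onlyActiveLeft`, and the `"new"` version string are all illustrative names, not the operator's API.

```go
package main

import "fmt"

// pod is a hypothetical, simplified view of one Vault pod.
type pod struct {
	name      string
	version   string // version the pod runs, assumed readable without contacting Vault
	reachable bool   // whether the health query succeeded
}

// updatedNodes lists every pod already running the target version.
// Reachability is deliberately ignored.
func updatedNodes(pods []pod, target string) []string {
	var updated []string
	for _, p := range pods {
		if p.version == target {
			updated = append(updated, p.name)
		}
	}
	return updated
}

// onlyActiveLeft reports whether the active node is the single node that has
// not been updated yet.
func onlyActiveLeft(active string, updated []string, total int) bool {
	for _, name := range updated {
		if name == active {
			return false
		}
	}
	return active != "" && len(updated) == total-1
}

func main() {
	pods := []pod{
		{name: "vault-0", version: "new", reachable: false}, // active, temporarily unreachable
		{name: "vault-1", version: "new", reachable: true},
		{name: "vault-2", version: "new", reachable: true},
	}
	updated := updatedNodes(pods, "new")
	fmt.Println("updated:", updated)
	fmt.Println("kill active:", onlyActiveLeft("vault-0", updated, len(pods)))
}
```

Because the updated list no longer depends on the health query, the unreachable active node stays in it and the check does not fire.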

Fixes #344

Chris Pick added 2 commits September 14, 2018 11:46
More fully describe and simplify the tests `Vaults.syncUpgrade()` uses to
determine whether it should trigger the active node to step down.  This
will hopefully make them easier to understand without any behavioral
changes.

Log when the active node has been forced to step down, making it easier
to follow the operator's actions.