pkg/operator: don't erroneously "update" (kill) unhealthy active node #345

cpick · 2018-09-14T16:24:33Z

Previously, any node whose health couldn't be queried by
Vaults.updateLocalVaultCRStatus() would be removed from the standby, sealed,
and updated lists of nodes (so long as at least one other node could be reached
and was healthy, i.e. changed == true).

Thus, if the active node could not be reached and determined healthy, it would
be removed from VaultServiceStatus.UpdatedNodes but would remain
VaultServiceStatus.VaultStatus.Active.

Later, this would cause Vaults.syncUpgrade() to determine that the active node
was the only non-updated node and then kill it to "complete" the update it
assumed was in progress.

Keep note of which nodes have actually been updated irrespective of whether
they're reachable and healthy to prevent this issue.

Fixes #344
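
The failure mode described above can be reproduced with a small, self-contained sketch. This is not the operator's actual code: the `status` and `node` types and the `updatedOld`, `updatedNew`, and `shouldStepDown` functions are hypothetical simplifications, and the "active node is the only non-updated node" check is only an approximation of what `Vaults.syncUpgrade()` tests.

```go
package main

// status is a hypothetical, trimmed-down stand-in for VaultServiceStatus;
// only the two fields discussed in this PR are modeled.
type status struct {
	Active       string
	UpdatedNodes []string
}

// node is a hypothetical per-pod observation.
type node struct {
	Name      string
	Updated   bool // pod already runs the desired version
	Reachable bool // health query succeeded
}

// updatedOld mimics the previous behavior: a node is only listed as
// updated if its health could also be queried.
func updatedOld(nodes []node) []string {
	var out []string
	for _, n := range nodes {
		if n.Updated && n.Reachable {
			out = append(out, n.Name)
		}
	}
	return out
}

// updatedNew mimics the fixed behavior: a node is listed as updated
// irrespective of whether it is currently reachable and healthy.
func updatedNew(nodes []node) []string {
	var out []string
	for _, n := range nodes {
		if n.Updated {
			out = append(out, n.Name)
		}
	}
	return out
}

// shouldStepDown reports whether the active node looks like the only
// node that has not been updated yet (an assumed simplification of the
// upgrade check).
func shouldStepDown(s status, total int) bool {
	if s.Active == "" {
		return false
	}
	for _, name := range s.UpdatedNodes {
		if name == s.Active {
			return false
		}
	}
	return len(s.UpdatedNodes) == total-1
}
```

With three fully updated nodes where only the active one is temporarily unreachable, `updatedOld` drops the active node from the updated list, so `shouldStepDown` returns true even though no upgrade is in progress. With `updatedNew` the active node stays in the list and no step-down is triggered.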

More fully describe and simplify the tests `Vaults.syncUpgrade()` uses to determine whether it should trigger the active node to step down. This should make them easier to understand without changing behavior. Also log when the active node has been forced to step down, making it easier to follow the operator's actions.


Chris Pick added 2 commits September 14, 2018 11:46
