AKS Node not rebooted with lock held for not existing node #847
Comments
This seems to be related to #822. The problem might appear when a lock is held by a node that has been removed from the cluster. What is new, however, is that the metric flags a node which does not need a reboot, though that may be a result of the faulty lock behaviour.
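To check whether this is the situation you are in, the lock holder can be compared against the nodes that currently exist. The following is only a rough sketch: it assumes the lock annotation value on the kured daemonset is a JSON object with a "nodeID" field (please verify this against your kured version), and the function name and inputs are hypothetical.

```python
import json
from typing import Iterable, Optional


def orphaned_lock_holder(annotation_value: Optional[str],
                         existing_nodes: Iterable[str]) -> Optional[str]:
    """Return the lock holder's node name if that node no longer exists.

    annotation_value: raw value of the lock annotation (assumed JSON with a
                      "nodeID" key); None or "" means no lock is held.
    existing_nodes:   names of the nodes currently in the cluster.
    """
    if not annotation_value:
        return None  # nobody holds the lock
    holder = json.loads(annotation_value).get("nodeID")
    if holder and holder not in set(existing_nodes):
        return holder  # lock is held by a node that is gone
    return None
```

The inputs could come from, for example, the output of kubectl get nodes and the annotations on the daemonset; a non-None result means the lock is held by a node that is no longer in the cluster.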
This issue was automatically considered stale due to lack of activity. Please update it and/or join our slack channels to promote it, before it automatically closes (in 7 days).
not stale
This seems to happen when Karpenter nodes get rebooted and Karpenter nukes them before they come back. This is happening to me on EKS as well, with lockTtl set to 30m.
@gyoza that sounds like a scenario where this could happen. My main question for folks who are experiencing this is: is the TTL configuration not working? At present, kured makes no guarantee that a node will continue to exist after it successfully acquires the lock (an annotation on the kured daemonset), but it does guarantee that, if you configure a lock TTL, the lock will be released after the TTL expires, whether or not the node that acquired the lock still exists at that time. Are we seeing different behavior from what is described above?
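To make the expected TTL behavior concrete, here is a minimal sketch of the expiry check as described above. It is not kured's actual code. It assumes the annotation value is JSON with a "created" RFC 3339 timestamp and a "TTL" expressed in nanoseconds, and that a TTL of 0 means the lock never expires on its own; the helper names are hypothetical.

```python
import json
import re
from datetime import datetime, timedelta, timezone


def parse_rfc3339(ts: str) -> datetime:
    """Parse an RFC 3339 timestamp, truncating fractions to microseconds."""
    m = re.fullmatch(
        r"(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})(\.\d+)?(Z|[+-]\d{2}:\d{2})", ts)
    if m is None:
        raise ValueError(f"not an RFC 3339 timestamp: {ts!r}")
    base, frac, tz = m.groups()
    micros = (frac or ".0")[1:7].ljust(6, "0")
    offset = "+00:00" if tz == "Z" else tz
    return datetime.strptime(f"{base}.{micros}{offset}",
                             "%Y-%m-%dT%H:%M:%S.%f%z")


def lock_is_expired(annotation_value: str, now: datetime) -> bool:
    """True if the lock's TTL has elapsed, regardless of whether the
    node that acquired the lock still exists."""
    lock = json.loads(annotation_value)
    ttl_ns = lock.get("TTL", 0)  # assumed to be nanoseconds
    if ttl_ns <= 0:
        return False  # no TTL configured: lock does not expire by itself
    created = parse_rfc3339(lock["created"])
    return now >= created + timedelta(microseconds=ttl_ns / 1000)
```

With a 30m TTL, a lock created at 08:00 should count as expired for any check from 08:30 onwards. If the lock holder in the annotation stays the same well beyond that, the behavior differs from the guarantee described above.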
@jackfrancis Exactly! I figured the lock would expire whether the node was around or not, but that does not seem to be the case. Daemonset:
Logs:
The only way I can get things working again, momentarily, is to rollout restart the daemonset on each context.
It appears that even after a daemonset rollout restart, that specific node lock shows up again.
Is there a way to force-clear the lock manually?
This could help: https://kured.dev/docs/operation/#manual-unlock (at least it works for us). Nevertheless, it is really annoying that the lock-ttl setting doesn't solve the problem.
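For anyone who wants to script the manual unlock across several contexts, here is a small sketch that wraps the kubectl command from the linked docs page. The namespace kube-system, the daemonset name kured and the annotation key weave.works/kured-node-lock are the defaults as far as I know, but treat them as assumptions and adjust them to your install; the function names are hypothetical.

```python
import subprocess
from typing import List


def manual_unlock_command(namespace: str = "kube-system",
                          daemonset: str = "kured",
                          annotation: str = "weave.works/kured-node-lock") -> List[str]:
    """Build the kubectl call that removes the lock annotation.

    A trailing '-' after the annotation key tells kubectl to delete it.
    """
    return ["kubectl", "-n", namespace, "annotate", "ds", daemonset,
            f"{annotation}-"]


def release_lock(**kwargs: str) -> None:
    """Run the unlock command against the current kubectl context."""
    subprocess.run(manual_unlock_command(**kwargs), check=True)
```

Printing " ".join(manual_unlock_command()) first is a cheap way to double-check the command before calling release_lock() for real.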
Hi,
we're facing an issue with the newest version of Kured, 1.14.0.
Nodes are not rebooted.
Prometheus metrics say that a reboot is required, but on the node host the file
/var/run/reboot-required
is not present. Adding the file manually to the node results in the log message warning msg="Lock already held:" for a node that no longer exists.
We added the lockTtl flag from the Helm chart.
The nodes got the annotation
Do you have any idea why this happens?
I know that the messages are from today and that there was not much time between the config change and possible reboots, but it was like this the whole of last week as well; these are only the newest logs after a redeployment with an increased
endTime
Thank you in advance
André