-
Notifications
You must be signed in to change notification settings - Fork 87
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Avoid picking old helm-delete job #232
base: master
Are you sure you want to change the base?
Conversation
Signed-off-by: Jian Wang <[email protected]>
// remove old helm-delete job if it was left | ||
job, err := c.jobCache.Get(chart.Namespace, expectedJob.Name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think you need to do this here, the job patcher already deletes and re-creates the job as necessary, as part of the Apply logic:
helm-controller/pkg/controllers/chart/chart.go
Lines 167 to 173 in 255f905
func (c *Controller) jobPatcher(namespace, name string, pt types.PatchType, data []byte) (runtime.Object, error) { err := c.jobs.Delete(namespace, name, &metav1.DeleteOptions{PropagationPolicy: &deletePolicy}) if err == nil || apierrors.IsNotFound(err) { return nil, fmt.Errorf("create or replace job") } return nil, err } helm-controller/pkg/controllers/chart/chart.go
Lines 126 to 129 in 255f905
c.apply = apply. WithCacheTypes(helms, confs, jobs, crbs, sas, cm, s). WithStrictCaching(). WithPatcher(jobs.GroupVersionKind(), c.jobPatcher) helm-controller/pkg/controllers/chart/chart.go
Lines 237 to 239 in 255f905
// if the job already exists but is tied to an install or upgrade, there will be a need to patch so // the apply will execute the jobPatcher, which will delete the install/upgrade job and recreate a uninstall job err = generic.ConfigureApplyForObject(c.apply, chart, &generic.GeneratingHandlerOptions{
Is there some other bit of logic that the patcher needs to capture?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess the bug is due to WithOwner(chart).
in
WithOwner(chart). |
First round of create & delete a HelmChart X
HelmChart X with ID A
helm-delete job 1
helm-delete job 2 (and possiblely left orphaned)
Second round of create & delete a HelmChart X
Chart X with ID B
helm-delete job 2 will not be replaced due to `WithOwner`
But
job, err := c.jobCache.Get(chart.Namespace, expectedJob.Name) |
c.recorder.Eventf(chart, corev1.EventTypeNormal, "RemoveJob", "Uninstalled HelmChart using Job %s/%s, removing resources", job.Namespace, job.Name) |
That's the gap in the logic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we manually create a same namespaced & named helm-delete
job before deleting a HelmChart, I guess it will also shortcut the HelmChart's real deleting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I remember correctly, the WithOwner stuff uses the owning object's name, namespace, and GVK to track resources across namespaces. As long as the owning HelmChart object shares those same attributes across delete/create cycles it should track it properly. If you run the controller with --debug you should get debug logs from the DesiredSet stuff showing you what it's trying to do - does that shed any light on the situation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I add some debug #177 (comment) on Harvester environment, the logs /events can clearly show that, the generic.ConfigureApplyForObject
is not able to correctly identify the old left helm-delete
job.
From the debug, it looks essential to remove the old job.
Btw the created helm-delete
job, has no Owner
field; the apply has a interl owner field.
harv41:/home/rancher # kk get jobs -n cattle-logging-system helm-delete-rancher-logging -oyaml
apiVersion: batch/v1
kind: Job
metadata:
annotations:
batch.kubernetes.io/job-tracking: ""
objectset.rio.cattle.io/applied: ...
objectset.rio.cattle.io/id: helm-chart-registration
objectset.rio.cattle.io/owner-gvk: helm.cattle.io/v1, Kind=HelmChart
objectset.rio.cattle.io/owner-name: rancher-logging
objectset.rio.cattle.io/owner-namespace: cattle-logging-system
creationTimestamp: "2024-04-19T08:32:28Z"
finalizers:
- wrangler.cattle.io/promote-node-controller
generation: 1
labels:
helmcharts.helm.cattle.io/chart: rancher-logging
objectset.rio.cattle.io/hash: ea9e4622f05ef91896488127ad4be76f709ee325
name: helm-delete-rancher-logging
namespace: cattle-logging-system
resourceVersion: "739101"
uid: bcc55913-0ad5-48e9-8dcf-8b380f037ee8
spec:
backoffLimit: 1000
completionMode: NonIndexed
completions: 1
parallelism: 1
selector:
matchLabels:
batch.kubernetes.io/controller-uid: bcc55913-0ad5-48e9-8dcf-8b380f037ee8
suspend: false
template:
metadata:
annotations:
helmcharts.helm.cattle.io/configHash: SHA256=0832CE8EBAB899509A3AB82C345F3DF72C4D3300D5CDB2E0C47BB943F9630B34
creationTimestamp: null
labels:
batch.kubernetes.io/controller-uid: bcc55913-0ad5-48e9-8dcf-8b380f037ee8
batch.kubernetes.io/job-name: helm-delete-rancher-logging
controller-uid: bcc55913-0ad5-48e9-8dcf-8b380f037ee8
helmcharts.helm.cattle.io/chart: rancher-logging
job-name: helm-delete-rancher-logging
spec:
...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The orphaned job, was left per this return
// once we have run the above logic, we can now check if the job is complete
job, err = c.jobCache.Get(chart.Namespace, expectedJob.Name)
if apierrors.IsNotFound(err) {
// the above apply should have created it, something is wrong.
// if you are here, there must be a bug in the code.
@@ -269,7 +289,7 @@ func (c *Controller) OnRemove(key string, chart *v1.HelmChart) (*v1.HelmChart, e
chartCopy.Status.JobName = job.Name
// the `chart` may be on on apiserver side
newChart, err := c.helms.Update(chartCopy)
if err != nil {
return chart, fmt.Errorf("unable to update status of helm chart to add uninstall job name %s, %w", chartCopy.Status.JobName, err)
}
And the normal running path has this call
// note: an empty apply removes all resources owned by this chart
err = generic.ConfigureApplyForObject(c.apply, chart, &generic.GeneratingHandlerOptions{
AllowClusterScoped: true,
}).
WithOwner(chart).
WithSetID("helm-chart-registration").
ApplyObjects()
I will test if add this call to the above return
works.
Signed-off-by: Jian Wang <[email protected]>
@@ -269,7 +289,21 @@ func (c *Controller) OnRemove(key string, chart *v1.HelmChart) (*v1.HelmChart, e | |||
chartCopy.Status.JobName = job.Name | |||
newChart, err := c.helms.Update(chartCopy) | |||
if err != nil { | |||
return chart, fmt.Errorf("unable to update status of helm chart to add uninstall job name %s", chartCopy.Status.JobName) | |||
// if chart is gone, clean resources | |||
if apierrors.IsNotFound(err) || strings.Contains(err.Error(), "StorageError") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on the error you shared: StorageError: invalid object, Code: 4, Key: /registry/helm.cattle.io/helmcharts/cattle-logging-system/rancher-logging, ResourceVersion: 0, AdditionalErrorMsg: Precondition failed: UID in precondition: 78e8d91d-430e-4c7d-a3f9-156c461eedab, UID in object meta: , requeuing"
I think you can use https://pkg.go.dev/k8s.io/apiserver/pkg/storage#IsInvalidObj
if apierrors.IsNotFound(err) || strings.Contains(err.Error(), "StorageError") { | |
if apierrors.IsNotFound(err) || storage.IsInvalidObj(err) { |
It is odd that there's not an apierror for this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried, it is not like normal k8s package for client usage.
apierrors "k8s.io/apimachinery/pkg/api/errors"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see anything that prevents you from importing "k8s.io/apiserver/pkg/storage"
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
storage.IsInvalidObj
is not working as expected, revert first.
helm-controller/pkg/controllers/chart/chart.go Lines 266 to 270 in 255f905
The comment from @aiyengar2 here about |
Fix issue #177.
The
helm-delete
job may be created multi times and remaining job will affect next round of helm-chart removing.The current code depends on below block to clean the
helm-delete
job.But there are 2 parts not covering:
(1) Below code may fail, due to chart object is gone
the return error is like:
The second commit can fix it (validated).
In the test, the orphaned
helm-delete
job can stay a long time.(2) When the POD/running process is killed and the
generic.ConfigureApplyForObject
is not executed.The current code sololy depends on the executing of
generic.ConfigureApplyForObject
to clean objects in the last step ofOnRemove
.The first commit will fix this, it always checks the existing
helm-delete
job to avoid picking the orphaned jobs.On k8s, the object deleting is a bit different than creating, the reconciler can re-run to reach the target finally. But the OnRemove may not always have chance to re-run.