Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The helm-delete job is not cleaned when helm release is deleted by helmchart CR #177

Open
GonzoHsu opened this issue Mar 6, 2023 · 24 comments

Comments

@GonzoHsu
Copy link

GonzoHsu commented Mar 6, 2023

When I run kubectl delete helmchart <name> command, most of time I see the helm-delete- job is still existing, and it seems it is caused by duplicated helm-delete job created by helm-controller

For example, run kubectl delete helmchart nats,

The job and pod for helm delete are existing,

# kubectl get job | grep nats
helm-delete-nats                  1/1           3s         19m
# kubectl get pod | grep nats
helm-delete-nats-mn4xd                        0/1     Completed   0              19m

The event show the pod for /helm-delete-nats generated twice

# kubectl get event | grep nats
22m         Normal    SuccessfulCreate                  job/helm-delete-nats                    Created pod: helm-delete-nats-7h84g
22m         Normal    Scheduled                         pod/helm-delete-nats-7h84g              Successfully assigned default/helm-delete-nats-7h84g to 96e53ca49eb7e011eda21c000c290bc835
22m         Normal    Pulled                            pod/helm-delete-nats-7h84g              Container image "rancher/klipper-helm:v0.7.4-build20221121" already present on machine
22m         Normal    Created                           pod/helm-delete-nats-7h84g              Created container helm
22m         Normal    Started                           pod/helm-delete-nats-7h84g              Started container helm
22m         Normal    Killing                           pod/nats-0                              Stopping container nats
22m         Normal    Killing                           pod/nats-0                              Stopping container metrics
22m         Normal    Killing                           pod/nats-0                              Stopping container reloader
22m         Warning   CalculateExpectedPodCountFailed   poddisruptionbudget/nats                Failed to calculate the number of expected pods: statefulsets.apps does not implement the scale subresource
22m         Normal    Completed                         job/helm-delete-nats                    Job completed
22m         Normal    RemoveJob                         helmchart/nats                          Uninstalled HelmChart using Job default/helm-delete-nats, removing resources
21m         Normal    SuccessfulCreate                  job/helm-delete-nats                    Created pod: helm-delete-nats-mn4xd
21m         Normal    Scheduled                         pod/helm-delete-nats-mn4xd              Successfully assigned default/helm-delete-nats-mn4xd to 96e53ca49eb7e011eda21c000c290bc835
21m         Normal    Pulled                            pod/helm-delete-nats-mn4xd              Container image "rancher/klipper-helm:v0.7.4-build20221121" already present on machine
21m         Normal    Created                           pod/helm-delete-nats-mn4xd              Created container helm
21m         Normal    Started                           pod/helm-delete-nats-mn4xd              Started container helm
21m         Normal    Completed                         job/helm-delete-nats                    Job completed

And the existing pod logs are as following, I think it means the job does not find the helm release to delete.

# kubectl logs helm-delete-nats-mn4xd
if [[ ${KUBERNETES_SERVICE_HOST} =~ .*:.* ]]; then
	echo "KUBERNETES_SERVICE_HOST is using IPv6"
	CHART="${CHART//%\{KUBERNETES_API\}%/[${KUBERNETES_SERVICE_HOST}]:${KUBERNETES_SERVICE_PORT}}"
else
	CHART="${CHART//%\{KUBERNETES_API\}%/${KUBERNETES_SERVICE_HOST}:${KUBERNETES_SERVICE_PORT}}"
fi

set +v -x
+ [[ '' != \t\r\u\e ]]
+ export HELM_HOST=127.0.0.1:44134
+ HELM_HOST=127.0.0.1:44134
+ tiller --listen=127.0.0.1:44134 --storage=secret
+ helm_v2 init --skip-refresh --client-only --stable-repo-url https://charts.helm.sh/stable/
[main] 2023/03/06 09:51:27 Starting Tiller v2.17.0 (tls=false)
[main] 2023/03/06 09:51:27 GRPC listening on 127.0.0.1:44134
[main] 2023/03/06 09:51:27 Probes listening on :44135
[main] 2023/03/06 09:51:27 Storage driver is Secret
[main] 2023/03/06 09:51:27 Max history per release is 0
Creating /home/klipper-helm/.helm 
Creating /home/klipper-helm/.helm/repository 
Creating /home/klipper-helm/.helm/repository/cache 
Creating /home/klipper-helm/.helm/repository/local 
Creating /home/klipper-helm/.helm/plugins 
Creating /home/klipper-helm/.helm/starters 
Creating /home/klipper-helm/.helm/cache/archive 
Creating /home/klipper-helm/.helm/repository/repositories.yaml 
Adding stable repo with URL: https://charts.helm.sh/stable/ 
Adding local repo with URL: http://127.0.0.1:8879/charts 
$HELM_HOME has been configured at /home/klipper-helm/.helm.
Not installing Tiller due to 'client-only' flag having been set
++ timeout -s KILL 30 helm_v2 ls --all '^nats$' --output json
++ jq -r '.Releases | length'
[storage] 2023/03/06 09:51:27 listing all releases with filter
+ V2_CHART_EXISTS=
+ [[ '' == \1 ]]
+ [[ '' == \v\2 ]]
+ [[ -f /config/ca-file.pem ]]
+ [[ -n '' ]]
+ shopt -s nullglob
+ helm_content_decode
+ set -e
+ ENC_CHART_PATH=/chart/nats.tgz.base64
+ CHART_PATH=/tmp/nats.tgz
+ [[ ! -f /chart/nats.tgz.base64 ]]
+ base64 -d /chart/nats.tgz.base64
+ CHART=/tmp/nats.tgz
+ set +e
+ [[ delete != \d\e\l\e\t\e ]]
+ helm_update delete
+ [[ helm_v3 == \h\e\l\m\_\v\3 ]]
++ helm_v3 ls --all -f '^nats$' --namespace default --output json
++ jq -r '"\(.[0].app_version),\(.[0].status)"'
++ tr '[:upper:]' '[:lower:]'
+ LINE=null,null
+ IFS=,
+ read -r INSTALLED_VERSION STATUS _
+ VALUES=
+ for VALUES_FILE in /config/*.yaml
+ VALUES=' --values /config/values-01_HelmChart.yaml'
+ [[ delete = \d\e\l\e\t\e ]]
+ [[ -z null ]]
+ [[ helm_v3 == \h\e\l\m\_\v\3 ]]
+ echo 'Uninstalling helm_v3 chart'
+ helm_v3 uninstall nats --namespace default
Error: uninstall: Release not loaded: nats: release: not found
+ true
+ exit
@brunomedeirosdedalus
Copy link

brunomedeirosdedalus commented May 22, 2023

Yep, we are also experiencing the same behavior.

@brandond
Copy link
Member

brandond commented May 22, 2023

I'm unable to replicate this. What version of k3s or helm-controller are you on? The code here should delete any objects owned by the HelmChart:

// uninstall job has successfully finished!
c.recorder.Eventf(chart, corev1.EventTypeNormal, "RemoveJob", "Uninstalled HelmChart using Job %s/%s, removing resources", job.Namespace, job.Name)
// note: an empty apply removes all resources owned by this chart
err = generic.ConfigureApplyForObject(c.apply, chart, &generic.GeneratingHandlerOptions{
AllowClusterScoped: true,
}).
WithOwner(chart).
WithSetID("helm-chart-registration").
ApplyObjects()
if err != nil {
return nil, fmt.Errorf("unable to remove resources tied to HelmChart %s/%s: %s", chart.Namespace, chart.Name, err)
}

@GonzoHsu
Copy link
Author

The k3s version I used is:
k3s version v1.23.16+k3s1 (64b0feeb)

This condition will not happen every time, if you tried replicate it in a short time slot, you may not able to find it.

I also check the code and I think the following part cause that, but I am not sure what timing can cause that.

err = generic.ConfigureApplyForObject(c.apply, chart, &generic.GeneratingHandlerOptions{
AllowClusterScoped: true,
}).
WithOwner(chart).
WithSetID("helm-chart-registration").
ApplyObjects(append(objs, expectedJob)...)

When the condition happened, not only the delete job, but also some configmaps of helmchart is created again.

@brandond
Copy link
Member

This condition will not happen every time, if you tried replicate it in a short time slot, you may not able to find it.

I've run the end to end tests quite a few times and not been able to reproduce it; do you have any circumstances or specific steps that seem to contribute to it? Deleting the HelmChart too quickly after creating it, deleting the namespace before the HelmChart, or so on?

@GonzoHsu
Copy link
Author

My circumstance is deploying a Rocky Linux VM on VMware ESXi server with 2 vCPU and 8G RAM, then install k3s on the VM.

I didn't do specific steps to create/delete the HelmChart, just use command "kubectl apply/delete -f <HelmChart.yaml>"

@brandond
Copy link
Member

How long did you wait between applying and deleting it? Did the install succeed? Was it still mid-install when deleted?

@GonzoHsu
Copy link
Author

The install is succeed, I can see my pod runs without problem, not in mid-install state.
The time between applying and deleting can be few minutes to few days, I didn't find relation about it.

@w13915984028
Copy link

w13915984028 commented Jun 29, 2023

I suspect we encounter similar issue in Harvester.

When create/delete the same HelmChart (only deployment, not CRD) frequently, there are bugs:

The previous left delete job, will be picked by the next time re-installed helmchart, when which is deleted again.

harv41:/home/rancher # kk get jobs -A  && kk get jobs -n cattle-monitoring-system helm-delete-rancher-monitoring -oyaml
NAMESPACE                  NAME                               COMPLETIONS   DURATION   AGE
cattle-monitoring-system   helm-delete-rancher-monitoring     1/1           4s         3m41s
kube-system                helm-install-rke2-canal            1/1           12s        7d2h

status:
  completionTime: "2023-06-29T10:31:52Z"
  conditions:
  - lastProbeTime: "2023-06-29T10:31:52Z"
    lastTransitionTime: "2023-06-29T10:31:52Z"
    status: "True"
    type: Complete
  ready: 0
  startTime: "2023-06-29T10:31:48Z"
  succeeded: 1
  uncountedTerminatedPods: {}

This bug, causes the HelmChart related downstream deployments are still left there.

@w13915984028
Copy link

Harvester addon is on top of HelmChart, and we are adding workaround: each time when deploy or delete the HelmChart, delete the potential previous job first.

harvester/harvester#4108 (comment)

@w13915984028
Copy link

w13915984028 commented Jun 29, 2023

@brandond
The JobName of a chart is fixed, not generated, if previous job is not out-dated, it maybe re-picked next time, and assume the job is done quickly.

From my test in Harvester, it is highly possible due to this, both install and delete.

harv41:/home/rancher # kk get job -A
NAMESPACE                  NAME                               COMPLETIONS   DURATION   AGE
cattle-monitoring-system   helm-install-rancher-monitoring    1/1           52s        3m13s

image

@w13915984028
Copy link

w13915984028 commented Jun 29, 2023

One HelmChart delete seems triggers 2 times of delete job, the HelmChart is gone, but one job is left, and it will affect next round delete.

Did following test, log events and jobs:

First round:

1.1 create HelmChart
cattle-monitoring-system   13m         Normal    SuccessfulCreate         job/helm-install-rancher-monitoring                           Created pod: helm-install-rancher-monitoring-wnrm6
cattle-monitoring-system   13m         Normal    Completed                job/helm-install-rancher-monitoring                           Job completed

1.2 delete Helmchart
cattle-monitoring-system   12m         Normal    SuccessfulCreate         job/helm-delete-rancher-monitoring                            Created pod: helm-delete-rancher-monitoring-d2n62
cattle-monitoring-system   12m         Normal    Completed                job/helm-delete-rancher-monitoring                            Job completed
cattle-monitoring-system   12m         Normal    SuccessfulCreate         job/helm-delete-rancher-monitoring                            Created pod: helm-delete-rancher-monitoring-7lkj9
cattle-monitoring-system   12m         Normal    Completed                job/helm-delete-rancher-monitoring                            Job completed

1.3 2 jobs are left:
NAMESPACE                  NAME                               COMPLETIONS   DURATION   AGE
cattle-monitoring-system   helm-delete-rancher-monitoring     1/1           4s         9m48s
cattle-monitoring-system   helm-install-rancher-monitoring    1/1           39s        47s

Second round:

2.1 create HelmChart
cattle-monitoring-system   3m10s       Normal    SuccessfulCreate         job/helm-install-rancher-monitoring                           Created pod: helm-install-rancher-monitoring-b2kcz
cattle-monitoring-system   2m31s       Normal    Completed                job/helm-install-rancher-monitoring                           Job completed

2.2 delete Helmchart
cattle-monitoring-system   117s        Normal    SuccessfulCreate         job/helm-delete-rancher-monitoring                            Created pod: helm-delete-rancher-monitoring-5dvlw
cattle-monitoring-system   107s        Normal    Completed                job/helm-delete-rancher-monitoring                            Job completed
cattle-monitoring-system   105s        Normal    SuccessfulCreate         job/helm-delete-rancher-monitoring                            Created pod: helm-delete-rancher-monitoring-8dfh9
cattle-monitoring-system   101s        Normal    Completed                job/helm-delete-rancher-monitoring                            Job completed



2.3 1 job is left 
NAMESPACE                  NAME                               COMPLETIONS   DURATION   AGE
cattle-monitoring-system   helm-delete-rancher-monitoring     1/1           4s         2m57s

With a workaround in Harvester to forcely delete the job before trigger HelmChart action, we can avoid such issue now.
harvester/harvester#4127

time="2023-06-29T14:56:52Z" level=info msg="OnChange: user disable addon, move from AddonDeploySuccessful to new disable status AddonDisabling"



///// before delete HelmChart, delete those 2 left jobs

time="2023-06-29T14:56:52Z" level=info msg="previous job cattle-monitoring-system/helm-delete-rancher-monitoring is to be deleted, wait"
W0629 14:56:52.802777       7 warnings.go:70] child pods are preserved by default when jobs are deleted; set propagationPolicy=Background to remove them or set propagationPolicy=Orphan to suppress this warning
time="2023-06-29T14:56:52Z" level=info msg="previous job cattle-monitoring-system/helm-install-rancher-monitoring is to be deleted, wait"
W0629 14:56:52.813165       7 warnings.go:70] child pods are preserved by default when jobs are deleted; set propagationPolicy=Background to remove them or set propagationPolicy=Orphan to suppress this warning


time="2023-06-29T14:56:57Z" level=info msg="delete the helm chart cattle-monitoring-system/rancher-monitoring"

...
time="2023-06-29T14:57:10Z" level=info msg="addon rancher-monitoring: helm chart is gone, or owned false, addon is in AddonDisabling status, move to init state"

@brandond
Copy link
Member

@w13915984028 I am curious why you don't see any of the events from the helm controller itself when you delete the chart. Are you filtering these out? What events and log messages do you see?

Rather than adding code to Harvester to manually clean up after helm-controller, would you mind making an attempt to fix the unwanted behavior here? It should be somewhere in the OnRemove function at https://github.com/k3s-io/helm-controller/blob/master/pkg/controllers/chart/chart.go#L201.

@x-coder-L
Copy link

x-coder-L commented Jan 18, 2024

@brandond We meet the same problem, and it's real weird that HelmChart delete seems triggers 2 times of delete job.
Here is our environment version:
k3s version v1.28.4+k3s2 (6ba6c1b6)
go version go1.20.11
Did Following test,command ,log event and jobs:
1.First Round:
Apply the test helmchart.

[root@localhost k8s_env]# kubectl apply -f run-test.yaml 
helmchart.helm.cattle.io/test created

Check the helm-install job ,event logs and pod resource

[root@localhost k8s_env]# kubectl get job -A
NAMESPACE     NAME                              COMPLETIONS   DURATION   AGE
kube-system   helm-install-test                 0/1           9s         9s
[root@localhost k8s_env]# kubectl get event -A
kube-system            4s          Normal    SuccessfulCreate                 job/helm-install-test                               Created pod: helm-install-test-5v2kp
kube-system            4s          Normal    Scheduled                        pod/helm-install-test-5v2kp                         Successfully assigned kube-system/helm-install-test-5v2kp to localhost
kube-system            4s          Normal    ApplyJob                         helmchart/test                                      Applying HelmChart using Job kube-system/helm-install-test
kube-system            4s          Normal    AddedInterface                   pod/helm-install-test-5v2kp                         Add eth0 [10.42.2.123/16] from cnibr
kube-system            4s          Normal    Pulled                           pod/helm-install-test-5v2kp                         Container image "rancher/klipper-helm:v0.8.2-build20230815" already present on machine
kube-system            4s          Normal    Created                          pod/helm-install-test-5v2kp                         Created container helm
kube-system            4s          Normal    Started                          pod/helm-install-test-5v2kp                         Started container helm
default                3s          Normal    ScalingReplicaSet                deployment/test                                     Scaled up replica set test-9f64fdb7f to 1
default                3s          Normal    SuccessfulCreate                 replicaset/test-9f64fdb7f                           Created pod: test-9f64fdb7f-smf22
default                3s          Normal    Scheduled                        pod/test-9f64fdb7f-smf22                            Successfully assigned default/test-9f64fdb7f-smf22 to localhost
default                3s          Normal    AddedInterface                   pod/test-9f64fdb7f-smf22                            Add eth0 [10.42.2.124/16] from cnibr
default                3s          Normal    Pulled                           pod/test-9f64fdb7f-smf22                            Container image "test" already present on machine
default                3s          Normal    Created                          pod/test-9f64fdb7f-smf22                            Created container test
default                3s          Normal    Started                          pod/test-9f64fdb7f-smf22                            Started container test
kube-system            1s          Normal    Completed                        job/helm-install-test                               Job completed
[root@localhost k8s_env]# kubectl get pod 
NAME                   READY   STATUS    RESTARTS   AGE
test-9f64fdb7f-smf22   1/1     Running   0          3m13s

Delete helmchart after pod running and check event logs and job.

[root@localhost k8s_env]# kubectl delete -f run-test.yaml 
helmchart.helm.cattle.io "test" deleted
[root@localhost k8s_env]# kubectl get pod 
NAME                   READY   STATUS        RESTARTS   AGE
test-9f64fdb7f-smf22   1/1     Terminating   0          4m50s
[root@localhost k8s_env]# kubectl get job -A
NAMESPACE     NAME                              COMPLETIONS   DURATION   AGE
kube-system   helm-delete-test                  1/1           3s         15s|
[root@localhost k8s_env]# kubectl get event -A
kube-system            61s         Normal    SuccessfulCreate                 job/helm-delete-test                                Created pod: helm-delete-test-49284
kube-system            60s         Normal    Scheduled                        pod/helm-delete-test-49284                          Successfully assigned kube-system/helm-delete-test-49284 to localhost
kube-system            61s         Normal    AddedInterface                   pod/helm-delete-test-49284                          Add eth0 [10.42.2.126/16] from cnibr
kube-system            60s         Normal    Pulled                           pod/helm-delete-test-49284                          Container image "rancher/klipper-helm:v0.8.2-build20230815" already present on machine
kube-system            60s         Normal    Created                          pod/helm-delete-test-49284                          Created container helm
kube-system            60s         Normal    Started                          pod/helm-delete-test-49284                          Started container helm
default                60s         Normal    Killing                          pod/test-9f64fdb7f-smf22                            Stopping container test
kube-system            58s         Normal    Completed                        job/helm-delete-test                                Job completed
kube-system            55s         Normal    RemoveJob                        helmchart/test                                      Uninstalled HelmChart using Job kube-system/helm-delete-test, removing resources
kube-system            55s         Normal    SuccessfulCreate                 job/helm-delete-test                                Created pod: helm-delete-test-vj4b2
kube-system            54s         Normal    Scheduled                        pod/helm-delete-test-vj4b2                          Successfully assigned kube-system/helm-delete-test-vj4b2 to localhost
kube-system            55s         Normal    AddedInterface                   pod/helm-delete-test-vj4b2                          Add eth0 [10.42.2.127/16] from cnibr
kube-system            54s         Normal    Pulled                           pod/helm-delete-test-vj4b2                          Container image "rancher/klipper-helm:v0.8.2-build20230815" already present on machine
kube-system            54s         Normal    Created                          pod/helm-delete-test-vj4b2                          Created container helm
kube-system            54s         Normal    Started                          pod/helm-delete-test-vj4b2                          Started container helm
kube-system            52s         Normal    Completed                        job/helm-delete-test                                Job completed
[root@localhost k8s_env]# kubectl get pod 
No resources found in default namespace.

From the log events and job , we find after the first delete job completed and removed ,k3s start a second delete job without removed,so there is a delete job left.

2.Second Round:

[root@localhost k8s_env]# kubectl apply -f run-test.yaml 
helmchart.helm.cattle.io/test created
[root@localhost k8s_env]# kubectl get pod
NAME                   READY   STATUS    RESTARTS   AGE
test-9f64fdb7f-wbhrd   1/1     Running   0          5s
[root@localhost k8s_env]# kubectl get jobs.batch -A
NAMESPACE     NAME                              COMPLETIONS   DURATION   AGE
kube-system   helm-delete-test                  1/1           3s         8m20s
kube-system   helm-install-test                 1/1           3s         13s
[root@localhost k8s_env]# kubectl get event -A
kube-system            48s         Normal    SuccessfulCreate                 job/helm-install-test                               Created pod: helm-install-test-9jkgj
kube-system            47s         Normal    Scheduled                        pod/helm-install-test-9jkgj                         Successfully assigned kube-system/helm-install-test-9jkgj to localhost
kube-system            48s         Normal    ApplyJob                         helmchart/test                                      Applying HelmChart using Job kube-system/helm-install-test
kube-system            47s         Normal    AddedInterface                   pod/helm-install-test-9jkgj                         Add eth0 [10.42.2.128/16] from cnibr
kube-system            47s         Normal    Pulled                           pod/helm-install-test-9jkgj                         Container image "rancher/klipper-helm:v0.8.2-build20230815" already present on machine
kube-system            47s         Normal    Created                          pod/helm-install-test-9jkgj                         Created container helm
kube-system            47s         Normal    Started                          pod/helm-install-test-9jkgj                         Started container helm
default                47s         Normal    ScalingReplicaSet                deployment/test                                     Scaled up replica set test-9f64fdb7f to 1
default                47s         Normal    SuccessfulCreate                 replicaset/test-9f64fdb7f                           Created pod: test-9f64fdb7f-wbhrd
default                46s         Normal    Scheduled                        pod/test-9f64fdb7f-wbhrd                            Successfully assigned default/test-9f64fdb7f-wbhrd to localhost
default                46s         Normal    AddedInterface                   pod/test-9f64fdb7f-wbhrd                            Add eth0 [10.42.2.129/16] from cnibr
default                46s         Normal    Pulled                           pod/test-9f64fdb7f-wbhrd                            Container image "test" already present on machine
default                46s         Normal    Created                          pod/test-9f64fdb7f-wbhrd                            Created container test
default                46s         Normal    Started                          pod/test-9f64fdb7f-wbhrd                            Started container test
kube-system            45s         Normal    Completed                        job/helm-install-test                               Job completed
[root@localhost k8s_env]# kubectl delete -f run-test.yaml 
helmchart.helm.cattle.io "test" deleted
[root@localhost k8s_env]# kubectl get pod 
NAME                   READY   STATUS    RESTARTS   AGE
test-9f64fdb7f-wbhrd   1/1     Running   0          3m6s
[root@localhost k8s_env]# kubectl get job -A
NAMESPACE     NAME                              COMPLETIONS   DURATION   AGE
[root@localhost k8s_env]# kubectl get event -A
kube-system            46s         Normal    RemoveJob                        helmchart/test                                      Uninstalled HelmChart using Job kube-system/helm-delete-test, removing resources

In the second round , helm-delete process is just remove the left job after first round,but the pod is still running.

3.Third Round:
we want check more helm-controller debug logs,so we set disable-helm-controller arg,and download helm controller process with version v0.15.4 which is the same version used in k3s version v1.28.4 .After we run the program manually and do the same test,we get the different result.

[root@localhost k8s_env]# vim /usr/lib/systemd/system/k3s.service
[root@localhost k8s_env]# systemctl daemon-reload 
[root@localhost k8s_env]# systemctl restart k3s
[root@localhost helm-controller]# ./helm-controller-amd64 --kubeconfig /etc/rancher/k3s/k3s.yaml &
[root@localhost k8s_env]# kubectl apply -f run-test.yaml 
helmchart.helm.cattle.io/test created
[root@localhost k8s_env]# kubectl get pod 
NAME                   READY   STATUS    RESTARTS   AGE
test-9f64fdb7f-cjlxc   1/1     Running   0          3s
[root@localhost k8s_env]# kubectl get job -A
NAMESPACE     NAME                              COMPLETIONS   DURATION   AGE
kube-system   helm-install-test                 1/1           4s         11s
[root@localhost k8s_env]# kubectl get event -A
kube-system            49s         Normal    SuccessfulCreate                 job/helm-install-test                               Created pod: helm-install-test-tmmgm
kube-system            48s         Normal    Scheduled                        pod/helm-install-test-tmmgm                         Successfully assigned kube-system/helm-install-test-tmmgm to localhost
kube-system            49s         Normal    AddedInterface                   pod/helm-install-test-tmmgm                         Add eth0 [10.42.2.144/16] from cnibr
kube-system            48s         Normal    Pulled                           pod/helm-install-test-tmmgm                         Container image "rancher/klipper-helm:v0.8.2-build20230815" already present on machine
kube-system            48s         Normal    Created                          pod/helm-install-test-tmmgm                         Created container helm
kube-system            48s         Normal    Started                          pod/helm-install-test-tmmgm                         Started container helm
default                48s         Normal    ScalingReplicaSet                deployment/test                                     Scaled up replica set test-9f64fdb7f to 1
default                48s         Normal    SuccessfulCreate                 replicaset/test-9f64fdb7f                           Created pod: test-9f64fdb7f-cjlxc
default                47s         Normal    Scheduled                        pod/test-9f64fdb7f-cjlxc                            Successfully assigned default/test-9f64fdb7f-cjlxc to localhost
default                47s         Normal    AddedInterface                   pod/test-9f64fdb7f-cjlxc                            Add eth0 [10.42.2.145/16] from cnibr
default                47s         Normal    Pulled                           pod/test-9f64fdb7f-cjlxc                            Container image "test" already present on machine
default                47s         Normal    Created                          pod/test-9f64fdb7f-cjlxc                            Created container test
default                47s         Normal    Started                          pod/test-9f64fdb7f-cjlxc                            Started container test
kube-system            45s         Normal    Completed                        job/helm-install-test                               Job completed
kube-system            45s         Normal    ApplyJob                         helmchart/test                                      Applying HelmChart using Job kube-system/helm-install-test
[root@localhost k8s_env]# kubectl delete -f run-test.yaml 
helmchart.helm.cattle.io "test" deleted
[root@localhost k8s_env]# kubectl get pod 
NAME                   READY   STATUS        RESTARTS   AGE
test-9f64fdb7f-cjlxc   1/1     Terminating   0          2m45s
[root@localhost k8s_env]# kubectl get job -A
NAMESPACE     NAME                              COMPLETIONS   DURATION   AGE
[root@localhost k8s_env]# kubectl get event -A
kube-system            52s         Normal    SuccessfulCreate                 job/helm-delete-test                                Created pod: helm-delete-test-sjfpn
kube-system            51s         Normal    Scheduled                        pod/helm-delete-test-sjfpn                          Successfully assigned kube-system/helm-delete-test-sjfpn to localhost
kube-system            52s         Normal    AddedInterface                   pod/helm-delete-test-sjfpn                          Add eth0 [10.42.2.146/16] from cnibr
kube-system            51s         Normal    Pulled                           pod/helm-delete-test-sjfpn                          Container image "rancher/klipper-helm:v0.8.2-build20230815" already present on machine
kube-system            51s         Normal    Created                          pod/helm-delete-test-sjfpn                          Created container helm
kube-system            51s         Normal    Started                          pod/helm-delete-test-sjfpn                          Started container helm
default                51s         Normal    Killing                          pod/test-9f64fdb7f-cjlxc                            Stopping container test
kube-system            49s         Normal    Completed                        job/helm-delete-test                                Job completed
kube-system            46s         Normal    SuccessfulCreate                 job/helm-delete-test                                Created pod: helm-delete-test-qq5zc
kube-system            45s         Normal    Scheduled                        pod/helm-delete-test-qq5zc                          Successfully assigned kube-system/helm-delete-test-qq5zc to localhost
kube-system            46s         Normal    AddedInterface                   pod/helm-delete-test-qq5zc                          Add eth0 [10.42.2.147/16] from cnibr
kube-system            45s         Normal    Pulled                           pod/helm-delete-test-qq5zc                          Container image "rancher/klipper-helm:v0.8.2-build20230815" already present on machine
kube-system            45s         Normal    Created                          pod/helm-delete-test-qq5zc                          Created container helm
kube-system            45s         Normal    Started                          pod/helm-delete-test-qq5zc                          Started container helm
kube-system            43s         Normal    Completed                        job/helm-delete-test                                Job completed
kube-system            40s         Normal    RemoveJob                        helmchart/test                                      Uninstalled HelmChart using Job kube-system/helm-delete-test, removing resources

Here, you can see the different solutions to the test. The helm-delete job has been cleaned up and the event logs show that after two jobs are completed, the RemoveJob is triggered.

Here is helm-controller logs:

[root@localhost helm-controller]# INFO[0863] Event(v1.ObjectReference{Kind:"HelmChart", Namespace:"kube-system", Name:"test", UID:"cd455f26-15c0-4318-85e7-d34b06aa4f04", APIVersion:"helm.cattle.io/v1", ResourceVersion:"255465", FieldPath:""}): type: 'Normal' reason: 'ApplyJob' Applying HelmChart using Job kube-system/helm-install-test 
ERRO[0863] error syncing 'kube-system/test': handler helm-controller-chart-registration: helmcharts.helm.cattle.io "test" not found, requeuing 
INFO[0863] Event(v1.ObjectReference{Kind:"HelmChart", Namespace:"kube-system", Name:"test", UID:"cd455f26-15c0-4318-85e7-d34b06aa4f04", APIVersion:"helm.cattle.io/v1", ResourceVersion:"255465", FieldPath:""}): type: 'Normal' reason: 'ApplyJob' Applying HelmChart using Job kube-system/helm-install-test 
INFO[0863] Event(v1.ObjectReference{Kind:"HelmChart", Namespace:"kube-system", Name:"test", UID:"cd455f26-15c0-4318-85e7-d34b06aa4f04", APIVersion:"helm.cattle.io/v1", ResourceVersion:"255472", FieldPath:""}): type: 'Normal' reason: 'ApplyJob' Applying HelmChart using Job kube-system/helm-install-test 
INFO[0863] Event(v1.ObjectReference{Kind:"HelmChart", Namespace:"kube-system", Name:"test", UID:"cd455f26-15c0-4318-85e7-d34b06aa4f04", APIVersion:"helm.cattle.io/v1", ResourceVersion:"255472", FieldPath:""}): type: 'Normal' reason: 'ApplyJob' Applying HelmChart using Job kube-system/helm-install-test 
INFO[0866] Event(v1.ObjectReference{Kind:"HelmChart", Namespace:"kube-system", Name:"test", UID:"cd455f26-15c0-4318-85e7-d34b06aa4f04", APIVersion:"helm.cattle.io/v1", ResourceVersion:"255472", FieldPath:""}): type: 'Normal' reason: 'ApplyJob' Applying HelmChart using Job kube-system/helm-install-test 
INFO[0866] Event(v1.ObjectReference{Kind:"HelmChart", Namespace:"kube-system", Name:"test", UID:"cd455f26-15c0-4318-85e7-d34b06aa4f04", APIVersion:"helm.cattle.io/v1", ResourceVersion:"255472", FieldPath:""}): type: 'Normal' reason: 'ApplyJob' Applying HelmChart using Job kube-system/helm-install-test 
ERRO[1011] error syncing 'kube-system/test': handler on-helm-chart-remove: waiting for delete of helm chart for kube-system/test by helm-delete-test, requeuing 
INFO[1014] Event(v1.ObjectReference{Kind:"HelmChart", Namespace:"kube-system", Name:"test", UID:"cd455f26-15c0-4318-85e7-d34b06aa4f04", APIVersion:"helm.cattle.io/v1", ResourceVersion:"255707", FieldPath:""}): type: 'Normal' reason: 'RemoveJob' Uninstalled HelmChart using Job kube-system/helm-delete-test, removing resources 
ERRO[1017] error syncing 'kube-system/test': handler on-helm-chart-remove: waiting for delete of helm chart for kube-system/test by helm-delete-test, requeuing 
INFO[1020] Event(v1.ObjectReference{Kind:"HelmChart", Namespace:"kube-system", Name:"test", UID:"cd455f26-15c0-4318-85e7-d34b06aa4f04", APIVersion:"helm.cattle.io/v1", ResourceVersion:"255739", FieldPath:""}): type: 'Normal' reason: 'RemoveJob' Uninstalled HelmChart using Job kube-system/helm-delete-test, removing resources 

It's strange that the test with different solutions resulted in the following questions:
1.Why was the helm-delete job triggered twice?
2.Check the logic and logs of helm-controller, if a job is complete ,the RemoveJob will be triggered.Why the RemoveJob was triggered once in first round and second round test which the helm-delete job was triggered twice?

@up-wei
Copy link

up-wei commented Feb 23, 2024

Any resolution in 2024?

@up-wei
Copy link

up-wei commented Apr 10, 2024

I reproduced it in my environment. There is a high probability of reproducing it in my environment.
There are my questions:
Why there is still a helm-delete-nginx job after the first delete manifest(Reproduce step 2) is executed successfully?
Why are the deployment and pods related to the manifest not deleted after the second delete manifest(Reproduce step 4)?

Here is my environment and steps to reproduce.
Environmental Info:

  • k3s version
    k3s version v1.29.2+k3s1 (86f10213)
    go version #go1.21.7
  • OS version
    Rocky Linux 8.9 (Green Obsidian)
  • System infomation
   Linux localhost.localdomain 4.18.0-513.5.1.el8_9.x86_64 #1 SMP Fri Nov 17 03:31:10 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
  • Cluster info
    1 server 1 agent
  • K3s systemd service info
[Unit]
Description=Lightweight Kubernetes
Documentation=https://k3s.io
Wants=network-online.target
After=network-online.target

[Install]
WantedBy=multi-user.target

[Service]
Type=notify
EnvironmentFile=-/etc/default/%N
EnvironmentFile=-/etc/sysconfig/%N
EnvironmentFile=-/usr/lib/systemd/system/k3s.service.env
KillMode=process
Delegate=yes
# Having non-zero Limit*s causes performance problems due to accounting overhead
# in the kernel. We recommend using cgroups to do container-local accounting.
LimitNOFILE=1048576
LimitNPROC=infinity
LimitCORE=infinity
TasksMax=infinity
TimeoutStartSec=0
Restart=always
RestartSec=5s
ExecStartPre=/bin/sh -xc '! /usr/bin/systemctl is-enabled --quiet nm-cloud-setup.service 2>/dev/null'
ExecStartPre=-/sbin/modprobe br_netfilter
ExecStartPre=-/sbin/modprobe overlay
ExecStart=/usr/bin/k3s \
    server \

Reproduce Preparation

  • manifest content (The content is saved in mainfest.yaml in my env. )
apiVersion: helm.cattle.io/v1
kind: HelmChart
metadata:
  name: nginx
  namespace: kube-system
spec:
  chart: https://%{KUBERNETES_API}%/static/charts/nginx-15.14.2.tgz
  targetNamespace: default

Reproduce Step:

  • Step1: Apply the mainfest.

[root@localhost testing]# kubectl apply -f mainfest.yaml
helmchart.helm.cattle.io/nginx created

After applying the mainfest is successful. Will see the following:

[root@localhost testing]# kubectl get helmchart -n kube-system nginx
NAME    JOB                  CHART                                                        TARGETNAMESPACE   VERSION   REPO   HELMVERSION   BOOTSTRAP
nginx   helm-install-nginx   https://%{KUBERNETES_API}%/static/charts/nginx-15.14.2.tgz   default

[root@localhost testing]# kubectl get deployment
NAME    READY   UP-TO-DATE   AVAILABLE   AGE
nginx   1/1     1            1           3m34s

[root@localhost testing]# kubectl get pod 
NAME                     READY   STATUS    RESTARTS   AGE
nginx-5864d77f94-qlmtx   1/1     Running   0          3m11s

[root@localhost testing]# kubectl get job -n kube-system
NAME                       COMPLETIONS   DURATION   AGE
helm-install-traefik-crd   1/1           24s        3h9m
helm-install-traefik       1/1           37s        3h9m
helm-install-nginx         1/1           4s         4m2s
  • Step 2: Delete the mainfest.

[root@localhost testing]# kubectl delete -f mainfest.yaml
helmchart.helm.cattle.io "nginx" deleted

After successfully deleting the mainfest. Will see the following:

[root@localhost testing]# kubectl get helmchart -n kube-system nginx
Error from server (NotFound): helmcharts.helm.cattle.io "nginx" not found

[root@localhost testing]# kubectl get deployment
No resources found in default namespace.

[root@localhost testing]# kubectl get pod 
No resources found in default namespace.

[root@localhost testing]# kubectl get job -n kube-system
NAME                       COMPLETIONS   DURATION   AGE
helm-install-traefik-crd   1/1           24s        3h21m
helm-install-traefik       1/1           37s        3h21m
helm-delete-nginx          1/1           4s         4m55s
  • Step 3: Apply the same manifest again.

[root@localhost testing]# kubectl apply -f mainfest.yaml 
helmchart.helm.cattle.io/nginx created

After successfully deleting the mainfest again. Will see the following:

[root@localhost testing]# kubectl get helmchart -n kube-system
NAME          JOB                        CHART                                                                      TARGETNAMESPACE   VERSION   REPO   HELMVERSION   BOOTSTRAP                                                    
nginx         helm-install-nginx         https://%{KUBERNETES_API}%/static/charts/nginx-15.14.2.tgz                 default

[root@localhost testing]# kubectl get deployment
NAME    READY   UP-TO-DATE   AVAILABLE   AGE
nginx   1/1     1            1           2m28s

[root@localhost testing]# kubectl get pod
NAME                     READY   STATUS    RESTARTS   AGE
nginx-5864d77f94-lrhlc   1/1     Running   0          2m39s

[root@localhost testing]# kubectl get job -n kube-system
NAME                       COMPLETIONS   DURATION   AGE
helm-install-traefik-crd   1/1           24s        3h26m
helm-install-traefik       1/1           37s        3h26m
helm-delete-nginx          1/1           4s         9m23s
helm-install-nginx         1/1           6s         3m3s
  • Step 4: Delete the same manifest again.

[root@localhost testing]# kubectl delete -f mainfest.yaml 
helmchart.helm.cattle.io "nginx" deleted

After successfully deleting the mainfest again. Will see the following:

[root@localhost testing]# kubectl get helmchart -n kube-system nginx
Error from server (NotFound): helmcharts.helm.cattle.io "nginx" not found

[root@localhost testing]# kubectl get deployment 
NAME    READY   UP-TO-DATE   AVAILABLE   AGE
nginx   1/1     1            1           9m32s
[root@localhost testing]# kubectl get pod 
NAME                     READY   STATUS    RESTARTS   AGE
nginx-5864d77f94-lrhlc   1/1     Running   0          9m46s

[root@localhost testing]# kubectl get job -n kube-system
NAME                       COMPLETIONS   DURATION   AGE
helm-install-traefik-crd   1/1           24s        3h33m
helm-install-traefik       1/1           37s        3h33m

@x-coder-L
Copy link

x-coder-L commented Apr 10, 2024

I reproduced it in my environment. There is a high probability of reproducing it in my environment.

@brandond This situation is same to my test.We are certain that we can reproduce this issue 100%.
Could this issue possibly be caused by different pkg versions that k3s and helm-controller both depend on?

@w13915984028
Copy link

w13915984028 commented Apr 14, 2024

Finally, I guess the root cause is found:

  1. This issue can be easily reproduced, as long as you create & delete the helm-chart object quickly.

Below log confirmed my guess, will explain in following steps.

Apr 14 20:25:40 harv2 rke2[1997]: time="2024-04-14T20:25:40Z" level=error msg="error syncing 'cattle-logging-system/rancher-logging': handler helm-controller-chart-registration: helmcharts.helm.cattle.io \"rancher-logging\" not found, requeuing"
Apr 14 20:27:02 harv2 rke2[1997]: time="2024-04-14T20:27:02Z" level=error msg="error syncing 'cattle-logging-system/rancher-logging': handler on-helm-chart-remove: waiting for delete of helm chart for cattle-logging-system/rancher-logging by helm-delete-rancher-logging, requeuing"
Apr 14 20:27:08 harv2 rke2[1997]: time="2024-04-14T20:27:08Z" level=error msg="error syncing 'cattle-logging-system/rancher-logging': handler on-helm-chart-remove: unable to update status of helm chart to add uninstall job name helm-delete-rancher-logging, requeuing"
  1. The tricky comes from the time sequence:
func (c *Controller) OnRemove(key string, chart *v1.HelmChart) (*v1.HelmChart, error) {
...
	expectedJob, objs, err := c.getJobAndRelatedResources(chart)
...
--> (A) create a job
	// note: on the logic of running an apply here...
	// if the uninstall job does not exist, it will create it
	// if the job already exists and it is uninstalling, nothing will change since there's no need to patch
	// if the job already exists but is tied to an install or upgrade, there will be a need to patch so
	// the apply will execute the jobPatcher, which will delete the install/upgrade job and recreate a uninstall job
	err = generic.ConfigureApplyForObject(c.apply, chart, &generic.GeneratingHandlerOptions{
		AllowClusterScoped: true,
	}).
		WithOwner(chart).
		WithSetID("helm-chart-registration").
		ApplyObjects(append(objs, expectedJob)...)
...

	// sleep for 3 seconds to give the job time to perform the helm install
	// before emitting any errors
	time.Sleep(3 * time.Second)

...

-> (B) check job finished or not, and update object

	// the first time we call this, the job will definitely not be complete... however,
	// throwing an error here will re-enqueue this controller, which will process the apply again
	// and get the job object from the cache to check again
	if job.Status.Succeeded <= 0 {
		// temporarily recreate the chart, but keep the deletion timestamp
		chartCopy := chart.DeepCopy()
		chartCopy.Status.JobName = job.Name
		newChart, err := c.helms.Update(chartCopy)
		if err != nil {
			return chart, fmt.Errorf("unable to update status of helm chart to add uninstall job name %s", chartCopy.Status.JobName)
		}
		return newChart, fmt.Errorf("waiting for delete of helm chart for %s by %s", key, job.Name)
	}

-> (C) log event

	// uninstall job has successfully finished!
	c.recorder.Eventf(chart, corev1.EventTypeNormal, "RemoveJob", "Uninstalled HelmChart using Job %s/%s, removing resources", job.Namespace, job.Name)

-> (D) clean object, including the delete job
	// note: an empty apply removes all resources owned by this chart
...

	return chart, nil
}

K8s takes time to delete an object. In above, the OnRemove may be called several times (typically 2 times). In the first call, it will create a job, remove resource, wait &/ log event, clean the job. If it is called again, it will repeat those steps. If it hit step (B) and returns on error, then a helm-delete job is left, and it will affect later helm delete. If in other steps, then lucky.

In my local environment (Harvester cluster), I did see two times of helm-delete object creating. The second helm-delete job just was left orphaned, as helm-chart object was gone, and wrangler framework will not call OnRemove for this helm-chart object again.

Apr 14 20:27:08 harv2 rke2[1997]: time="2024-04-14T20:27:08Z" level=error msg="error syncing 'cattle-logging-system/rancher-logging': handler on-helm-chart-remove: unable to update status of helm chart to add uninstall job name helm-delete-rancher-logging, requeuing"

It can be assumed that this is a sequence issue.

  1. Workaround and solution.

In Harvester, we added a workaroud , #177 (comment), to delete the potential previous job explicitly.

The original helm-controller is a light-weight & handy tool to deploy helm chart, and very rarely the helm-chart will be deleted. Or there are long interval between creating and deleting of helm-chart object, the left helm-delete job will be removed automatically.

But when it is used to handle the case of frequent creating/removing helm-chart, some enhancements are required.

@up-wei
Copy link

up-wei commented Apr 15, 2024

@w13915984028 Thank you for your replay.

In Harvester, we added a workaroud , #177 (comment), to delete the potential previous job explicitly

it is useful to me.

There is another approach to solve this issue . First, disable the Helm controller option in k3s service. Then, run the Helm controller as a separate process.
The k3s service as follow:

[Unit]
Description=Lightweight Kubernetes
Documentation=https://k3s.io
Wants=network-online.target
After=network-online.target

[Install]
WantedBy=multi-user.target

[Service]
Type=notify
EnvironmentFile=-/etc/default/%N
EnvironmentFile=-/etc/sysconfig/%N
EnvironmentFile=-/usr/lib/systemd/system/k3s.service.env
KillMode=process
Delegate=yes
# Having non-zero Limit*s causes performance problems due to accounting overhead
# in the kernel. We recommend using cgroups to do container-local accounting.
LimitNOFILE=1048576
LimitNPROC=infinity
LimitCORE=infinity
TasksMax=infinity
TimeoutStartSec=0
Restart=always
RestartSec=5s
ExecStartPre=/bin/sh -xc '! /usr/bin/systemctl is-enabled --quiet nm-cloud-setup.service 2>/dev/null'
ExecStartPre=-/sbin/modprobe br_netfilter
ExecStartPre=-/sbin/modprobe overlay
ExecStart=/usr/bin/k3s \
    server \
    '--disable-helm-controller' \

The command for executing helm-controller is as follows. I download the pre-compiled binary for helm-controller and ensure that it is compatible with the K3s version.

./bin/helm-controller --kubeconfig /etc/rancher/k3s/k3s.yaml

@w13915984028
Copy link

w13915984028 commented Apr 15, 2024

@up-wei Change the system service to k8s pod/ independent process, may affect the required running time of the program, and hence jump the bug. But I suspect it can work in all cases.

I will add a PR, which will compare the creation timestamp of the job and the deletion timestamp of the helm-chart to avoid picking those old jobs.

@w13915984028
Copy link

w13915984028 commented Apr 19, 2024

I added some debug information and get below logs:

(1) The debug code w13915984028@c0844ed

We did several times of test, the helm-controller logs are:

(1) enable & disable rancher-logging (backed by helmchart); normal

Apr 19 11:23:44 harv41 rke2[29380]: time="2024-04-19T11:23:44Z" level=info msg="Applying HelmChart 1 using Job cattle-logging-system/helm-install-rancher-logging"
Apr 19 11:23:44 harv41 rke2[29380]: time="2024-04-19T11:23:44Z" level=error msg="error syncing 'cattle-logging-system/rancher-logging': handler helm-controller-chart-registration: helmcharts.helm.cattle.io \"rancher-logging\" not found, requeuing"
Apr 19 11:23:44 harv41 rke2[29380]: time="2024-04-19T11:23:44Z" level=info msg="Applying HelmChart 1 using Job cattle-logging-system/helm-install-rancher-logging"
Apr 19 11:23:44 harv41 rke2[29380]: time="2024-04-19T11:23:44Z" level=info msg="Applying HelmChart 1 using Job cattle-logging-system/helm-install-rancher-logging"
Apr 19 11:25:08 harv41 rke2[29380]: time="2024-04-19T11:25:08Z" level=info msg="OnRemove will try remove job cattle-logging-system helm-delete-rancher-logging"
Apr 19 11:25:08 harv41 rke2[29380]: time="2024-04-19T11:25:08Z" level=info msg="Get job 1, no job"
Apr 19 11:25:11 harv41 rke2[29380]: time="2024-04-19T11:25:11Z" level=info msg="Get job 2, the creationTimeStamp is 2024-04-19 11:25:08 +0000 UTC"
Apr 19 11:25:11 harv41 rke2[29380]: time="2024-04-19T11:25:11Z" level=error msg="error syncing 'cattle-logging-system/rancher-logging': handler on-helm-chart-remove: waiting for delete of helm chart for cattle-logging-system/rancher-logging by helm-delete-rancher-logging, requeuing"
Apr 19 11:25:11 harv41 rke2[29380]: time="2024-04-19T11:25:11Z" level=info msg="OnRemove will try remove job cattle-logging-system helm-delete-rancher-logging"
Apr 19 11:25:11 harv41 rke2[29380]: time="2024-04-19T11:25:11Z" level=info msg="Get job 1, the creationTimeStamp is 2024-04-19 11:25:08 +0000 UTC"
Apr 19 11:25:14 harv41 rke2[29380]: time="2024-04-19T11:25:14Z" level=info msg="Get job 2, the creationTimeStamp is 2024-04-19 11:25:08 +0000 UTC"

(2) enable & disable again; a helm-delete job was re-created and left orphaned

Apr 19 11:26:52 harv41 rke2[29380]: time="2024-04-19T11:26:52Z" level=warning msg="Proxy error: write failed: write tcp 127.0.0.1:9345->127.0.0.1:56440: write: connection reset by peer"
Apr 19 11:26:55 harv41 rke2[29380]: time="2024-04-19T11:26:55Z" level=info msg="Applying HelmChart 1 using Job cattle-logging-system/helm-install-rancher-logging"
Apr 19 11:26:55 harv41 rke2[29380]: time="2024-04-19T11:26:55Z" level=error msg="error syncing 'cattle-logging-system/rancher-logging': handler helm-controller-chart-registration: helmcharts.helm.cattle.io \"rancher-logging\" not found, requeuing"
Apr 19 11:26:55 harv41 rke2[29380]: time="2024-04-19T11:26:55Z" level=info msg="Applying HelmChart 1 using Job cattle-logging-system/helm-install-rancher-logging"
Apr 19 11:26:55 harv41 rke2[29380]: time="2024-04-19T11:26:55Z" level=info msg="Applying HelmChart 1 using Job cattle-logging-system/helm-install-rancher-logging"
Apr 19 11:26:55 harv41 rke2[29380]: time="2024-04-19T11:26:55Z" level=info msg="Applying HelmChart 1 using Job cattle-logging-system/helm-install-rancher-logging"

// OnRemove
Apr 19 11:27:41 harv41 rke2[29380]: time="2024-04-19T11:27:41Z" level=info msg="OnRemove will try remove job cattle-logging-system helm-delete-rancher-logging"
Apr 19 11:27:41 harv41 rke2[29380]: time="2024-04-19T11:27:41Z" level=info msg="Get job 1, no job"

// helm-delete job is generated by generic.ConfigureApplyForObject
Apr 19 11:27:44 harv41 rke2[29380]: time="2024-04-19T11:27:44Z" level=info msg="Get job 2, the creationTimeStamp is 2024-04-19 11:27:41 +0000 UTC"

// wait job to finish, till next round
Apr 19 11:27:44 harv41 rke2[29380]: time="2024-04-19T11:27:44Z" level=error msg="error syncing 'cattle-logging-system/rancher-logging': handler on-helm-chart-remove: waiting for delete of helm chart for cattle-logging-system/rancher-logging by helm-delete-rancher-logging, requeuing"

// OnRemove again
Apr 19 11:27:44 harv41 rke2[29380]: time="2024-04-19T11:27:44Z" level=info msg="OnRemove will try remove job cattle-logging-system helm-delete-rancher-logging"
Apr 19 11:27:44 harv41 rke2[29380]: time="2024-04-19T11:27:44Z" level=info msg="Get job 1, the creationTimeStamp is 2024-04-19 11:27:41 +0000 UTC"
Apr 19 11:27:47 harv41 rke2[29380]: time="2024-04-19T11:27:47Z" level=info msg="Get job 2, the creationTimeStamp is 2024-04-19 11:27:41 +0000 UTC"
...
// job is done, and log event ....

// due to the slow-deletion of k8s, the `OnRemove` was called again
Apr 19 11:27:47 harv41 rke2[29380]: time="2024-04-19T11:27:47Z" level=info msg="OnRemove will try remove job cattle-logging-system helm-delete-rancher-logging"

// above job was gone
Apr 19 11:27:47 harv41 rke2[29380]: time="2024-04-19T11:27:47Z" level=info msg="Get job 1, no job"

// create a new job
Apr 19 11:27:50 harv41 rke2[29380]: time="2024-04-19T11:27:50Z" level=info msg="Get job 2, the creationTimeStamp is 2024-04-19 11:27:47 +0000 UTC"

// update to `chart` object failed, as `chart` object was gone on apiserver, and this job is left orphaned
Apr 19 11:27:50 harv41 rke2[29380]: time="2024-04-19T11:27:50Z" level=error msg="error syncing 'cattle-logging-system/rancher-logging': handler on-helm-chart-remove: unable to update status of helm chart to add uninstall job name helm-delete-rancher-logging, requeuing"

Apr 19 11:37:22 harv41 rke2[29380]: time="2024-04-19T11:37:22Z" level=warning msg="Proxy error: write failed: write tcp 127.0.0.1:9345->127.0.0.1:55328: write: connection reset by peer"

(3) enable & disable again, the old helm-delete job was picked, and no true deleting

Apr 19 11:40:14 harv41 rke2[29380]: time="2024-04-19T11:40:14Z" level=info msg="Applying HelmChart 1 using Job cattle-logging-system/helm-install-rancher-logging"
Apr 19 11:40:14 harv41 rke2[29380]: time="2024-04-19T11:40:14Z" level=error msg="error syncing 'cattle-logging-system/rancher-logging': handler helm-controller-chart-registration: helmcharts.helm.cattle.io \"rancher-logging\" not found, requeuing"
Apr 19 11:40:14 harv41 rke2[29380]: time="2024-04-19T11:40:14Z" level=info msg="Applying HelmChart 1 using Job cattle-logging-system/helm-install-rancher-logging"
Apr 19 11:40:14 harv41 rke2[29380]: time="2024-04-19T11:40:14Z" level=info msg="Applying HelmChart 1 using Job cattle-logging-system/helm-install-rancher-logging"

// OnRemove
Apr 19 11:41:18 harv41 rke2[29380]: time="2024-04-19T11:41:18Z" level=info msg="OnRemove will try remove job cattle-logging-system helm-delete-rancher-logging"

// the above left job is still here
Apr 19 11:41:18 harv41 rke2[29380]: time="2024-04-19T11:41:18Z" level=info msg="Get job 1, the creationTimeStamp is 2024-04-19 11:27:47 +0000 UTC"

// generic.ConfigureApplyForObject cann't identify that this is an old job and a new one is needed
Apr 19 11:41:21 harv41 rke2[29380]: time="2024-04-19T11:41:21Z" level=info msg="Get job 2, the creationTimeStamp is 2024-04-19 11:27:47 +0000 UTC"

helmchart object is gone, but real resources are left ...

@w13915984028
Copy link

w13915984028 commented Apr 19, 2024

Events, 3 times of ApplyJob and RemoveJob



harv41:/home/rancher # kk get events -A | grep ApplyJob
cattle-logging-system   31m         Normal    ApplyJob                         helmchart/rancher-logging                                     Applying HelmChart 1 using Job cattle-logging-system/helm-install-rancher-logging

cattle-logging-system   28m         Normal    ApplyJob                         helmchart/rancher-logging                                     Applying HelmChart 1 using Job cattle-logging-system/helm-install-rancher-logging

cattle-logging-system   15m         Normal    ApplyJob                         helmchart/rancher-logging                                     Applying HelmChart 1 using Job cattle-logging-system/helm-install-rancher-logging


harv41:/home/rancher # kk get events -A | grep RemoveJob
cattle-logging-system   30m         Normal    RemoveJob                        helmchart/rancher-logging                                     Uninstalled HelmChart 1 using Job cattle-logging-system/helm-delete-rancher-logging, removing resources, CreationTimestamp 2024-04-19 11:25:08 +0000 UTC

cattle-logging-system   28m         Normal    RemoveJob                        helmchart/rancher-logging                                     Uninstalled HelmChart 1 using Job cattle-logging-system/helm-delete-rancher-logging, removing resources, CreationTimestamp 2024-04-19 11:27:41 +0000 UTC

cattle-logging-system   14m         Normal    RemoveJob                        helmchart/rancher-logging                                     Uninstalled HelmChart 1 using Job cattle-logging-system/helm-delete-rancher-logging, removing resources, CreationTimestamp 2024-04-19 11:27:47 +0000 UTC

The last event, it shows clearly that the old job was picked and used

- apiVersion: v1
  count: 1
  eventTime: null
  firstTimestamp: "2024-04-19T11:41:21Z"
  involvedObject:
    apiVersion: helm.cattle.io/v1
    kind: HelmChart
    name: rancher-logging
    namespace: cattle-logging-system
    resourceVersion: "921436"
    uid: 86b7dbc0-23b4-400f-ac49-6ff6a5035af6
  kind: Event
  lastTimestamp: "2024-04-19T11:41:21Z"
  message: Uninstalled HelmChart 1 using Job cattle-logging-system/helm-delete-rancher-logging,
    removing resources, CreationTimestamp 2024-04-19 11:27:47 +0000 UTC  
    // the job CreationTimestamp 2024-04-19 11:27:47 shows this job is old, it is far before the event firstTimestamp: "2024-04-19T11:41:21Z"
  metadata:
    creationTimestamp: "2024-04-19T11:41:21Z"
    name: rancher-logging.17c7ab9eac0ea147
    namespace: cattle-logging-system
    resourceVersion: "921488"
    uid: 925e455a-b5f4-40c6-8ff0-11005ae6db07
  reason: RemoveJob
  reportingComponent: helm-controller
  reportingInstance: harv41
  source:
    component: helm-controller
    host: harv41
  type: Normal

The PODs are left:

harv41:/home/rancher # kk get pods -n cattle-logging-system
NAME                                                      READY   STATUS      RESTARTS   AGE
harvester-default-event-tailer-0                          1/1     Running     0          17m
rancher-logging-77f858bbb7-n7vlm                          1/1     Running     0          17m
rancher-logging-kube-audit-fluentbit-dlf52                1/1     Running     0          17m
rancher-logging-kube-audit-fluentd-0                      2/2     Running     0          17m
rancher-logging-kube-audit-fluentd-configcheck-ac2d4553   0/1     Completed   0          17m
rancher-logging-rke2-journald-aggregator-499bk            1/1     Running     0          17m
rancher-logging-root-fluentbit-dvvcj                      1/1     Running     0          17m
rancher-logging-root-fluentd-0                            2/2     Running     0          17m
rancher-logging-root-fluentd-configcheck-ac2d4553         0/1     Completed   0          17m


harv41:/home/rancher # kk get helmchart -A
NAMESPACE     NAME                  JOB                                CHART   TARGETNAMESPACE   VERSION   REPO   HELMVERSION   BOOTSTRAP
kube-system   rke2-canal            helm-install-rke2-canal                                                                     true
kube-system   rke2-coredns          helm-install-rke2-coredns                                                                   true
kube-system   rke2-ingress-nginx    helm-install-rke2-ingress-nginx                                                             false
kube-system   rke2-metrics-server   helm-install-rke2-metrics-server                                                            false
kube-system   rke2-multus           helm-install-rke2-multus                                                                    true
harv41:/home/rancher # 

enable rancher-logging:

harv41:/home/rancher # enable-rl
addon.harvesterhci.io/rancher-logging patched

harv41:/home/rancher # kk get helmchart -A
NAMESPACE               NAME                  JOB                                CHART             TARGETNAMESPACE   VERSION             REPO                                                     HELMVERSION   BOOTSTRAP
cattle-logging-system   rancher-logging       helm-install-rancher-logging       rancher-logging                     103.0.0+up3.17.10   http://harvester-cluster-repo.cattle-system.svc/charts                 
kube-system             rke2-canal            helm-install-rke2-canal                                                                                                                                           true
kube-system             rke2-coredns          helm-install-rke2-coredns                                                                                                                                         true
kube-system             rke2-ingress-nginx    helm-install-rke2-ingress-nginx                                                                                                                                   false
kube-system             rke2-metrics-server   helm-install-rke2-metrics-server                                                                                                                                  false
kube-system             rke2-multus           helm-install-rke2-multus                                                                                                                                          true
harv41:/home/rancher # 

@w13915984028
Copy link

The return error (with added debug) in step below is:

return chart, fmt.Errorf("unable to update status of helm chart to add uninstall job name %s", chartCopy.Status.JobName)

Apr 19 17:31:16 harv41 rke2[30786]: time="2024-04-19T17:31:16Z" level=error msg="error syncing 'cattle-logging-system/rancher-logging': handler on-helm-chart-remove: unable to update status of helm chart to add uninstall job name helm-delete-rancher-logging, err Operation cannot be fulfilled on helmcharts.helm.cattle.io "rancher-logging": StorageError: invalid object, Code: 4, Key: /registry/helm.cattle.io/helmcharts/cattle-logging-system/rancher-logging, ResourceVersion: 0, AdditionalErrorMsg: Precondition failed: UID in precondition: 78e8d91d-430e-4c7d-a3f9-156c461eedab, UID in object meta: , requeuing"

It meets our analysis, the helm-chart object in apiserver is gone, but the controller is still trying to update it with the re-created job.

https://pkg.go.dev/k8s.io/apiserver/pkg/storage#StorageError

@brandond
Copy link
Member

That matches the comment on line 267:

if job.Status.Succeeded <= 0 {
// temporarily recreate the chart, but keep the deletion timestamp
chartCopy := chart.DeepCopy()
chartCopy.Status.JobName = job.Name
newChart, err := c.helms.Update(chartCopy)

We know that the HelmChart resource has been deleted, but we're trying to update it anyway. We can't update it if it doesn't exist though, so the comment about calling update to "temporarily recreate the chart" doesn't make sense; thats not something you can do.

This seems to have come from #158 - @aiyengar2 can you take a look and make sure we're not missing anything?

@w13915984028
Copy link

w13915984028 commented Apr 19, 2024

My doubt is:

The current OnRemove is as following, it doesn't know if a helm-delete job has already been created and done, it does not use a state to record & control it; instead, it simply loop

This, by nature, will generate orphaned job possibly.

OnRemove
{
  create job if not existing; also not checking  chartCopy.Status.JobName (means already has job ?)
  sleep 3s
  job not succes? update chart.Status.JobName (does not check if it is already there), return error to trigger next round
  job success? emit event, clean resource (including the job)
}


Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants