Error on job with multiple pods #193

Open
ChevronTango opened this issue Nov 25, 2022 · 1 comment

ChevronTango commented Nov 25, 2022

Whilst experimenting with kube-fledged we added and destroyed a number of nodes in our test cluster, and a handful of pods got stuck. Kube-fledged logged an error reporting that a job had multiple pods in it, and then never ran again. Whether this is a fluke of our setup or not I couldn't say.

```
I1125 08:32:14.739894 1 image_manager.go:472] Job my-cache-tld8k created (pull:- my-registry.com/my-image:latest --> ip-10-1-7-224, runtime: containerd://1.5.8)
E1125 08:37:14.777438 1 image_manager.go:241] More than one pod matched job my-cache-tld8k
E1125 08:37:14.778075 1 image_manager.go:324] Error from updatePendingImageWorkResults(): more than one pod matched job my-cache-tld8k
```

Those are the last logs the controller ever prints out.

Does the controller have a liveness check that could detect this kind of crash and restart? Can the controller also handle jobs that have multiple pods, some stuck in terminating, and others stuck waiting for a node that has been destroyed? The latter is quite likely to occur in environments with frequent scale up and down. Would the controller be able to clear up all the jobs on a restart or would there be jobs left in the cluster forever?

Very much liking the app, so keen to help improve it for some of the above scenarios.

senthilrch (Owner) commented

@ChevronTango: Thank you for reporting this issue.

The controller works in this fashion: there is a master routine and an image manager routine, and the two communicate through work queues. The master places image pull/delete requests on one queue, and the image manager places image pull/delete responses on another.
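
To illustrate the pattern, here is a minimal sketch using plain Go channels instead of the actual work-queue implementation (the type and field names below are illustrative, not the real kube-fledged types):

```go
package main

import "fmt"

// WorkRequest and WorkResponse are illustrative stand-ins for the
// image pull/delete request and response types.
type WorkRequest struct {
	JobName string
	Image   string
}

type WorkResponse struct {
	JobName string
	Err     error
}

// imageManager consumes requests and always sends back a response,
// so the master never blocks forever waiting on a result.
func imageManager(requests <-chan WorkRequest, responses chan<- WorkResponse) {
	for req := range requests {
		// ... create the Job, watch its pod, pull the image ...
		responses <- WorkResponse{JobName: req.JobName, Err: nil}
	}
	close(responses)
}

func main() {
	requests := make(chan WorkRequest)
	responses := make(chan WorkResponse)

	go imageManager(requests, responses)

	// The master places a request on one queue and waits for the
	// response on the other.
	go func() {
		requests <- WorkRequest{JobName: "my-cache-tld8k", Image: "my-registry.com/my-image:latest"}
		close(requests)
	}()

	for resp := range responses {
		fmt.Printf("job %s finished, err=%v\n", resp.JobName, resp.Err)
	}
}
```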

If the image manager encounters a situation where a Job happens to have multiple pods, it treats it as an error and stops further processing without sending back a response; hence the controller gets stuck waiting for the response from the image manager. I'll modify the logic in the image manager to log this error, continue processing, and finally send a response back.
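
Roughly the behaviour I have in mind, as a sketch only (the function name, the use of klog, and the choice to continue with the most recent pod are illustrative, not the actual image_manager.go change):

```go
package imagemanager

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/klog/v2"
)

// pickJobPod sketches the changed behaviour: when more than one pod matches
// the job, log the condition and continue with the most recently created pod,
// instead of aborting without sending a response back to the master.
func pickJobPod(jobName string, pods []corev1.Pod) (*corev1.Pod, error) {
	if len(pods) == 0 {
		return nil, fmt.Errorf("no pod matched job %s", jobName)
	}
	pod := &pods[0]
	if len(pods) > 1 {
		klog.Errorf("More than one pod matched job %s; continuing with the most recent pod", jobName)
		for i := 1; i < len(pods); i++ {
			if pods[i].CreationTimestamp.After(pod.CreationTimestamp.Time) {
				pod = &pods[i]
			}
		}
	}
	return pod, nil
}
```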

Yes, when the controller restarts, any dangling image pull/delete jobs are deleted. Also, when it sees an ImageCache in processing status, the status is reset as well. So overall the controller is fairly resilient in this respect.
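
For illustration, the startup cleanup could look roughly like this sketch using client-go (the kube-fledged namespace and the app=kubefledged label selector are assumptions, not the actual values the controller uses):

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// cleanupDanglingJobs deletes any leftover image pull/delete jobs from a
// previous run, identified here by a hypothetical label selector.
func cleanupDanglingJobs(ctx context.Context, client kubernetes.Interface, namespace string) error {
	policy := metav1.DeletePropagationBackground
	return client.BatchV1().Jobs(namespace).DeleteCollection(ctx,
		metav1.DeleteOptions{PropagationPolicy: &policy},
		metav1.ListOptions{LabelSelector: "app=kubefledged"}, // hypothetical label
	)
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	if err := cleanupDanglingJobs(context.Background(), client, "kube-fledged"); err != nil {
		fmt.Println("cleanup failed:", err)
	}
}
```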

You made a good point about the liveness/readiness probe. Adding liveness/readiness probes is a good idea to improve the robustness and observability of kube-fledged.
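
One possible shape for this is a simple /healthz HTTP endpoint in the controller that a livenessProbe in the Deployment could point at; this is only a sketch, and the port and path are illustrative:

```go
package main

import (
	"net/http"
	"time"
)

// startHealthServer exposes a /healthz endpoint. A livenessProbe in the
// Deployment would point at this port and path, so Kubernetes restarts the
// controller if it stops responding.
func startHealthServer(addr string) *http.Server {
	mux := http.NewServeMux()
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
		w.Write([]byte("ok"))
	})
	srv := &http.Server{Addr: addr, Handler: mux, ReadHeaderTimeout: 5 * time.Second}
	go func() { _ = srv.ListenAndServe() }()
	return srv
}

func main() {
	_ = startHealthServer(":8080") // port is illustrative
	select {}                      // stand-in for the controller's main loop
}
```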

@senthilrch senthilrch self-assigned this Mar 10, 2023
@senthilrch senthilrch added the bug Something isn't working label Mar 10, 2023
@senthilrch senthilrch added this to the v0.11.0 milestone Mar 10, 2023
senthilrch added a commit that referenced this issue Mar 10, 2023
@senthilrch senthilrch added the done Code pushed to develop branch label Mar 10, 2023
senthilrch added a commit that referenced this issue Mar 10, 2023