Error on job with multiple pods #193

Open
ChevronTango opened this issue Nov 25, 2022 · 1 comment

ChevronTango commented Nov 25, 2022

Whilst experimenting with kube-fledged we added and destroyed a number of nodes in our test cluster, and a handful of pods got stuck. Kube-fledged logged an error reporting that a job had multiple pods in it, and then never ran again. Whether this is a fluke of our setup or not I couldn't say.

```
I1125 08:32:14.739894 1 image_manager.go:472] Job my-cache-tld8k created (pull:- my-registry.com/my-image:latest --> ip-10-1-7-224, runtime: containerd://1.5.8)
E1125 08:37:14.777438 1 image_manager.go:241] More than one pod matched job my-cache-tld8k
E1125 08:37:14.778075 1 image_manager.go:324] Error from updatePendingImageWorkResults(): more than one pod matched job my-cache-tld8k
```

Those are the last logs the controller ever prints out.

Does the controller have a liveness check that could detect this kind of crash and restart? Can the controller also handle jobs that have multiple pods, some stuck in terminating, and others stuck waiting for a node that has been destroyed? The latter is quite likely to occur in environments with frequent scale up and down. Would the controller be able to clear up all the jobs on a restart or would there be jobs left in the cluster forever?

Very much liking the app, so keen to help improve it for some of the above scenarios.

senthilrch (Owner) commented

@ChevronTango: Thank you for reporting this issue.

The controller works in this fashion: there is a master routine and an image manager routine, and the two communicate through work queues. The master places image pull/delete requests on one queue, and the image manager places image pull/delete responses on another.
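
To illustrate the pattern, here is a minimal sketch using plain Go channels instead of the actual work-queue implementation (the type and field names below are illustrative, not the real kube-fledged types):

```go
package main

import "fmt"

// WorkRequest and WorkResponse are illustrative stand-ins for the
// image pull/delete request and response types.
type WorkRequest struct {
	JobName string
	Image   string
}

type WorkResponse struct {
	JobName string
	Err     error
}

// imageManager consumes requests and always sends back a response,
// so the master never blocks forever waiting on a result.
func imageManager(requests <-chan WorkRequest, responses chan<- WorkResponse) {
	for req := range requests {
		// ... create the Job, watch its pod, pull the image ...
		responses <- WorkResponse{JobName: req.JobName, Err: nil}
	}
	close(responses)
}

func main() {
	requests := make(chan WorkRequest)
	responses := make(chan WorkResponse)

	go imageManager(requests, responses)

	// The master places a request on one queue and waits for the
	// response on the other.
	go func() {
		requests <- WorkRequest{JobName: "my-cache-tld8k", Image: "my-registry.com/my-image:latest"}
		close(requests)
	}()

	for resp := range responses {
		fmt.Printf("job %s finished, err=%v\n", resp.JobName, resp.Err)
	}
}
```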

If the image manager encounters a situation where a Job happens to have multiple pods, it treats it as an error and stops further processing without sending back a response; hence the controller gets stuck waiting for the response from the image manager. I'll modify the logic in the image manager to log this error, continue processing, and finally send a response back.
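
Roughly the behaviour I have in mind, as a sketch only (the function name, the use of klog, and the choice to continue with the most recent pod are illustrative, not the actual image_manager.go change):

```go
package imagemanager

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/klog/v2"
)

// pickJobPod sketches the changed behaviour: when more than one pod matches
// the job, log the condition and continue with the most recently created pod,
// instead of aborting without sending a response back to the master.
func pickJobPod(jobName string, pods []corev1.Pod) (*corev1.Pod, error) {
	if len(pods) == 0 {
		return nil, fmt.Errorf("no pod matched job %s", jobName)
	}
	pod := &pods[0]
	if len(pods) > 1 {
		klog.Errorf("More than one pod matched job %s; continuing with the most recent pod", jobName)
		for i := 1; i < len(pods); i++ {
			if pods[i].CreationTimestamp.After(pod.CreationTimestamp.Time) {
				pod = &pods[i]
			}
		}
	}
	return pod, nil
}
```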

Yes, when the controller restarts, any dangling image pull/delete jobs are deleted. Also, when it sees an ImageCache in processing status, the status is reset as well. So overall the controller is fairly resilient in this respect.
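
For illustration, the startup cleanup could look roughly like this sketch using client-go (the kube-fledged namespace and the app=kubefledged label selector are assumptions, not the actual values the controller uses):

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// cleanupDanglingJobs deletes any leftover image pull/delete jobs from a
// previous run, identified here by a hypothetical label selector.
func cleanupDanglingJobs(ctx context.Context, client kubernetes.Interface, namespace string) error {
	policy := metav1.DeletePropagationBackground
	return client.BatchV1().Jobs(namespace).DeleteCollection(ctx,
		metav1.DeleteOptions{PropagationPolicy: &policy},
		metav1.ListOptions{LabelSelector: "app=kubefledged"}, // hypothetical label
	)
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	if err := cleanupDanglingJobs(context.Background(), client, "kube-fledged"); err != nil {
		fmt.Println("cleanup failed:", err)
	}
}
```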

You made a good point about the liveness/readiness probe. Adding liveness/readiness probes is a good idea to improve the robustness and observability of kube-fledged.
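
One possible shape for this is a simple /healthz HTTP endpoint in the controller that a livenessProbe in the Deployment could point at; this is only a sketch, and the port and path are illustrative:

```go
package main

import (
	"net/http"
	"time"
)

// startHealthServer exposes a /healthz endpoint. A livenessProbe in the
// Deployment would point at this port and path, so Kubernetes restarts the
// controller if it stops responding.
func startHealthServer(addr string) *http.Server {
	mux := http.NewServeMux()
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
		w.Write([]byte("ok"))
	})
	srv := &http.Server{Addr: addr, Handler: mux, ReadHeaderTimeout: 5 * time.Second}
	go func() { _ = srv.ListenAndServe() }()
	return srv
}

func main() {
	_ = startHealthServer(":8080") // port is illustrative
	select {}                      // stand-in for the controller's main loop
}
```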

@senthilrch senthilrch self-assigned this Mar 10, 2023
@senthilrch senthilrch added the bug Something isn't working label Mar 10, 2023
@senthilrch senthilrch added this to the v0.11.0 milestone Mar 10, 2023
senthilrch added a commit that referenced this issue Mar 10, 2023
@senthilrch senthilrch added the done Code pushed to develop branch label Mar 10, 2023
senthilrch added a commit that referenced this issue Mar 10, 2023