Extend replicaStatus object with information about evicted pods caused by exceeding emptyDir sizeLimit #621

Open
nilsgstrabo opened this issue Apr 23, 2024 · 1 comment
Labels
🤔 refinement needed This needs more details

Comments

@nilsgstrabo
Contributor

nilsgstrabo commented Apr 23, 2024

Related to: "use ReadOnlyFileSystem"

If usage of an emptyDir volume exceeds its sizeLimit, Kubernetes evicts the Pod: the container is forcefully killed and the pod phase is set to Failed (with reason Evicted). Kubernetes then creates a new Pod to run the container instead of restarting it in the existing Pod.

We should include information about these events (stored in pod.status) in the replicaList returned by radix-api.
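A minimal sketch of what such eviction information could contain, assuming pods are available as plain dicts shaped like `kubectl get pod -o json` output (the same fields as the `status` example in this issue). The `eviction_info` function and the keys it returns are hypothetical and are not part of the radix-api replicaStatus model.

```python
from typing import Optional


def eviction_info(pod: dict) -> Optional[dict]:
    """Summarise why a pod was evicted, or return None if it was not.

    Hypothetical helper: the returned keys are illustrative, not radix-api fields.
    """
    status = pod.get("status", {})
    # Evicted pods end up with phase Failed and reason Evicted (see the example status).
    if status.get("phase") != "Failed" or status.get("reason") != "Evicted":
        return None
    # Use the first terminated container for the exit code and termination time.
    exit_code = None
    finished_at = None
    for container in status.get("containerStatuses", []):
        terminated = container.get("state", {}).get("terminated")
        if terminated:
            exit_code = terminated.get("exitCode")
            finished_at = terminated.get("finishedAt")
            break
    return {
        "reason": status["reason"],
        # For emptyDir violations the message names the volume and its limit.
        "message": status.get("message", "").strip(),
        "exitCode": exit_code,
        "finishedAt": finished_at,
    }
```

Applied to the example status in this issue, this yields reason `Evicted`, exit code 137 and the emptyDir message, which is roughly what a user would need to see in the replica list.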

Another issue is that these failed pods interfere with the calculation of the component status. A pod in the Failed phase due to an emptyDir violation causes the component status to be Reconciling. I'm not sure what the status should be: the user should be able to easily see that there are issues, but I feel that Reconciling is wrong. A component can be in one of the following statuses: "Stopped", "Consistent", "Reconciling", "Restarting", "Outdated", and I'm not sure any of them fits this situation.
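One possible direction, sketched under assumptions: evicted pods are left out when deciding whether a component is still reconciling, and are reported as a separate count so the UI can flag them. The rule below (any non-evicted pod that is not Ready means `Reconciling`) and the `component_status` name are hypothetical simplifications, not radix-operator's actual status logic, and the sketch ignores the `Stopped`, `Restarting` and `Outdated` cases.

```python
from typing import List, Tuple


def component_status(pods: List[dict]) -> Tuple[str, int]:
    """Return (status, evicted_count) for the pods of one component.

    Hypothetical simplification: only chooses between Consistent and Reconciling.
    """
    evicted = 0
    reconciling = False
    for pod in pods:
        status = pod.get("status", {})
        if status.get("phase") == "Failed" and status.get("reason") == "Evicted":
            # Count evicted pods separately instead of treating them as a rollout in progress.
            evicted += 1
            continue
        ready = any(
            cond.get("type") == "Ready" and cond.get("status") == "True"
            for cond in status.get("conditions", [])
        )
        if not ready:
            reconciling = True
    return ("Reconciling" if reconciling else "Consistent"), evicted
```

Keeping the evicted count separate from the status value avoids overloading an existing status such as Reconciling, while still giving the user a visible signal.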

Also, it would be useful for the user to be able to clean up (delete?) Pods in the Failed state.
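Kubernetes can already select such pods: Pods support the field selector `status.phase=Failed`, so `kubectl delete pods --field-selector=status.phase=Failed -n <namespace>` removes them, and client-go accepts the same selector in `Pods(namespace).DeleteCollection`. If radix-api offered a cleanup action per component, picking the pods could be sketched as follows; the `failed_pod_names` helper and the label key passed to it are hypothetical.

```python
from typing import List


def failed_pod_names(pods: List[dict], label_key: str, component: str) -> List[str]:
    """Names of Failed pods labelled label_key=component (hypothetical helper)."""
    names = []
    for pod in pods:
        meta = pod.get("metadata", {})
        # Only consider pods that belong to the requested component.
        if meta.get("labels", {}).get(label_key) != component:
            continue
        if pod.get("status", {}).get("phase") == "Failed":
            names.append(meta["name"])
    return names
```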

An example status of a failed Pod:

status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2024-04-23T11:26:40Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2024-04-23T11:28:42Z"
    reason: PodFailed
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2024-04-23T11:28:42Z"
    reason: PodFailed
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2024-04-23T11:26:40Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: containerd://ae1a03c55355b5f5bc29b6c6990727ee87a6d8b845e2d8a795487d78bc193712
    image: radixdev.azurecr.io/oauth-demo-dev-simple:e2pmj
    imageID: radixdev.azurecr.io/oauth-demo-dev-simple@sha256:ce0827dd93e2dc2d96ac7a941cc2bbf9687ec5e93a8c8218a1d90784c8484859
    lastState: {}
    name: simple
    ready: false
    restartCount: 0
    started: false
    state:
      terminated:
        containerID: containerd://ae1a03c55355b5f5bc29b6c6990727ee87a6d8b845e2d8a795487d78bc193712
        exitCode: 137
        finishedAt: "2024-04-23T11:28:42Z"
        reason: Error
        startedAt: "2024-04-23T11:26:41Z"
  hostIP: 10.5.3.108
  message: 'Usage of EmptyDir volume "radix-vm-tmp" exceeds the limit "5M". '
  phase: Failed
  podIP: 10.5.3.130
  podIPs:
  - ip: 10.5.3.130
  qosClass: Burstable
  reason: Evicted
  startTime: "2024-04-23T11:26:40Z"
@nilsgstrabo nilsgstrabo added the 🤔 refinement needed This needs more details label Apr 23, 2024
@emirgens
Contributor

emirgens commented Sep 10, 2024

Investigate whether a Pod retention period can be used:
https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/
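For reference, the linked page describes Pod garbage collection: failed Pods stay in the API until a person or controller deletes them, and the PodGC controller only removes terminated Pods once their number exceeds the cluster-wide kube-controller-manager `--terminated-pod-gc-threshold`. If a per-component limit is wanted instead, it could be sketched as below, assuming a hypothetical `keep` count of failed pods to retain and ordering by `metadata.creationTimestamp`.

```python
from datetime import datetime
from typing import List


def _created(pod: dict) -> datetime:
    # Kubernetes serialises creationTimestamp as RFC 3339 in UTC, e.g. "2024-04-23T11:26:40Z".
    return datetime.strptime(pod["metadata"]["creationTimestamp"], "%Y-%m-%dT%H:%M:%SZ")


def pods_to_delete(pods: List[dict], keep: int) -> List[str]:
    """Keep the `keep` newest Failed pods and return the names of the older ones.

    Hypothetical retention policy, not an existing Radix or Kubernetes setting.
    """
    failed = [p for p in pods if p.get("status", {}).get("phase") == "Failed"]
    failed.sort(key=_created, reverse=True)  # newest first
    return [p["metadata"]["name"] for p in failed[max(keep, 0):]]
```

Retaining a few recent failed pods would keep their eviction details visible to the user while preventing them from accumulating indefinitely.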

@emirgens emirgens added 🤔 refinement needed This needs more details and removed 🤔 refinement needed This needs more details labels Sep 10, 2024