kubernetes 1.30.5 support #23230

Open
karatkep opened this issue Nov 4, 2024 · 21 comments · May be fixed by eclipse-che/che-dashboard#1283
Labels
area/che-server area/dashboard area/install Issues related to installation, including offline/air gap and initial setup severity/P1 Has a major impact to usage or development of the system. status/analyzing An issue has been proposed and it is currently being analyzed for effort and implementation approach

Comments

@karatkep

karatkep commented Nov 4, 2024

Summary

Dear Community,

Could you please help me verify if Eclipse Che 7.93.0 supports Kubernetes 1.30.5? The che-dashboard and che pods stopped working when our Kubernetes cluster was updated to version 1.30.5.

Here is a sample of the error in the che-dashboard:

ERROR[12:03:22 UTC]: [HTTP request failed[
    err: {
      "type": "le",
      "message": "HTTP request failed",
      "stack":
          HttpError: HTTP request failed
              at q._callback (/backend/server/backend.js:8:898957)
              at t._callback.t.callback.t.callback (/backend/server/backend.js:14:1087840)
              at q.emit (node:events:517:28)
              at q.<anonymous> (/backend/server/backend.js:14:1100418)
              at q.emit (node:events:517:28)
              at IncomingMessage.<anonymous> (/backend/server/backend.js:14:1099250)
              at Object.onceWrapper (node:events:631:28)
              at IncomingMessage.emit (node:events:529:35)
              at endReadableNT (node:internal/streams/readable:1400:12)
              at process.processTicksAndRejections (node:internal/process/task_queues:82:21)
      "response": {
        "statusCode": 401,
        "body": {
          "kind": "Status",
          "apiVersion": "v1",
          "metadata": {},
          "status": "Failure",
          "message": "Unauthorized",
          "reason": "Unauthorized",
          "code": 401
        },
        "headers": {
          "audit-id": "6b14e1b5-8a08-41a8-a093-5e00693737a6",
          "cache-control": "no-cache, private",
          "content-type": "application/json",
          "date": "Mon, 04 Nov 2024 12:03:21 GMT",
          "content-length": "129",
          "connection": "close"
        },
        "request": {
          "uri": {
            "protocol": "https:",
            "slashes": true,
            "auth": null,
            "host": "10.1.0.1:443",
            "port": "443",
            "hostname": "10.1.0.1",
            "hash": null,
            "search": null,
            "query": null,
            "pathname": "/apis/org.eclipse.che/v2/checlusters",
            "path": "/apis/org.eclipse.che/v2/checlusters",
            "href": "https://10.1.0.1:443/apis/org.eclipse.che/v2/checlusters"
          },
          "method": "GET",
          "headers": {
            "Accept": "application/json",
            "Authorization": "Bearer MASKED"
          }
        }
      },
      "body": {
        "type": "Object",
        "message": "Unauthorized",
        "stack":
            
        "kind": "Status",
        "apiVersion": "v1",
        "metadata": {},
        "status": "Failure",
        "reason": "Unauthorized",
        "code": 401
      },
      "statusCode": 401,
      "name": "HttpError"
    }

The same issue affects the che pod. It appears that both lost access to the Kubernetes API after the upgrade to version 1.30.5.

ServiceAccounts, ClusterRoles, and their bindings are in place for both the che-dashboard and che pods.

Relevant information

No response

@karatkep karatkep added the kind/question Questions that haven't been identified as being feature requests or bugs. label Nov 4, 2024
@che-bot che-bot added the status/need-triage An issue that needs to be prioritized by the curator responsible for the triage. See https://github. label Nov 4, 2024
@tolusha
Contributor

tolusha commented Nov 4, 2024

@karatkep
Could you show che pod logs?

I've tried to reproduce on Minikube with Kubernetes 1.31.0, but no luck

@ibuziuk ibuziuk added area/install Issues related to installation, including offline/air gap and initial setup severity/P1 Has a major impact to usage or development of the system. status/analyzing An issue has been proposed and it is currently being analyzed for effort and implementation approach area/dashboard area/che-server and removed kind/question Questions that haven't been identified as being feature requests or bugs. status/need-triage An issue that needs to be prioritized by the curator responsible for the triage. See https://github. labels Nov 5, 2024
@karatkep
Author

karatkep commented Nov 6, 2024

@tolusha
According to the che logs, the che pod starts receiving 401 errors from the kube-api exactly one hour after the pod starts:

06-Nov-2024 08:26:02.136 INFO [main] org.apache.catalina.startup.HostConfig.deployWAR Deployment of web application archive [/home/user/eclipse-che/tomcat/webapps/ROOT.war] has finished in [2,488] ms
06-Nov-2024 08:26:02.138 INFO [main] org.apache.coyote.AbstractProtocol.start Starting ProtocolHandler ["http-nio-8080"]
06-Nov-2024 08:26:02.144 INFO [main] org.apache.catalina.startup.Catalina.start Server startup in [40907] milliseconds
2024-11-06 09:26:32,950[c4d-k5x9l-37628]  [WARN ] [o.j.p.kubernetes.KUBE_PING 115]      - failed getting JSON response from Kubernetes Client[masterUrl=https://10.1.0.1:443/api/v1, headers={Authorization=#MASKED:1868#}, connectTimeout=5000, readTimeout=30000, operationAttempts=3, operationSleep=1000, streamProvider=org.jgroups.protocols.kubernetes.stream.TokenStreamProvider@6c199c1d] for cluster [RemoteSubscriptionChannel], namespace [eclipse-che], labels [app.kubernetes.io/component=che,app.kubernetes.io/instance=che,app.kubernetes.io/managed-by=che-operator,app.kubernetes.io/name=che,app.kubernetes.io/part-of=che.eclipse.org]; encountered [java.lang.Exception: 3 attempt(s) with a 1000ms sleep to execute [OpenStream] failed. Last failure was [java.io.IOException: Server returned HTTP response code: 401 for URL: https://10.1.0.1:443/api/v1/namespaces/eclipse-che/pods?labelSelector=app.kubernetes.io%2Fcomponent%3Dche%2Capp.kubernetes.io%2Finstance%3Dche%2Capp.kubernetes.io%2Fmanaged-by%3Dche-operator%2Capp.kubernetes.io%2Fname%3Dche%2Capp.kubernetes.io%2Fpart-of%3Dche.eclipse.org]]
2024-11-06 09:26:42,473[4c4d-k5x9l-3460]  [WARN ] [o.j.p.kubernetes.KUBE_PING 115]      - failed getting JSON response from Kubernetes Client[masterUrl=https://10.1.0.1:443/api/v1, headers={Authorization=#MASKED:1868#}, connectTimeout=5000, readTimeout=30000, operationAttempts=3, operationSleep=1000, streamProvider=org.jgroups.protocols.kubernetes.stream.TokenStreamProvider@f31944b] for cluster [WorkspaceStateCache], namespace [eclipse-che], labels [app.kubernetes.io/component=che,app.kubernetes.io/instance=che,app.kubernetes.io/managed-by=che-operator,app.kubernetes.io/name=che,app.kubernetes.io/part-of=che.eclipse.org]; encountered [java.lang.Exception: 3 attempt(s) with a 1000ms sleep to execute [OpenStream] failed. Last failure was [java.io.IOException: Server returned HTTP response code: 401 for URL: https://10.1.0.1:443/api/v1/namespaces/eclipse-che/pods?labelSelector=app.kubernetes.io%2Fcomponent%3Dche%2Capp.kubernetes.io%2Finstance%3Dche%2Capp.kubernetes.io%2Fmanaged-by%3Dche-operator%2Capp.kubernetes.io%2Fname%3Dche%2Capp.kubernetes.io%2Fpart-of%3Dche.eclipse.org]]
2024-11-06 09:26:47,468[c4d-k5x9l-46003]  [WARN ] [o.j.p.kubernetes.KUBE_PING 115]      - failed getting JSON response from Kubernetes Client[masterUrl=https://10.1.0.1:443/api/v1, headers={Authorization=#MASKED:1868#}, connectTimeout=5000, readTimeout=30000, operationAttempts=3, operationSleep=1000, streamProvider=org.jgroups.protocols.kubernetes.stream.TokenStreamProvider@5ed91d32] for cluster [WorkspaceLocks], namespace [eclipse-che], labels [app.kubernetes.io/component=che,app.kubernetes.io/instance=che,app.kubernetes.io/managed-by=che-operator,app.kubernetes.io/name=che,app.kubernetes.io/part-of=che.eclipse.org]; encountered [java.lang.Exception: 3 attempt(s) with a 1000ms sleep to execute [OpenStream] failed. Last failure was [java.io.IOException: Server returned HTTP response code: 401 for URL: https://10.1.0.1:443/api/v1/namespaces/eclipse-che/pods?labelSelector=app.kubernetes.io%2Fcomponent%3Dche%2Capp.kubernetes.io%2Finstance%3Dche%2Capp.kubernetes.io%2Fmanaged-by%3Dche-operator%2Capp.kubernetes.io%2Fname%3Dche%2Capp.kubernetes.io%2Fpart-of%3Dche.eclipse.org]]

@karatkep
Author

@tolusha, as I can see, the issue is that the token is not being refreshed. It is generated for 1 hour, and after that time, the che-dashboard continues to use it despite its expiration. Is there any way to prompt the che-dashboard to refresh it before using it for kube-api calls?

@tolusha
Contributor

tolusha commented Nov 12, 2024

@karatkep
Could you share CheCluster CR?
What OIDC provider do you use?

@karatkep
Author

@tolusha,
Yes, of course, I will provide the CheCluster CR. However, I don't think that the issue lies with the CheCluster CR or OIDC. The same version of Eclipse Che 7.93.0 was deployed in two identical AKS clusters (Kubernetes version 1.27.9), and everything was fine until one of the clusters was upgraded to 1.30.5. Immediately after this update, problems with the kube-api started. Reviewing the token used, for example, by the che-dashboard, I see that the expiration field "exp" is always the same and is in the past. From this, I conclude that for Kubernetes version 1.30.5, the token is not being updated.
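The "exp" check described above can be sketched as follows. This is only an illustration, assuming the service account token is a standard three-part JWT; the helper names `token_expiry` and `is_expired` are hypothetical and not part of Che. The signature is not verified, since the goal is only to read the claim.

```python
import base64
import json
import time
from typing import Optional


def token_expiry(jwt: str) -> int:
    """Return the `exp` claim (seconds since the epoch) of a JWT, without verifying it."""
    payload_b64 = jwt.split(".")[1]
    # JWT segments are base64url-encoded without padding; restore the padding first.
    payload_b64 += "=" * (-len(payload_b64) % 4)
    claims = json.loads(base64.urlsafe_b64decode(payload_b64))
    return int(claims["exp"])


def is_expired(jwt: str, now: Optional[float] = None) -> bool:
    """True when the token's `exp` is not later than `now` (defaults to the current time)."""
    current = time.time() if now is None else now
    return token_expiry(jwt) <= current
```

Running `is_expired` on the token a component actually sends, and again on the file in the pod, would show whether the two have diverged.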

@karatkep
Author

karatkep commented Nov 12, 2024

@tolusha, @ibuziuk, we found the root cause of the issue. In Kubernetes 1.27.9, the token (located at /var/run/secrets/kubernetes.io/serviceaccount/token) is issued for one year, although it is refreshed every hour (or, more precisely, every 50 minutes). In Kubernetes 1.30.5, the token is issued for one hour and is also refreshed every 50 minutes. However, Che (che-dashboard, che, and most likely che-gateway) caches this token at startup and keeps using it. Consequently, there is no problem in Kubernetes 1.27.9, because the cached token stays valid for a year, but in Kubernetes 1.30.5 the problem begins one hour after startup, when the cached token expires.
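For illustration, one way to avoid the startup cache is to read the token file each time a request is built, so that the refreshed file is always picked up. The sketch below is not how Che is implemented; `FileTokenSource` is a hypothetical name, and it assumes the token sits at the standard path mentioned above.

```python
from pathlib import Path

# Standard in-pod location of the service account token (from the comment above).
TOKEN_PATH = "/var/run/secrets/kubernetes.io/serviceaccount/token"


class FileTokenSource:
    """Hypothetical helper: re-reads the token on every call instead of caching it."""

    def __init__(self, path: str = TOKEN_PATH):
        self._path = Path(path)

    def token(self) -> str:
        # The file is small and locally mounted, so reading it per request is cheap.
        return self._path.read_text().strip()

    def auth_header(self) -> dict:
        # Build the header from the current file content, never from a stored copy.
        return {"Authorization": f"Bearer {self.token()}"}
```

The trade-off is one small file read per request in exchange for never sending a token older than the file on disk.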

@tolusha
Contributor

tolusha commented Nov 13, 2024

@karatkep
So, if you restart all pods, Che will continue working, right?

@karatkep
Author

@tolusha
Correct, we need to restart the Che pods every hour to ensure they remain operational.

@karatkep
Author

@tolusha, @ibuziuk,
Could you please share information and plans regarding this issue? Is everything clear and understandable? Were you able to reproduce it? Are you currently working on a resolution, or do you have plans to start working on it soon?

Just to be on the same page - there is absolutely no pressure from my side. I just want to understand the current status and plans regarding this issue. On my part, I have already used one of the possible workarounds and written a CronJob that restarts the necessary Che pods. If other Eclipse Che users are facing or will face the same issue, I am more than willing to share this workaround.

@ibuziuk ibuziuk moved this to 📅 Planned in Eclipse Che Team A Backlog Nov 15, 2024
@ibuziuk
Member

ibuziuk commented Nov 15, 2024

@karatkep Thank you for the follow-up and investigation details - #23230 (comment)

I'm still wondering if the token lifetime is configurable on the k8s end in general?
Do you happen to have the link to the Release Notes, docs, or commit where this change with the lifetime was introduced? Could it be some AKS config?

The issue has been planned for the next sprint (Nov 20 - Dec 10). However, so far @tolusha has not been able to reproduce it on vanilla minikube.

@karatkep also contributions from the Community are most welcome if you would like to change or update the caching mechanism in the project ;-)

@karatkep
Author

@ibuziuk,
When I was researching this issue, I came across the documentation at https://kubernetes.io/docs/reference/access-authn-authz/service-accounts-admin/#tokenrequest-api which contains detailed information about configuring token lifetime. Moreover, I conducted an experiment where I disabled the che-operator (so it wouldn’t interfere with making changes) and used the expirationSeconds to modify the lifetime of the token. I tried setting it to one day or 86400 seconds for the che-dashboard in the deployment. After restarting the che-dashboard pod, I confirmed that the lifetime of the token (located in /var/run/secrets/kubernetes.io/serviceaccount/token) had indeed changed.
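The comment does not say exactly which part of the deployment was edited. As a rough sketch only, assuming the change was made through a projected `serviceAccountToken` volume source (the mechanism the linked documentation describes), the relevant fragment could be generated like this. The function name and the volume name `sa-token` are hypothetical; a full replacement for the default mount would also need the CA certificate and namespace sources.

```python
import json

# Kubernetes rejects expirationSeconds below 10 minutes for projected tokens.
MIN_EXPIRATION_SECONDS = 600


def projected_token_volume(expiration_seconds: int, name: str = "sa-token") -> dict:
    """Build a pod-spec volume entry with a projected service account token."""
    if expiration_seconds < MIN_EXPIRATION_SECONDS:
        raise ValueError("expirationSeconds must be at least 600")
    return {
        "name": name,  # hypothetical volume name
        "projected": {
            "sources": [
                {
                    "serviceAccountToken": {
                        "path": "token",
                        "expirationSeconds": expiration_seconds,
                    }
                }
            ]
        },
    }


if __name__ == "__main__":
    # One day, as in the experiment described above.
    print(json.dumps(projected_token_volume(86400), indent=2))
```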

P.S. Frankly speaking, I do not like the option of using a long-lived token, because it contradicts security best practices. It seems to me that whoever made this change (token lifetime: 1y -> 1h) took a step in the right direction by using short-lived tokens. In my opinion, a well-written application should not cache the token indefinitely.

@vinokurig vinokurig moved this from 📅 Planned to 🚧 In Progress in Eclipse Che Team A Backlog Dec 6, 2024
@vinokurig
Contributor

I managed to decrease the Kubernetes token lifetime to 10 minutes, and I confirm that Kubernetes connection failure warnings appear every second right after the token expires. However, since Kubernetes updates the token in every pod, I could not reproduce the dashboard error, and all Kubernetes-related actions work fine even after the token expires.
Currently I am working on updating the jgroups-kubernetes dependency of che-server. This library is what writes the error to the che-server log after the token expires.

@vinokurig
Contributor

vinokurig commented Dec 9, 2024

Unfortunately, updating the jgroups-kubernetes dependency to the latest version did not solve the cyclic warning in the che-server log, so I filed an upstream issue.
As for the dashboard log error, I could not reproduce it with the refreshed Kubernetes token; all Kubernetes-related dashboard actions work fine, e.g. PAT add/list.

@vinokurig
Contributor

@karatkep could you please elaborate on what exactly does not work, apart from the errors in the logs? Can you open the dashboard page and navigate to user preferences?

@vinokurig
Contributor

To summarize:

  • If the Kubernetes service account token is refreshed after expiration, all functionality works as expected, except for the cyclic error in the che-server logs.
  • The che-server log error is caused by the jgroups-kubernetes dependency. The dependency is not used by the current functionality, so we should either update it when a version with the fix is available, or remove it as a leftover and check that the removal does not break anything.
  • We are going to update the fabric8 Kubernetes client to the latest version.

@ibuziuk
Member

ibuziuk commented Dec 10, 2024

@karatkep my understanding is that so far @vinokurig has not been able to reproduce the error, even with the short-lived token. Steps to reproduce would be highly appreciated.

Basically, all k8s interactions are happening using Fabric8-Kubernetes-Client for che-server and we plan to bump it to version 7.0.0 next sprint.
cc @manusa maybe you have some input on this situation? do we need to care about updating the token / /var/run/secrets/kubernetes.io/serviceaccount/token, or client handles the update under the hood - #23230 (comment) ?

@manusa

manusa commented Dec 11, 2024

cc @manusa maybe you have some input on this situation? do we need to care about updating the token / /var/run/secrets/kubernetes.io/serviceaccount/token, or client handles the update under the hood - #23230 (comment) ?

I understand that the Kubernetes Client in use is 6.10.0.

In this case, yes, there's a TokenRefreshInterceptor that reloads the config when the HTTP response contains an auth client error.

https://github.com/fabric8io/kubernetes-client/blob/9101a2fa4a8f912ff6cda23e4d4b59895ccdc755/kubernetes-client-api/src/main/java/io/fabric8/kubernetes/client/utils/TokenRefreshInterceptor.java#L123-L126

The interceptor logic will work and reload the Config as long as the Config was not provided manually.
Does this ring a bell? Setting a breakpoint at the linked lines of code should let you debug what happens at the moment the authorization fails.
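As a language-neutral illustration of the reload-on-auth-error pattern described above (this is not the fabric8 code, which is Java and lives at the linked lines), a minimal sketch might look like the following. The names `call_with_token_refresh`, `send`, and `load_token` are hypothetical.

```python
from typing import Callable, Tuple


def call_with_token_refresh(
    send: Callable[[str], Tuple[int, str]],
    load_token: Callable[[], str],
    cached_token: str,
) -> Tuple[int, str, str]:
    """Send a request; on HTTP 401, reload the token once and retry.

    `send` takes a bearer token and returns (status_code, body).
    `load_token` returns the current token, e.g. by re-reading the mounted file.
    Returns (status_code, body, token_to_cache_for_next_call).
    """
    status, body = send(cached_token)
    if status == 401:
        fresh = load_token()
        # Retry only if the reloaded token differs; otherwise the 401 is genuine.
        if fresh != cached_token:
            status, body = send(fresh)
            cached_token = fresh
    return status, body, cached_token
```

With this shape, a stale startup token costs one extra round trip rather than a permanent failure, provided the caller actually stores the returned token.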

@karatkep
Author

Hello @ibuziuk, @vinokurig. Please allow me to gather more details regarding this case. I will share them later today or tomorrow.

@karatkep
Author

@karatkep could you please elaborate on what exactly does not work, apart from the errors in the logs? Can you open the dashboard page and navigate to user preferences?

@vinokurig, the dashboard issue arises when a user attempts to start the devworkspace. Please see the screenshot below:
image

The endpoint /dashboard/api/devworkspace/running-workspaces-cluster-limit-exceeded is failing to function because it attempts to call the Kubernetes API endpoint /apis/org.eclipse.che/v2/checlusters, but it returns a 401 error due to the token being expired.

@karatkep
Author

Hello @vinokurig, just wanted to check if you need anything else from my side to unblock your investigation.

@vinokurig
Contributor

Hello @karatkep, sorry for the late response, I managed to reproduce the unauthorized error on dashboard, investigating ...

Projects
Status: 🚧 In Progress