fix getting updated user namespaces #2770

Open
wants to merge 2 commits into main from user_namespaces_fix

Conversation

alichaddad
Contributor

Closes

What changed?
When retrieving a user's namespaces, we now take a lock around the whole flow of reading the namespaces from the cache and updating it when necessary. We also only add a cluster's namespaces to a user's namespaces when that cluster is actually present in the cluster namespaces cache.

Why was this change made?
This change fixes an issue where requests for resources such as Kustomizations or HelmReleases would sometimes return empty results.
The issue happens when one request checks the user namespaces in the cache while a previous request is still building that cache, producing a response with partial or empty results. It becomes more apparent when a cluster has an issue (it is unreachable or has no namespaces for whatever reason), which makes empty responses a common occurrence. This is due to how the implementation was written:

  • The user namespaces are calculated from the namespaces in each cluster, filtered based on user permissions, and stored in a cache with a TTL of 30s
  • When a request is made for user namespaces, it is served from the cache; if the cache is empty, which happens periodically when the TTL expires, the request attempts to update the cache and return the data
  • If another request arrives during that update, it only checks whether the cache is empty or not, which means that even if a single key has been added it gets the current user namespaces, resulting in partial results. Subsequent requests return correct responses once the cache is properly built again (see the sketch below)
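
A rough illustration of the pattern described above (hypothetical names, not the actual code in this repository): the read path treats "the cache has at least one entry" as "the cache is fully built", so a request arriving while another request is still repopulating an expired cache can see partial data.

```go
package cache

import "sync"

// usersNamespaces is a hypothetical sketch of the cache described above:
// user ID -> cluster name -> namespace names, rebuilt whenever the TTL expires.
type usersNamespaces struct {
	mu     sync.RWMutex
	byUser map[string]map[string][]string
}

// Get illustrates the racy read: "has at least one key" is treated as
// "fully built", so a request arriving while another request is still
// repopulating the expired cache can observe partial results.
func (c *usersNamespaces) Get(userID string) (map[string][]string, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	nss, ok := c.byUser[userID]
	return nss, ok && len(nss) > 0
}
```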

The proposed fix is to lock each user request by the user ID. If a request arrives from the same user while another is in progress, it only checks the status of the cache after the in-flight request has finished.
If a cluster has issues retrieving namespaces, or the cluster itself is unreachable, it is not set in the cluster namespaces cache, but it was still set in the user namespaces cache with an empty namespaces list. The PR changes this behavior and only adds a cluster's namespaces to the user cache when that cluster already exists in the cluster namespaces cache (see the sketch below).
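
As a rough sketch of that second change (hypothetical names, not the PR's exact code), the per-user cache is only populated from clusters that the cluster namespaces cache actually contains, so an unreachable cluster no longer produces an empty entry:

```go
// buildUserNamespaces copies only the clusters present in the cluster
// namespaces cache into the user's cache, filtering by permission.
// Hypothetical sketch; the real code performs the user permission checks.
func buildUserNamespaces(clusterNS map[string][]string, allowed func(cluster, ns string) bool) map[string][]string {
	out := map[string][]string{}
	for cluster, namespaces := range clusterNS {
		for _, ns := range namespaces {
			if allowed(cluster, ns) {
				out[cluster] = append(out[cluster], ns)
			}
		}
	}
	return out
}
```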

How was this change implemented?
By adding a lock when retrieving the user namespaces. Since there is a lock per user ID, we had to store the locks in a map. Currently this map is not cleaned up, which is something we may need to handle down the line.
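
A minimal sketch of that locking scheme, assuming a map of mutexes keyed by user ID and guarded by its own mutex (names are hypothetical, not the PR's exact code):

```go
package fetcher

import "sync"

// userLocks hands out one mutex per user ID so that concurrent requests
// from the same user serialise around the cache check and rebuild.
// Entries are never removed, matching the caveat mentioned above.
type userLocks struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex
}

func (l *userLocks) forUser(userID string) *sync.Mutex {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.locks == nil {
		l.locks = map[string]*sync.Mutex{}
	}
	m, ok := l.locks[userID]
	if !ok {
		m = &sync.Mutex{}
		l.locks[userID] = m
	}
	return m
}
```

Callers would take `locks.forUser(user.ID).Lock()` before checking or rebuilding the user namespaces cache, and release it once the rebuild finishes.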

How did you validate the change?
By adding an invalid cluster to my instance, which consistently caused certain APIs to return empty responses before my changes, and then redoing the same scenario after my changes.

Release notes

Documentation Changes

@alichaddad alichaddad added the bug Something isn't working label Sep 22, 2022
@alichaddad alichaddad requested review from a team, bigkevmcd and foot September 22, 2022 07:14
@ozamosi
Contributor

ozamosi commented Sep 22, 2022

Thank you for the excellent explanation of the problem!

  • This puts the UsersNamespaces lock outside of it, in the factory itself, while the ClustersNamespaces lock is still inside the ClustersNamespaces struct. Is there a reason not to make them the same?
  • This - just like the UsersNamespaces TTL cache itself, I've now spotted, so perhaps a separate, bigger issue - uses the user.ID as a unique key. However, there are various auth forms that leave user.ID empty, so in that case the cache/lock will not be able to distinguish between different users.

@alichaddad
Contributor Author

Thanks for taking a look!

  • This puts the UsersNamespaces lock outside of it, in the factory itself, while the ClustersNamespaces lock is still inside the ClustersNamespaces struct. Is there a reason not to make them the same?

Good point, we might have to rethink the locking methodology in this part; it seems there might be too much of it. The case of clustersNamespaces is a bit different: it gets refreshed periodically and consistently without an external trigger, so the need for an extra lock is not the same, since the cache getting completely cleared out is rather rare. For userNamespaces this happens constantly and the rebuild is triggered by a user request; since that entails a lot of external actions that might overlap with another request, the lock is needed. Does that make sense?

  • This - just like the UsersNamespaces TTL cache itself, I've now spotted, so perhaps a separate, bigger issue - uses the user.ID as a unique key. However, there are various auth forms that leave user.ID empty, so in that case the cache/lock will not be able to distinguish between different users.

This would indeed be a problem. Is there another unique identifier that we can use for users?

@foot
Contributor

foot commented Sep 22, 2022

Is there another unique identifier that we can use for users?

There might be a .toString() kind of thing on the Principal. But it might strip out the token.. which would make it not great for this. The token is potentially sensitive, so we don't want it in logs etc.; maybe we add a .Hash() to the Principal instead?
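
For illustration only, a `.Hash()` along those lines might look like the following (a hypothetical sketch, not an agreed design); it folds the ID and token into a digest so the key is stable per user while the raw token never ends up in logs or cache keys:

```go
package auth

import (
	"crypto/sha256"
	"encoding/hex"
)

// Principal is a stand-in for the real principal type (hypothetical).
type Principal struct {
	ID    string
	Token string
}

// Hash returns a stable, non-reversible key derived from the principal's
// ID and token, usable as a cache/lock key without exposing the token.
func (p *Principal) Hash() string {
	sum := sha256.Sum256([]byte(p.ID + "/" + p.Token))
	return hex.EncodeToString(sum[:])
}
```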

@alichaddad alichaddad force-pushed the user_namespaces_fix branch 2 times, most recently from ac6b04d to a8da483 Compare October 4, 2022 11:40
@lasomethingsomething
Contributor

@alichaddad @foot Is this still in progress?
