Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KEP-4872: Harden Kubelet serving cert validation #4911

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

g-gaston
Copy link

@g-gaston g-gaston commented Oct 9, 2024

  • One-line PR description: Add first version of doc
  • Other comments: I left a few TODOs with things I think can use a discussion.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Oct 9, 2024
@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/auth Categorizes an issue or PR as relevant to SIG Auth. labels Oct 9, 2024
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: g-gaston
Once this PR has been reviewed and has the lgtm label, please assign ritazh for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Oct 9, 2024
@g-gaston g-gaston changed the title KEP-4872: Add Harden Kubelet serving cert validation KEP-4872: Harden Kubelet serving cert validation Oct 9, 2024
# The following PRR answers are required at alpha release
# List the feature gate name and the components for which it must be enabled
feature-gates:
- name: KubeletCertCNValidation
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about: NodeCertificateNameValidation?

The API server is verifying a certificate presented by a node, whatever software that node is running (which, OK, is very likely to be kubelet).


Provided an actor with control of a node can impersonate another node, the impact would be:

* Break confidentiality of the requests sent by the Kube-API server to the kubelet (e.g kubectl exec/logs).These are usually user-driven requests. That gives the threat actor the possibility of producing incorrect or mis-leading feedback. In the exec case, it could allow a threat actor to issue prompts for credentials. In addition, the exec commands might contain user secrets.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if this actor is already owning a node in the cluster then exec and logs does not need to impersonate a node, they just target pods in a node, and require the kubelet to terminate the connection

#### Metrics

In order to help cluster administrators determine if it's safe to enable the feature, we propose to add a new metric `kube_apiserver_validation_kubelet_cert_cn_errors` that will track the number of errors due to the new CN validation.
If the feature gate is disabled, we will still add the validation code to the HTTP transport, however, if the validation fails we won't return an error, we will just increment the metric counter.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm... normally we want all code for the new feature to be completely inert when the gate is disabled. That way if there are bugs or non-functional regressions (like performance issues, etc), disabling the feature gate gives a way to mitigate that.

I like the idea of surfacing the metric of how many CNs would fail the validation, but I would only do the check and publish the metric if the feature gate is enabled. Someone wanting to dry run could enable the gate to active the code but pass --disable-kubelet-cert-cn-validation=false to avoid failing requests if the validation fails.


##### e2e tests

End-to-end tests won't be needed as unit and integration tests will cover all the scenarios.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if you will be able to do everything with integrations , but since a lot of e2e test use logs, exec and portforward, I assume it will be implicitly covered by those test once the feature gate is enabled

This vulnerability can be exploited through ARP poisoning or other routing attacks, allowing a rogue node to obtain a certificate for an IP it does not own and reroute traffic to itself.

When the Kube API server connects to a kubelet, it verifies that the serving certificate is signed by a trusted CA and that the IP or hostname it’s connecting to is included in the certificate's SANs.
If a rogue node obtained a certificate for an IP it does not own and reroute traffic to itself, it would be able to impersonate a Node that reports that IP.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you expand on the "reroute traffic to itself" problem?
I may not get all the full details of the scenario, nodeA with IP1 is removed but somehow actorB manages to get its certificate, spawn a nodeB with nodeA name? IP2 or IP1??? , join it to the cluster and set the address IP1 on node.status.addresses ?


We will remove the metric once the feature is GA.

> TODO: let's discuss this in the review. We could consider adding the node name to the metric or even keeping the metric post GA if it's valuable.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

node name has the cardinality problem, but leaving the metric sounds like a good thing to me, it can help to detect issues with the node certificates due to bugs per example

#### Alpha

* Add feature flag for gating usage, off by default
* Add flag to disable extra validation
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remember you need to add the flag only if the feature flag is enabled, otherwise if the feature does not progress removing the flag can not be possible without breaking the users that are already setting it

@aojea
Copy link
Member

aojea commented Nov 5, 2024

+1 sounds a great addition, simple and very effective

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/auth Categorizes an issue or PR as relevant to SIG Auth. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
Status: KEP Backlog
Development

Successfully merging this pull request may close these issues.

5 participants