Add a max failed CNRs threshold to Nodegroups #88

vincentportella · 2024-08-23T04:23:39Z

Adds a new maxFailedCycleNodeRequests attribute to the NodeGroup spec, which defines how many failed CNRs are allowed for a nodegroup before the observer stops generating new ones. Setting maxFailedCycleNodeRequests: 0 is equivalent to the current behaviour. This is not a breaking change.
The Cyclops controller will now delete any sibling failed CNRs when a new one reaches the Successful phase.
The cyclops_cycle_node_requests_by_phase metric has been updated to include a new nodegroup attribute, which is the name of the nodegroup the CNR was generated from. Manually created CNRs that don't match a nodegroup will include the attribute with an empty string.
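To illustrate the extra metric dimension, here is a minimal sketch of counting CNRs per (phase, nodegroup) pair. The `cnr` and `labels` types and the `countByPhaseAndNodeGroup` function are hypothetical stand-ins written for this example; they are not the types or the metrics code used in Cyclops.

```go
package main

import "fmt"

// cnr is a hypothetical, simplified stand-in for a CycleNodeRequest.
type cnr struct {
	Phase         string
	NodeGroupName string // empty for manually created CNRs that match no nodegroup
}

// labels is a hypothetical label set identifying one metric series.
type labels struct {
	Phase     string
	NodeGroup string
}

// countByPhaseAndNodeGroup tallies CNRs per (phase, nodegroup) pair.
// CNRs without a nodegroup are counted under the empty-string nodegroup.
func countByPhaseAndNodeGroup(cnrs []cnr) map[labels]int {
	counts := make(map[labels]int)
	for _, c := range cnrs {
		counts[labels{Phase: c.Phase, NodeGroup: c.NodeGroupName}]++
	}
	return counts
}

func main() {
	counts := countByPhaseAndNodeGroup([]cnr{
		{Phase: "Failed", NodeGroupName: "system"},
		{Phase: "Failed", NodeGroupName: "system"},
		{Phase: "Successful", NodeGroupName: "system"},
		{Phase: "Pending", NodeGroupName: ""}, // manually created CNR
	})
	// Map iteration order is not deterministic, so line order may vary.
	for l, n := range counts {
		fmt.Printf("phase=%q nodegroup=%q count=%d\n", l.Phase, l.NodeGroup, n)
	}
}
```

With the nodegroup in the key, failed CNRs can be told apart per nodegroup instead of being summed across the whole cluster.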


mwhittington21 · 2024-08-29T00:20:39Z

pkg/controller/cyclenoderequest/transitioner/test_helpers.go

@@ -30,6 +44,10 @@ type Transitioner struct {

 	CloudProviderInstances []*mock.Node
 	KubeNodes              []*mock.Node
+
+	extrakubeObjects []client.Object


nit: extraKubeObjects

mwhittington21 · 2024-08-29T00:37:37Z

pkg/controller/cyclenoderequest/transitioner/transitions_successful_test.go

+		},
+	}
+
+	// Failed CNR for the same nodegroup in a different namespace


Are CNRs intended to be namespaced? I can't see much value in separating them by namespace like this when they can still act on the same set of backing nodes provided by a cloud provider.

That would go back to when they were designed. I'm not particularly sure, though changing that is out of scope here.

mwhittington21 · 2024-08-29T00:38:18Z

pkg/controller/cyclenoderequest/transitioner/util.go

@@ -536,3 +536,38 @@ func (t *CycleNodeRequestTransitioner) validateInstanceState(validNodeGroupInsta

 	return false, nil
 }
+
+// deleteFailedSiblingCNRs finds the CNRs generated for the same nodegroup as
+// the one in the calling transitioner. It filters for deleted CNRs in the same


nit: it's not a calling transitioner, it's the parent struct technically. I would just say "the one in the transitioner"

mwhittington21 · 2024-08-29T00:53:59Z

pkg/observer/controller.go

 			}
 		}
-		if found {
+
+		if dropNodeGroup {
 			klog.Warningf("nodegroup %q has an in progress CNR.. skipping this nodegroup", nodeGroup.Name)


This log message now doesn't match reality. dropNodeGroup is being set to true when the number of failed CNRs exceeds the threshold, and not when it has an in progress CNR.

I will add a second log line there because Failed is defined as "in progress" in the existing implementation. I'm adding functionality to skip a certain number of them from counting as "in progress" so I will update for that.
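The distinction raised in this thread can be sketched as a decision function that returns a separate reason for each case, so each log line matches what actually happened. This is a simplified sketch, not the observer code: the `cnr` type, the `dropReason` function, and the message wording are hypothetical, and treating every phase other than Successful and Failed as "in progress" is an assumption.

```go
package main

import "fmt"

// cnr is a hypothetical, simplified stand-in for a CycleNodeRequest.
type cnr struct {
	NodeGroupName string
	Phase         string
}

// dropReason reports whether the observer should skip a nodegroup and why.
// A nodegroup is skipped when it has a CNR that is still running, or when its
// number of Failed CNRs exceeds maxFailed. With maxFailed == 0 a single Failed
// CNR is enough to skip, which matches the behaviour before this change.
func dropReason(nodeGroup string, maxFailed uint, cnrs []cnr) (bool, string) {
	var failed uint
	for _, c := range cnrs {
		if c.NodeGroupName != nodeGroup {
			continue
		}
		switch c.Phase {
		case "Successful":
			// Finished cleanly; does not block new CNRs.
		case "Failed":
			failed++
		default:
			// Assumption: any other phase means the CNR is still running.
			return true, fmt.Sprintf("nodegroup %q has an in progress CNR", nodeGroup)
		}
	}
	if failed > maxFailed {
		return true, fmt.Sprintf("nodegroup %q has %d failed CNRs, above the threshold of %d",
			nodeGroup, failed, maxFailed)
	}
	return false, ""
}

func main() {
	cnrs := []cnr{
		{NodeGroupName: "system", Phase: "Failed"},
		{NodeGroupName: "system", Phase: "Successful"},
	}
	for _, maxFailed := range []uint{0, 1} {
		drop, reason := dropReason("system", maxFailed, cnrs)
		fmt.Printf("maxFailed=%d drop=%t reason=%q\n", maxFailed, drop, reason)
	}
}
```

Returning the reason alongside the boolean keeps the "in progress" and "too many failed" cases from sharing one warning message.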

Add a max failed CNRs threshold to a Nodegroup

c40d602

vincentportella changed the title ~~Add a max failed CNRs threshold to a Nodegroup~~ Add a max failed CNRs threshold to Nodegroups Aug 23, 2024

vincentportella self-assigned this Aug 23, 2024

vincentportella requested review from mwhittington21 and awprice August 23, 2024 04:24

Adjust function to drop nodegroups + add tests

8fc6b65

awprice previously approved these changes Aug 26, 2024


Merge branch 'master' of https://github.com/atlassian-labs/cyclops into vportella/add-max-failed-cnrs-threshold

bffa3ce

vincentportella dismissed awprice’s stale review via bffa3ce August 26, 2024 05:02

vincentportella added 2 commits August 27, 2024 14:23

Change CNR count by phase metric to include nodegroup name attribute

90f3b95

Merge branch 'master' of https://github.com/atlassian-labs/cyclops into vportella/add-max-failed-cnrs-threshold

92c98d5

mwhittington21 requested changes Aug 29, 2024


update for comments

1d47b9c

mwhittington21 previously approved these changes Aug 29, 2024


Make sure Failed cnrs created after Successful are not deleted

58fb8ae

vincentportella dismissed mwhittington21’s stale review via 58fb8ae September 2, 2024 00:01

mwhittington21 previously approved these changes Sep 2, 2024


nit: updating comment

4eedd00

vincentportella dismissed mwhittington21’s stale review via 4eedd00 September 2, 2024 00:14

mwhittington21 approved these changes Sep 2, 2024


MinyiZ self-requested a review September 2, 2024 00:23

MinyiZ approved these changes Sep 2, 2024


vincentportella merged commit dc687f3 into master Sep 2, 2024
3 checks passed

vincentportella deleted the vportella/add-max-failed-cnrs-threshold branch September 2, 2024 00:26
