
DO NOT MERGE: An attempt at restarting the controller manager without restarting the whole operator binary #1117

Draft · wants to merge 1 commit into base: master

Conversation

@metlos (Contributor) commented Jan 7, 2025

This is not meant for merging.

It is an experiment in soft-restarting the controller manager within the process of the host operator, so that we don't have to rely on external tooling (i.e. ksctl) to manually restart the operator when the set of registered members changes. The operator itself could detect these occasions and "heal" itself by soft-restarting, so that further reconciliations work with the up-to-date set of members. In the longer term, this could also give us the opportunity to consolidate the explicit set of member clusters established at operator start with the "cluster cache" used in the toolchaincluster controller and health checks.
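
For illustration, this is roughly the shape of such a soft-restart loop (a minimal sketch with hypothetical names, not the actual code in this PR): the manager is started with a cancellable context and rebuilt in a loop whenever a restart is requested.

package restart

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
)

// runManagerLoop is a hypothetical sketch: it (re)creates and (re)starts the
// controller manager until the parent context is cancelled. A receive on
// restartRequested (e.g. when the member cluster set changes) cancels the
// current run, and the loop then builds a fresh manager that sees the
// up-to-date set of members.
func runManagerLoop(ctx context.Context, restartRequested <-chan struct{}) error {
	for ctx.Err() == nil {
		mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{})
		if err != nil {
			return err
		}
		// ... register the controllers against the freshly created manager here ...

		runCtx, cancel := context.WithCancel(ctx)
		go func() {
			select {
			case <-restartRequested:
				cancel() // stops mgr.Start() and lets the loop start over
			case <-runCtx.Done():
			}
		}()

		if err := mgr.Start(runCtx); err != nil {
			cancel()
			return err
		}
		cancel()
	}
	return nil
}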

Let's discuss here whether this potential approach has some merit and whether this solution is actually what we'd want.

At the moment, the restarts are just triggered every minute. This is of course not how it would work in practice - we would request a restart only when a new member is detected or when an existing one is removed (see the sketch below).

The frequent restarts are put in place so that we can judge the effects on the memory consumption and/or see any leaks that might be happening in the controller manager due to the way we restart it.
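
Concretely, the experimental every-minute trigger boils down to something like this (a sketch; the Restarter interface is the one introduced in pkg/restart in this PR, the remaining names are illustrative):

package restart

import (
	"context"
	"time"
)

// triggerPeriodicRestarts asks for a soft-restart once a minute. In the real
// solution a restart would only be requested when a member cluster is added
// or removed.
func triggerPeriodicRestarts(ctx context.Context, restarter Restarter) {
	ticker := time.NewTicker(time.Minute)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			restarter.RestartNeeded()
		}
	}
}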

I ran the experimental version of the host-operator in a cluster and continuously measured its memory consumption for 5 hours after creating 5000 user signups in the cluster. This means that a non-trivial number of objects had to be loaded into the cache managed by the controller manager, so we should be able to see the effects of any leaks if there were any.

So, in the experiment, the host-operator was reconciling 5000 user signups, restarting every minute, for 5 hours. The results are compiled in this spreadsheet: https://docs.google.com/spreadsheets/d/1maSd_KUUYBwA3U1d3UdSMPaQkEwXYmbhWg4duCBvfO0/edit?gid=0#gid=0.

The memory consumption seems quite stable to me:
[screenshot: memory consumption graph]

…e whole operator binary

+ a very nasty way of seeing if it works - the controller manager should
restart every minute :)

openshift-ci bot commented Jan 7, 2025

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all


openshift-ci bot commented Jan 7, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: metlos

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci bot added the approved label on Jan 7, 2025
@openshift-merge-robot

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.


sonarqubecloud bot commented Jan 7, 2025


codecov bot commented Jan 7, 2025

Codecov Report

Attention: Patch coverage is 27.72586% with 232 lines in your changes missing coverage. Please review.

Project coverage is 79.38%. Comparing base (9237d9c) to head (7bf2ff2).
Report is 2 commits behind head on master.

Files with missing lines   Patch %   Lines
main.go                    0.00%     212 Missing ⚠️
pkg/restart/restart.go     81.65%    16 Missing and 4 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1117      +/-   ##
==========================================
- Coverage   79.44%   79.38%   -0.07%     
==========================================
  Files          78       79       +1     
  Lines        7785     7903     +118     
==========================================
+ Hits         6185     6274      +89     
- Misses       1422     1447      +25     
- Partials      178      182       +4     
Files with missing lines   Coverage Δ
pkg/restart/restart.go     81.65% <81.65%> (ø)
main.go                    0.00% <0.00%> (ø)

@MatousJobanek (Contributor) left a comment


Thanks a lot for this experimental PR, it looks very promising.

setupLog.Error(err, "unable to create controller", "controller", "SpaceBindingRequest")
SegmentClient: segmentClient,
ClusterManager: capacity.NewClusterManager(namespace, mgr.GetClient()),
}).SetupWithManager(ctx, mgr); err != nil {

be careful which context you use here. This should be taken from the caller (the startManager)
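
A sketch of what that could look like (the field and helper names are purely illustrative): startManager receives the context and hands that same context to every SetupWithManager call, so that cancelling it during a soft-restart also covers whatever was set up here.

// Illustrative only: reuse the context received from the caller for all
// controller setups instead of creating a separate one.
func (s *StartManager) startManager(ctx context.Context, mgr ctrl.Manager) error {
	// controllerSetups is a hypothetical []func(context.Context, ctrl.Manager) error
	for _, setup := range s.controllerSetups {
		if err := setup(ctx, mgr); err != nil {
			return err
		}
	}
	return nil
}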

Comment on lines +20 to +22
type Restarter interface {
RestartNeeded()
}

do we really need an interface?
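
If a single implementation is enough, a plain function value would do the same job - a sketch of that alternative:

// Hypothetical alternative to the Restarter interface: a plain function type.
type RestartFunc func()

// A consumer would then just call the function when the member set changes:
func onMemberSetChanged(requestRestart RestartFunc) {
	requestRestart()
}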

Comment on lines +42 to +46
if s.runningContext == nil {
// this actually means that the start failed, but we need a non-nil value for the runningContext
// so that we can select on it below
s.runningContext = context.TODO()
}

if the start fails, why do we carry on here instead of returning an error?
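
For comparison, the change hinted at here would look roughly like this (a sketch, assuming the surrounding method can return an error):

// Hypothetical alternative: treat a nil runningContext as a failed start and
// surface it instead of substituting context.TODO().
if s.runningContext == nil {
	return fmt.Errorf("the controller manager failed to start")
}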

debug("doStart: about to invoke mgr.Start()")
if startErr := mgr.Start(s.runningContext); startErr != nil {
debug("doStart: mgr.Start() returned error. Sending it.")
err <- startErr

if the manager cannot be started, then we can just simply force-kill, right?

Suggested change:
-	err <- startErr
+	os.Exit(1)

There is nothing we could try to recover. Or is there any other cleanup we would need to do in this case?


I have one more note/question - I understand the usage of the channel in this case, but I don't understand why we need to use the channel for all the other error cases returned from this method. Is it only to unify the error handling, or is there another reason?


// now, let's see what caused the cancellation

if !errors.Is(context.Cause(s.runningContext), restartNeededError) {

checking the cause passed to the cancel function and comparing it is a bit weird to me - it reminds me of catching exceptions in Java. I'm not sure whether this is a best practice in Go, though.

Instead of using a "CancelCause" context, you could use the standard cancellable context and add another field to the StartManager stating whether a restart was triggered (in the same way as you do it with the running field). And have another case for when the root context itself is cancelled:

		case <-ctx.Done():
			debug("Start: context cancelled, cancellation from above")
			<-finished
			debug("Start: quitting")
