Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC] Gating Flux reconciliation #3158

Draft
wants to merge 2 commits into
base: main
Choose a base branch
from
Draft

[RFC] Gating Flux reconciliation #3158

wants to merge 2 commits into from

Conversation

stefanprodan
Copy link
Member

@stefanprodan stefanprodan commented Sep 29, 2022

Add proposal for introducing a gating mechanism to Flux that would offer a way for cluster admins and other teams involved in the release process to manually approve the rollout of changes onto clusters.

In addition, Flux could offer a way to define maintenance time windows and other time-based gates, to allow a better control of applications and infrastructure changes to critical system.

Note that this proposal is WIP, we need to cover more use-cases from #870

@stefanprodan stefanprodan added the area/rfc Feature request proposals in the RFC format label Sep 29, 2022
@hassenius
Copy link

I'm curious whether this proposal would easily be able to support

  1. When the gate is closed, will the kustomization/helmrelease controllers still be able to perform drift avoidance (i.e. continue to reconcile currently desired state)

  2. That the gate is open "for a specific revision/change", in other words that when the gate is open it can be known "what" the gate is open for?


```yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1beta1
kind: Kustomization
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I propose gates are to be applied at sources only instead of Kustomization and HelmRelease objects, as otherwise may face the non-trivial challenges below:

  • Incompatible desired state when Kustomization / HelmRelease objects are gated whilst others are not, specially when they depend on the same Flux source that is not gated.
    • Example 1: Kustomization objects app2 and app1 depends on GitRepository team-source. A change is merged into the repository and reconciled for object team-source, however app2 has a close Gate and app1 does not. In this case the desired state at both Git repository and the Flux source in the cluster are valid, but only one Kustomizations is being applied leading to an invalid observed state in the cluster. Same applies if both Kustomizations are behind a Gate which has its state changed before the second object is reconcilied (e.g. example 2 below).
  • Enforcing the desired state which was allowed before the Gate closing (User Story 2) will be difficult to manage from any point beyond the source, which could lead to the cluster to be in an invalid state.
    • Example 1: user manually deletes Kubernetes object managed in the hope Flux will re-create it.
    • Example 2: two kustomizations (parent and child) having the same Gate, the latter has a dependsOn pointing to the former. If the Gate closes after parent but before child is applied, the cluster observed state may be invalid (considering the desired state at source).

- How can an operator determine if the feature is in use?
- Are there any drawbacks when enabling this feature?
-->

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few ideas on design details to be discussed:

Multiple Gates

Flux objects that support gating can specify multiple gates. By default,
all gates specified must be open for the reconciliation to go ahead. To
change the behavior spec.gates.require can be set to oneOf instead:

apiVersion: kustomize.toolkit.fluxcd.io/v1beta1
kind: GitRepository
metadata:
  name: flux-system
  namespace: flux-system
spec:
  gates:
    # all (default): all gates must be open for the reconciliation to go ahead.
    # oneOf: at least one of the gates must be open for the reconciliation to go ahead.
    require: oneOf # <all|oneOf>
    refs:
    - change-freeze # gate that enforces a change freeze time window
    - bypass-signoff # gate that allows other gates to be overriden.

When oneOf is used, a single open Gate is required for reconcilations
to proceed.

Recovering from wiped Storage

The source artifact storage lives in the running source controller instance. If its Pod
is recreated, all the artifacts will need to be reacquired. This process happens
automatically as part of each source reconciliation, however, when sources
are restricted by a gate they will require a new process.

In such situation the controller will need to fetch the version used before the Gate was
closed, and that was in the storage before it being wiped. Not all types will support this
and therefore here are the outcomes expected:

  • GitRepository: restores artifact based on the last known revision (commit ID).
  • S3 Buckets: fail reconciliation with "artifact not found artifact whilst under closed Gate".
  • OCI: TBC.
  • Helm Repository: TBC.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For OCI we have the upstream digest in status, so like with Git, we can use it to pull the data.

```

You could also schedule "No Deploy Fridays" with a CronJob that closes the `maintenance` gate at `0 0 * * FRI`.

Copy link

@hassenius hassenius Oct 6, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nvmnd...moved...

> As a member of the SRE team, I want existing deployments to still be
> reconciled during a change freeze.

Gates can be used to block Flux sources from being refreshed, resulting in Flux

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To clarify that block Flux sources from being refreshed in this case means the gate prevent the source controller from making the new assets available, rather than to suspend the polling in the source controller?

It is desirable for the source to still be able to detect changes, such that in-cluster logic that is tied to the gating mechanisms can be notified of available changes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it stopped polling, it would avoid all the resource consumption (CPU, memory, Network and storage) for an operation that would not cause any side effect to the cluster, which would stream line the controllers to better handle reconciliations that were not gated.

@hassenius is there any specific scenario you have in mind?


### Goals

- Offer a dedicated API for defining time-based gates in a declarative manner.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest that the gates should not be exclusively time based, but also support specific revisions

This would ensure that Gate changes would not impact the eventual consistency of
mid-flight reconciliations that were already deployed in the cluster. Flux would also
continue to re-create Flux managed objects that were manually deleted from the cluster.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would propose an additional use case, bringing in the reference dimension in addition to the time dimension for gates, such that a gate can be defined as open to a specific reference, either in spec, or as an annotation

Story 3

As a multi-cluster operator, I want to control how cluster configuration changes
are rolled out across the cluster estate from a single configuration source

Gates can be used to control the timing and order of where configurations are reconciled

For example

apiVersion: gating.toolkit.fluxcd.io/v1alpha1
kind: Gate
metadata:
  name: approved-revisions
  namespace: flux-system
spec:
  openTo: 4f08fc93e31

or

apiVersion: gating.toolkit.fluxcd.io/v1alpha1
kind: Gate
metadata:
  annotations:
    open.gate.fluxcd.io/revision: 4f08fc93e31
  name: approved-revisions
  namespace: flux-system
spec:
  default: closed

@hiddeco
Copy link
Member

hiddeco commented Oct 7, 2022

Have been thinking about the current proposal, and am wondering if the ecosystem would benefit from a simple static Gate object which is actually controlled by a Gatekeeper (which knows about the mechanics around e.g. a CronJob).

By doing this, it would be much easier for others to write their own "gatekeeper" which e.g. listens to button presses in a UI.

objects.

## Design Details

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In highly regulated environments Gating can be used to ensure specific processes were observed before a new change made its way to a given environment. Flux controllers that support such mechanism should extend their log messages to express what Gates were taken into account at the time of a reconciliation, and their states.

@rupelu
Copy link

rupelu commented Nov 7, 2022

Just throwing in another idea. It would be great if a Gate could evaluate a metric. This way we could decline reconciliation on high load or when some long running task is not yet completed. A metrics based Gate would make checks, whether a release can be rolled out, more automatic. Of course as already mentioned by @pjbgf it would be highly appreciated if we would know the reason why a reconciliation did not happen and which Gate was involved - maybe even what the evaluation of the said metric was.

Also it would be great to force-open one or all Gates. This could make especially sense when a rollout starts with Gates opened but errors come up during the rollout process and the maintenance window reaches its end. Sometimes the way forward is better than a rollback and Operators should be able to decide that the release should be fixed and rolled out never the less. Maybe this could be done with a label or annotation overriding the close-time of the Gate.


### Goals

- Offer a dedicated API for defining time-based gates in a declarative manner.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- Offer a dedicated API for defining time-based gates in a declarative manner.
- Offer a dedicated API for defining gates in a declarative manner.


The `Gate` API would replace Flagger's current
[manual gating mechanism](https://docs.flagger.app/usage/webhooks#manual-gating).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additional audit trail for this feature:

  • Enriching existing logs so that actions that can be gated should log:
    • What gate(s) were checked (if any).
    • What gate(s) caused the final result of (not) going ahead with the action.
    • If multiple gates can be defined, changes to which gates are being observed and what logical operators (and/or) are being used, should be descriptively logged.
  • Logs for the new Gate controller:
    • Descriptive summary of the gate state at creation time.
    • Any change of Gate state should log a descriptive summary.
  • The additional logs above would be an "opt-in" feature, so that users that do not require this level of governance don't have to process the extra data.

@pjbgf
Copy link
Member

pjbgf commented Nov 25, 2022

Have been thinking about the current proposal, and am wondering if the ecosystem would benefit from a simple static Gate object which is actually controlled by a Gatekeeper (which knows about the mechanics around e.g. a CronJob).

By doing this, it would be much easier for others to write their own "gatekeeper" which e.g. listens to button presses in a UI.

I agree with @hiddeco, this would make the implementation quite extensible. Users could implement whatever logic they need to toggle the state on top of the static Gate.

In-tree "GateKeeper" implementations could be added based on most popular requests amongst the community.

@pjbgf
Copy link
Member

pjbgf commented Nov 25, 2022

Just throwing in another idea. It would be great if a Gate could evaluate a metric. This way we could decline reconciliation on high load or when some long running task is not yet completed. A metrics based Gate would make checks, whether a release can be rolled out, more automatic. Of course as already mentioned by @pjbgf it would be highly appreciated if we would know the reason why a reconciliation did not happen and which Gate was involved - maybe even what the evaluation of the said metric was.

@rupelu The Gate happens before a new state is applied into the cluster. So I think they are a better fit for defining whether or not sources should be refreshed, for the reasons I mentioned here. However, you could have similar results by using Flagger to do the progressive roll-out side of things for you.
If the release was broken, no (or few) users would be redirected to the new version. Rolling fixes back/forward, would still require the Gate to be opened though.

@pjbgf
Copy link
Member

pjbgf commented Dec 22, 2022

I have worked out an example on what this could look like if we were to follow @hiddeco's suggestion on using static gates. This is what I got to: https://hackmd.io/N-xK-y5fTCiFFnusF22Sfw. I talked with a few users to ascertain whether this would mitigate their key use cases, and so far the response has been positive.

The example is not meant to define the implementation details, but rather to flesh out the workflow so folks can raise concerns sooner rather than later.

@haggishunk
Copy link

Counterpoint:

I am fond of the shifting left that flux encourages with its watchful polling of sources like git repos and image registries. Maybe it's a blocker to widespread community adoption to not have a way to put a human in the loop, but in some cases that human that pushes the deploy button is not the one that merged the app code. Flux is differentiated from many other imperative pipeline CD tools by this very nature which sets the "human in the loop" to be those that commit artifacts to the watched sources of truth and the policies/controls around them.

Let's take the scenario of putting a gate up on a Friday afternoon, having the source evolve with new commits/pushes. Some developers finish their projects and update the source of truth, say with refs to new images. On Monday when the gate is lifted the accumulation of several code changes are released. If things perform unexpectedly, what is the last good state? The last commit before the gate went up, sure, but this is cross domain. I would much rather roll back a merge train that hit Monday morning with reverting commits. You will have better team engagement with the release if you remove the safety of a gate window.

Promoting the suspend feature to a more first class implementation seems to strike the right balance IMHO.

Disclaimer:

Please forgive my rant, but to me flux means "minutes of time difference between my gitops source of truth and my target environment" and I love it that way.

@antonmatsiuk
Copy link

Counterpoint:

I am fond of the shifting left that flux encourages with its watchful polling of sources like git repos and image registries. Maybe it's a blocker to widespread community adoption to not have a way to put a human in the loop, but in some cases that human that pushes the deploy button is not the one that merged the app code. Flux is differentiated from many other imperative pipeline CD tools by this very nature which sets the "human in the loop" to be those that commit artifacts to the watched sources of truth and the policies/controls around them.

Let's take the scenario of putting a gate up on a Friday afternoon, having the source evolve with new commits/pushes. Some developers finish their projects and update the source of truth, say with refs to new images. On Monday when the gate is lifted the accumulation of several code changes are released. If things perform unexpectedly, what is the last good state? The last commit before the gate went up, sure, but this is cross domain. I would much rather roll back a merge train that hit Monday morning with reverting commits. You will have better team engagement with the release if you remove the safety of a gate window.

Promoting the suspend feature to a more first class implementation seems to strike the right balance IMHO.

Disclaimer:

Please forgive my rant, but to me flux means "minutes of time difference between my gitops source of truth and my target environment" and I love it that way.

I think your argument is unrelated to what most of the people here ask for. The most valuable use case for the gating feature is to have "maintenance windows" on the environments that reconcile from a stable release branch, and not for having CD-style of deployment that you've described. Imagine you have dozens of environments that have to be updated across the globe within fixed timeframes. Do you want to wake up at night to press the merge button to update the environment? If not, then you have to play around with scheduling of "PR merge" events as CronJobs and creating different branches for different timezones in your version control platform. This will become messy and complex pretty fast. Gating feature is an elegant solution for this problem, and if you need to update the environment outside the maintenance window, an "override" flag should be there.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/rfc Feature request proposals in the RFC format
Projects
Status: No status
Development

Successfully merging this pull request may close these issues.

7 participants