
Azure images #35

Closed · wants to merge 57 commits

Conversation

@flokli (Collaborator) commented Dec 7, 2023

  • disk creation has been moved to use the nixpkgs tooling to build images and boot machines with the desired NixOS config directly, rather than infecting them after the first boot
  • new terraform directory/state in "terraform/jenkins"
  • ssh config moved to cloud-init. Changing this still requires VM recreation, but no new image
  • new terraform module to create an image and upload it into Azure
  • new terraform module to create a VM in Azure, exposing knobs for machine identity, port forwards and data disk attachment
  • new machine "binary-cache" defined, exposing a blob storage container / bucket via HTTPS
    • uses machine identity to get access to the binary cache bucket
  • data disk handling overhauled; longer-living state is now moved to separate data disks, which are mounted by cloud-init

@flokli requested review from henrirosten, brianmcgee and mkaapu and removed request for henrirosten and brianmcgee on December 7, 2023 18:57
@flokli force-pushed the azure-images branch 2 times, most recently from 5b5835a to 2e1553a on December 7, 2023 19:46
flake.nix (resolved)
@alextserepov (Contributor) left a comment:

A bit hard to follow without the full big picture in mind, but here are some comments from my side.

terraform/jenkins/jenkins_controller.tf (resolved)
terraform/jenkins/binary_cache.tf (outdated, resolved)
@henrirosten self-requested a review December 8, 2023 10:53
@henrirosten (Collaborator) left a comment:

I realize this is a PoC at this point; it's unclear to me which parts are still not fully ready for review, so I might be nagging about things that are still in progress. Please let me know if that's the case.

Documentation:

  • We should update at least the main README.md and terraform/README.md to reflect the changes done in this PR.
  • Documentation should cover the intended workflow: how to do a clean install, how to update hosts, how secrets are deployed and updated, etc.
  • About portability: it would be good to have some guidelines on how this could be ported to other cloud providers. It would also be good to address how to deploy in non-cloud environments.

The new configuration seems completely disconnected from the earlier config and usage, which is currently explained in https://github.com/tiiuae/ghaf-infra/blob/main/README.md. I'm fine with changing this, perhaps completely dropping the task.py helper and earlier configurations where possible, but I think this should be addressed somewhere, explaining why two separate configuration methods are needed, as well as when and how each should be used.

@flokli force-pushed the azure-images branch 3 times, most recently from 9aae79d to d25787d on December 13, 2023 15:58
@flokli (Collaborator, Author) commented Dec 13, 2023

There's been a bit of back and forth w.r.t. ensuring the data disks get mounted reliably, so I didn't update the branch. Sorry for the radio silence. It's now sorted out.

The config is a shift from the previous method, yes:

  • machines are considered cattle and short-lived. Whenever we change the Nix config of a host, we bake a new "appliance image", tear down the VM, and start it up again with the new image.
  • machines doing the exact same thing but using another bucket or domain name have a small config knob that can be set at runtime (boot), before the service starts, so we can use the same "appliance image". In Azure, we pass this in via cloud-init, but the mechanism itself is cloud-agnostic and works on bare metal too.
  • only the state on the explicitly created data volumes is preserved (things like SSL certificates).

This means there's not really a need to SSH in and change things, or activate a new system configuration, which removes the need for a lot of the steps currently documented.

Ideally, bare-metal machines are treated very similarly - we could have a "written once on every boot, during boot" system partition containing the appliance image (or netboot the kernel/initrd and run from there), and we'd only need data partitions where/if we need state.

Yes, the docs need updating, but we're currently in this transition state where we only have the config for some hosts (and I don't want to delete documentation on how to deal with the other currently present hosts).

It might make more sense to do this once we have the full POC to show.

@henrirosten previously approved these changes Dec 14, 2023
@flokli force-pushed the azure-images branch 4 times, most recently from ff6f6bd to abeea86 on December 19, 2023 15:53
@flokli (Collaborator, Author) commented Dec 19, 2023

OK, this should be ready for another review, as most of the Nix-specific build infrastructure components are now tied together.

  • I had to spend quite some time getting the Azure platform support in nixpkgs into shape, so we can mount data volumes (to keep state where needed in an image-based deployment model) and use the resource disk as an overlay for /nix/store (with that mount having to happen quite early). The latter is more or less required in Azure, as the system disk is too small for the /nix/store, at least on a builder. It all worked out, but nixpkgs currently points to a version containing these low-level fixes. Once that's merged, we can move back to upstream.
  • There are now two build machines allowing SSH (only) for a remote-build SSH user with that public key. No individual user accounts or secrets live on these machines. (The VMs have read-only access to the bucket to substitute from there directly, but that's instance-identity-based and mostly to avoid having to pipe all of this through the binary-cache VM, which is more for non-Azure consumers.)
  • There are two secrets used on the jenkins-controller VM (only): an ed25519 SSH key and the Nix cache signing key. They are provided via Azure Key Vault. We currently use a Python script to export these secrets into two files, but I'd like to switch this to systemd service credentials and an Azure-Key-Vault-compatible systemd credentials daemon, so we can use iptables to restrict access to the metadata server to uid 0 only (and rclone-http.service, or use another workaround there). This is to prevent other users on the system from accessing the Azure Key Vault as well.
  • /etc/nix/machines on the jenkins-controller VM is populated with the list of remote builders and their host keys. Currently this happens once on startup, though we might want to make this more flexible and dynamic in the future. That might require some more machinery to dynamically register machines and publish host keys when ready.
  • The jenkins-controller VM has a post-build-hook configured, which uploads every build into the Azure storage account, signing it with the cache signing key while doing so (see the sketch below).
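For illustration only (not the actual hook from this PR): a minimal Python sketch of what such a Nix post-build-hook can look like. Nix invokes the hook after each successful build with the produced store paths in $OUT_PATHS; the key path mirrors /etc/secrets/nix-signing-key mentioned in this PR, while the cache URI is a placeholder.

```python
#!/usr/bin/env python3
"""Hypothetical post-build-hook sketch: sign freshly built store paths and
copy them to a binary cache. The cache URI is a placeholder."""
import os
import subprocess

SIGNING_KEY = "/etc/secrets/nix-signing-key"   # key location used in this PR
CACHE_URI = "https://example-cache.invalid"    # placeholder destination

def main() -> None:
    # Nix passes the built store paths space-separated in $OUT_PATHS.
    out_paths = os.environ.get("OUT_PATHS", "").split()
    if not out_paths:
        return
    # Sign the paths with the cache signing key (Nix 2.4+ CLI).
    subprocess.run(
        ["nix", "store", "sign", "--key-file", SIGNING_KEY, *out_paths],
        check=True,
    )
    # Upload the signed paths to the binary cache.
    subprocess.run(["nix", "copy", "--to", CACHE_URI, *out_paths], check=True)

if __name__ == "__main__":
    main()
```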

All these machines are dispensable and, except for the few data volumes we mount, lose all state whenever the configuration is changed. Most "environment-specific config" (like hostnames to use) is populated on startup through cloud-init, treating the individual image more like an "appliance" (Nix substituters and binary cache public keys being one annoying exception, but there are ways to work around this too).

What's missing is configuring the Jenkins pipelines themselves, so Nix builds are triggered - though the setup can already be tested by SSH-ing into the jenkins-controller VM as an individual user (which is untrusted in terms of Nix) and triggering a build of something that's not already in one of the caches.

hosts/builder/configuration.nix (resolved)
hosts/jenkins-controller/configuration.nix (resolved)
hosts/jenkins-controller/configuration.nix (resolved)
"""
This script retrieves a secret specified in $SECRET_NAME
from an Azure Key Vault in $KEY_VAULT_NAME
and prints it to stdout.
Contributor:

Requested clarification in Slack.

terraform/jenkins/binary_cache_signing.tf (resolved)
secret_permissions = [
"Get",
"List",
"Set"
Contributor:

Need to understand the big picture on this a bit more. I don't feel comfortable with List & Set, but it depends on the context, naturally.

Author (@flokli):

You need to grant Terraform itself access to enter secrets into that vault. See the comment in L30.

# terraform import secret_resource.binary_cache_signing_key "$(< ./secret-key)"
resource "secret_resource" "binary_cache_signing_key" {
lifecycle {
prevent_destroy = true
Contributor:

What is our resource lifecycle policy in general? Do we have it documented somewhere?

Author (@flokli):

This is mostly a safeguard against Terraform deleting that resource from the state (it only lives in the state), e.g. in case it gets moved around or renamed. It is a recommended best practice, per https://github.com/numtide/terraform-provider-secret?tab=readme-ov-file#example.

I think the rest of this question is more about policy than about this code in particular.

@henrirosten self-requested a review December 20, 2023 08:23
@henrirosten (Collaborator) left a comment:

In general, this seems good to me. Also, thanks for the overall summary in #35 (comment), as well as for the helpful comments throughout the changes and in the commit messages.

With that said, I would still like to have 'end-user' documentation to better understand your thoughts on the intended workflow: how to do a clean install, how to update hosts, how secrets are updated, how this config could be moved to other cloud providers, etc.

What I'm a bit concerned about is that we seem to bind ourselves pretty strictly to Azure. Indeed, it would be important to be able to apply this configuration on other providers or bare-metal installations, or at least understand what changes that would require.

Again, I realize this is still a PoC, but I still believe we should have such 'end-user' documentation available as early as possible - it would also help the review.

hosts/jenkins-controller/get_secret.py (outdated, resolved)
hosts/binary-cache/configuration.nix (outdated, resolved)
hosts/builder/configuration.nix (resolved)
hosts/jenkins-controller/configuration.nix (resolved)
in ''
umask 077
mkdir -p /etc/secrets/
${get-secret} > /etc/secrets/remote-build-ssh-key
Contributor:

so, essentially we have bought a safe. Very expensive one. It is in root's room. And we store our precious secret there. And here we make a copy of our secret and put it on the desk near it.

Author (@flokli):

Connection to the builders happens via SSH. I don't think there's an SSH key agent that's able to offload the key exchange / challenge response to Azure Key Vault while making the SSH connection.

I need to get some secret onto the machine to be able to SSH in, and I'm mostly following Azure's best practices here of not putting any secrets into cloud-init userdata.

If we want to keep using SSH to authenticate to builders (I don't think we should; long-term we should move towards a "build agents register themselves" model), I'd rather go with something like https://learn.microsoft.com/en-us/entra/identity/devices/howto-vm-sign-in-azure-ad-linux, if that can work with machine identities. Note this brings its own SSH and PAM modules, none of which are properly packaged in any distro; it's just a closed-source software package that the agent imperatively tries to install (which will fail on NixOS).

Contributor:

I still disagree with printing private keys to stdout. Even if stdout is redirected to a file.

Author (@flokli):

So you'd prefer the Python script to support writing to the file directly?
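For illustration, a minimal sketch of that option, assuming the azure-identity and azure-keyvault-secrets Python packages, the $KEY_VAULT_NAME/$SECRET_NAME variables from the script's docstring, and a hypothetical $SECRET_PATH for the output location:

```python
"""Hypothetical variant of get_secret.py that writes the secret to a file
with 0600 permissions instead of printing it to stdout."""
import os

from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

def main() -> None:
    vault_name = os.environ["KEY_VAULT_NAME"]
    secret_name = os.environ["SECRET_NAME"]
    out_path = os.environ["SECRET_PATH"]  # hypothetical output location

    client = SecretClient(
        vault_url=f"https://{vault_name}.vault.azure.net",
        credential=DefaultAzureCredential(),  # uses the VM's managed identity
    )
    secret = client.get_secret(secret_name).value or ""

    # Create the file with 0600 permissions; the secret never touches stdout.
    fd = os.open(out_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as f:
        f.write(secret)

if __name__ == "__main__":
    main()
```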

in ''
umask 077
mkdir -p /etc/secrets/
${get-secret} > /etc/secrets/nix-signing-key
Contributor:

And here is another desk with our secret ... ;)

Author (@flokli):

Nix is pretty bad when it comes to signing, and only supports the private key being present on some filesystem - here's the equivalent desk in the existing Hydra setup:

/run/current-system/sw/bin/chmod 440 /etc/nix/hydra/secret

I agree signing shouldn't happen on the box; instead we should offload the signing to an HSM / Key Vault, so even the jenkins-controller can't see the secret. As long as that other signer can establish machine identity and supports signing arbitrary ed25519 payloads (the NARInfo fingerprint), the rest of Nix doesn't care how the signature was made.

I already wrote most of the building blocks w.r.t. calculating fingerprints, adding signatures to NARInfo files etc. I'll add a future-work item for this in the docs.
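For context, a minimal sketch of the signing mechanism being discussed (not the building blocks referenced above): the fingerprint is the string "1;<store path>;<NAR hash>;<NAR size>;<comma-separated references>", and the NARInfo signature is an Ed25519 signature over it, encoded as "<key name>:<base64 signature>". The key name and store paths below are made up; the sketch uses the 'cryptography' package.

```python
"""Sketch of how a Nix NARInfo signature is formed: an Ed25519 signature
over the fingerprint string, emitted as "<key-name>:<base64-signature>"."""
import base64

from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

def fingerprint(store_path: str, nar_hash: str, nar_size: int, refs: list[str]) -> str:
    # NARInfo fingerprint: 1;<store path>;<NAR hash>;<NAR size>;<references>
    return f"1;{store_path};{nar_hash};{nar_size};{','.join(refs)}"

def sign(key_name: str, key: Ed25519PrivateKey, fp: str) -> str:
    sig = key.sign(fp.encode("utf-8"))
    return f"{key_name}:{base64.b64encode(sig).decode('ascii')}"

# Example with made-up values; a real deployment would load the cache key
# (or delegate this step to an external signer such as a key vault / HSM).
key = Ed25519PrivateKey.generate()
fp = fingerprint(
    "/nix/store/aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa-hello-2.12",
    "sha256:0000000000000000000000000000000000000000000000000000",
    1234,
    ["/nix/store/aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa-hello-2.12"],
)
print(sign("example-cache-1", key, fp))  # value for the "Sig:" field in a .narinfo
```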

Contributor:

As discussed, in the short term the signing can simply be replaced with an API call to the Key Vault. The Key Vault stores a key instead of a secret and does the signing with that key. We send the hash over the REST API, for example, and get the signature back. The public key can be delivered to the receiving party, or the same REST API call can be made to the same Key Vault. The Key Vault can be backed by software or by a real HSM, depending purely on the budget. Naturally, access policy and firewalling must be considered here.

Author (@flokli):

Nix doesn't support any of this yet; it insists on having the private key available as a file. That's the current state of things.

If you want to go into patching Nix, there are plans to change this, and a draft PR, but none of this is likely to land in the very short term.

Contributor:

OK, since we are not using nix-serve, as discussed, we might as well just sign in Jenkins in a "post build" step. There is nothing that would prevent us from running it as one extra post-build step, right?

Author (@flokli):

Signing store paths happens on all builds and is handled by the Nix daemon, not Jenkins directly.

Jenkins doesn't have access to the signing keys or any way to influence the uploading process, and this ensures all builds are captured. But we can extend/change the post-build-hook script to do something else for signing, yes.

terraform/jenkins/builder.tf (resolved)
terraform/jenkins/jenkins_controller.tf (resolved)
terraform/jenkins/jenkins_controller.tf (resolved)
resource_group_name = azurerm_resource_group.default.name
location = azurerm_resource_group.default.location

security_rule {
Contributor:

Why do we need inbound SSH on the Jenkins controller?

Author (@flokli):

So I can SSH in and debug the config. Also, until Jenkins is configured declaratively and uses SSO for login and RBAC, the admin password needs to be fetched from the logs / some file in the Jenkins state.

Contributor:

Please close this on the production system, unless there is a use other than debugging/provisioning.

Author (@flokli):

Yes, there's a block in the future-work section about SSO and declarative pipeline config. At least parts of this need to be addressed.

source = "hashicorp/azurerm"
}
secret = {
source = "numtide/secret"
Contributor:

What is this one?

Author (@flokli):

This is https://github.com/numtide/terraform-provider-secret, and it tells Terraform we mean numtide/secret, not the non-existent hashicorp/secret default.

The actual Terraform provider is provided through Nix (see devshell.nix).

Contributor:

Second line of the README: "Please be careful about your security stance before adopting this!" Are we careful about our security stance? ;)

Author (@flokli):

This is mostly about who has access to the state bucket: limiting the number of people with access to it, and feeling comfortable with having the secrets in there.

We only use it to store the signing key. If we move that into the Key Vault and delegate signing to it entirely, we might as well create the key material there, without the private key ever touching Terraform, removing the need for the provider altogether.

This adds an additional "remote-build" ssh user.

The Jenkins controller will use this user to do remote Nix builds.

Signed-off-by: Florian Klink <[email protected]>
Signed-off-by: Florian Klink <[email protected]>
This adds an allocate_public_ip boolean variable (defaulting to false),
and will only create a public IP if it's set to true.

Signed-off-by: Florian Klink <[email protected]>
This deploys two builders in a new subnet.

Signed-off-by: Florian Klink <[email protected]>
This creates an Azure Key Vault and adds the private key as a secret
there, then grants the jenkins-controller VM access to read that
secret.

Signed-off-by: Florian Klink <[email protected]>
Use the common group, instead of the current client object id.

Signed-off-by: Florian Klink <[email protected]>
This adds a fetch-build-ssh-key systemd service that fetches the ssh
private key into /etc/secrets/remote-build-ssh-key (owned by root),
and orders itself before nix-daemon.

Signed-off-by: Florian Klink <[email protected]>
Render /etc/nix/machines with terraform. In the future, we might want to
autodiscover this, or better, have agents register with the controller,
rather than having to recreate the VM whenever the list of builders is
changed.

Signed-off-by: Florian Klink <[email protected]>
This creates a Nix signing key, and uses terraform-provider-secret to
hold it in the terraform state.

It's then uploaded into an Azure key vault.

The jenkins-controller VM has access to it, and puts it at /etc/secrets/nix-signing-key.

A post-build-hook is configured, uploading every build to the binary
cache bucket, with the signature.

Signed-off-by: Florian Klink <[email protected]>
There's no need for any user to SSH into builders; this can be dropped.

Signed-off-by: Florian Klink <[email protected]>
This consumes a list of IPs to ssh-keyscan once, on startup.

In the future, we might want to add support for dynamic discovery, as well as additional (longer-lived) static hosts.

Signed-off-by: Florian Klink <[email protected]>
Prevent the repo and nixpkgs linter from fighting each other about
formatting.

Signed-off-by: Florian Klink <[email protected]>
This describes the current concepts and components in this PR with more
prose.

It also describes some of the known issues / compromises.
A much cleaner solution would be to allow offloading the signing mechanism to
Azure Key Vault, or "external signers in general".

In terms of complexity, there's already
Contributor:

Even simpler solution: generate the hash and send it via a REST API call to the Key Vault.

Author (@flokli):

Nix uses a fingerprint of a NAR and requires a signature for that. There's currently no direct mechanism inside Nix to retrieve this fingerprint (https://docs.tvix.dev/rust/nix_compat/narinfo/fn.fingerprint.html).

### Environment-agnostic
Images are supposed to be *environment-agnostic*, allowing multiple deployments
to share the same image / Nix code to build it.


@flokli (Collaborator, Author) commented Feb 29, 2024

Closing; it seems this all ended up in the main branch one way or another.
