
Add hhmi cluster #3129

Merged 1 commit into 2i2c-org:master on Sep 14, 2023

Conversation

@GeorgianaElena (Member) commented on Sep 13, 2023

For #3080

Also updates the daskhub template so it no longer assumes that the first hub added to a cluster is named staging; instead, it uses the hub name variable passed through the command line.
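The template change above can be sketched as follows. This is a minimal, hypothetical illustration using Python's stdlib `string.Template`, not the deployer's actual templating machinery; the names `hub_entry` and `render_hub_entry` are invented for this sketch.

```python
# Hypothetical sketch: render the first hub entry from a name supplied on the
# command line, instead of hard-coding "staging". Illustrative only.
from string import Template

# A trimmed stand-in for a daskhub cluster.yaml template fragment.
hub_entry = Template(
    "hubs:\n"
    "  - name: $hub_name\n"
    "    template: daskhub\n"
)

def render_hub_entry(hub_name: str) -> str:
    """Substitute the CLI-provided hub name into the template fragment."""
    return hub_entry.substitute(hub_name=hub_name)

print(render_hub_entry("prod"))
```

With this shape, adding a first hub called anything other than `staging` renders correctly, since the name is a parameter rather than an assumption baked into the template.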

Terraform plan output
Terraform used the selected providers to generate the following execution plan. Resource actions are indicated with the following symbols:
  + create

Terraform will perform the following actions:

  # google_container_cluster.cluster will be created
  + resource "google_container_cluster" "cluster" {
      + cluster_ipv4_cidr           = (known after apply)
      + datapath_provider           = (known after apply)
      + default_max_pods_per_node   = (known after apply)
      + enable_binary_authorization = false
      + enable_intranode_visibility = (known after apply)
      + enable_kubernetes_alpha     = false
      + enable_l4_ilb_subsetting    = false
      + enable_legacy_abac          = false
      + enable_shielded_nodes       = true
      + enable_tpu                  = (known after apply)
      + endpoint                    = (known after apply)
      + id                          = (known after apply)
      + initial_node_count          = 1
      + label_fingerprint           = (known after apply)
      + location                    = "us-west2"
      + logging_service             = (known after apply)
      + master_version              = (known after apply)
      + monitoring_service          = (known after apply)
      + name                        = "hhmi-cluster"
      + network                     = "default"
      + networking_mode             = (known after apply)
      + node_locations              = [
          + "us-west2",
        ]
      + node_version                = (known after apply)
      + operation                   = (known after apply)
      + private_ipv6_google_access  = (known after apply)
      + project                     = "hhmi"
      + remove_default_node_pool    = true
      + self_link                   = (known after apply)
      + services_ipv4_cidr          = (known after apply)
      + subnetwork                  = (known after apply)
      + tpu_ipv4_cidr_block         = (known after apply)

      + addons_config {
          + horizontal_pod_autoscaling {
              + disabled = true
            }
          + http_load_balancing {
              + disabled = true
            }
        }

      + cluster_autoscaling {
          + autoscaling_profile = "OPTIMIZE_UTILIZATION"
          + enabled             = false
        }

      + monitoring_config {
          + enable_components = (known after apply)

          + managed_prometheus {
              + enabled = false
            }
        }

      + network_policy {
          + enabled = false
        }

      + node_config {
          + disk_size_gb      = (known after apply)
          + disk_type         = (known after apply)
          + guest_accelerator = (known after apply)
          + image_type        = (known after apply)
          + labels            = (known after apply)
          + local_ssd_count   = (known after apply)
          + logging_variant   = "DEFAULT"
          + machine_type      = (known after apply)
          + metadata          = (known after apply)
          + min_cpu_platform  = (known after apply)
          + oauth_scopes      = (known after apply)
          + preemptible       = false
          + service_account   = (known after apply)
          + spot              = false
          + taint             = (known after apply)
        }

      + release_channel {
          + channel = "UNSPECIFIED"
        }

      + workload_identity_config {
          + workload_pool = "hhmi.svc.id.goog"
        }
    }

  # google_container_node_pool.core will be created
  + resource "google_container_node_pool" "core" {
      + cluster                     = "hhmi-cluster"
      + id                          = (known after apply)
      + initial_node_count          = 1
      + instance_group_urls         = (known after apply)
      + location                    = "us-west2"
      + managed_instance_group_urls = (known after apply)
      + max_pods_per_node           = (known after apply)
      + name                        = "core-pool"
      + name_prefix                 = (known after apply)
      + node_count                  = (known after apply)
      + node_locations              = (known after apply)
      + operation                   = (known after apply)
      + project                     = "hhmi"
      + version                     = (known after apply)

      + autoscaling {
          + location_policy = (known after apply)
          + max_node_count  = 5
          + min_node_count  = 1
        }

      + management {
          + auto_repair  = true
          + auto_upgrade = false
        }

      + node_config {
          + disk_size_gb      = 30
          + disk_type         = (known after apply)
          + guest_accelerator = (known after apply)
          + image_type        = (known after apply)
          + labels            = {
              + "hub.jupyter.org/node-purpose" = "core"
              + "k8s.dask.org/node-purpose"    = "core"
            }
          + local_ssd_count   = (known after apply)
          + logging_variant   = "DEFAULT"
          + machine_type      = "n2-highmem-4"
          + metadata          = (known after apply)
          + min_cpu_platform  = (known after apply)
          + oauth_scopes      = [
              + "https://www.googleapis.com/auth/cloud-platform",
            ]
          + preemptible       = false
          + service_account   = (known after apply)
          + spot              = false
          + tags              = []
          + taint             = (known after apply)
        }
    }

  # google_container_node_pool.dask_worker["worker"] will be created
  + resource "google_container_node_pool" "dask_worker" {
      + cluster                     = "hhmi-cluster"
      + id                          = (known after apply)
      + initial_node_count          = 0
      + instance_group_urls         = (known after apply)
      + location                    = "us-west2"
      + managed_instance_group_urls = (known after apply)
      + max_pods_per_node           = (known after apply)
      + name                        = "dask-worker"
      + name_prefix                 = (known after apply)
      + node_count                  = (known after apply)
      + node_locations              = (known after apply)
      + operation                   = (known after apply)
      + project                     = "hhmi"
      + version                     = (known after apply)

      + autoscaling {
          + location_policy = (known after apply)
          + max_node_count  = 200
          + min_node_count  = 0
        }

      + management {
          + auto_repair  = true
          + auto_upgrade = false
        }

      + node_config {
          + disk_size_gb      = (known after apply)
          + disk_type         = "pd-balanced"
          + guest_accelerator = (known after apply)
          + image_type        = (known after apply)
          + labels            = {
              + "k8s.dask.org/node-purpose" = "worker"
            }
          + local_ssd_count   = (known after apply)
          + logging_variant   = "DEFAULT"
          + machine_type      = "n2-highmem-16"
          + metadata          = (known after apply)
          + min_cpu_platform  = (known after apply)
          + oauth_scopes      = [
              + "https://www.googleapis.com/auth/cloud-platform",
            ]
          + preemptible       = true
          + service_account   = (known after apply)
          + spot              = false
          + tags              = []
          + taint             = [
              + {
                  + effect = "NO_SCHEDULE"
                  + key    = "k8s.dask.org_dedicated"
                  + value  = "worker"
                },
            ]

          + workload_metadata_config {
              + mode = "GKE_METADATA"
            }
        }
    }

  # google_container_node_pool.notebook["large"] will be created
  + resource "google_container_node_pool" "notebook" {
      + cluster                     = "hhmi-cluster"
      + id                          = (known after apply)
      + initial_node_count          = 0
      + instance_group_urls         = (known after apply)
      + location                    = "us-west2"
      + managed_instance_group_urls = (known after apply)
      + max_pods_per_node           = (known after apply)
      + name                        = "nb-large"
      + name_prefix                 = (known after apply)
      + node_count                  = (known after apply)
      + node_locations              = (known after apply)
      + operation                   = (known after apply)
      + project                     = "hhmi"
      + version                     = (known after apply)

      + autoscaling {
          + location_policy = (known after apply)
          + max_node_count  = 100
          + min_node_count  = 0
        }

      + management {
          + auto_repair  = true
          + auto_upgrade = false
        }

      + node_config {
          + disk_size_gb      = (known after apply)
          + disk_type         = "pd-balanced"
          + guest_accelerator = (known after apply)
          + image_type        = (known after apply)
          + labels            = {
              + "hub.jupyter.org/node-purpose" = "user"
              + "k8s.dask.org/node-purpose"    = "scheduler"
            }
          + local_ssd_count   = (known after apply)
          + logging_variant   = "DEFAULT"
          + machine_type      = "n2-highmem-64"
          + metadata          = (known after apply)
          + min_cpu_platform  = (known after apply)
          + oauth_scopes      = [
              + "https://www.googleapis.com/auth/cloud-platform",
            ]
          + preemptible       = false
          + service_account   = (known after apply)
          + spot              = false
          + tags              = []
          + taint             = [
              + {
                  + effect = "NO_SCHEDULE"
                  + key    = "hub.jupyter.org_dedicated"
                  + value  = "user"
                },
            ]

          + workload_metadata_config {
              + mode = "GKE_METADATA"
            }
        }
    }

  # google_container_node_pool.notebook["medium"] will be created
  + resource "google_container_node_pool" "notebook" {
      + cluster                     = "hhmi-cluster"
      + id                          = (known after apply)
      + initial_node_count          = 0
      + instance_group_urls         = (known after apply)
      + location                    = "us-west2"
      + managed_instance_group_urls = (known after apply)
      + max_pods_per_node           = (known after apply)
      + name                        = "nb-medium"
      + name_prefix                 = (known after apply)
      + node_count                  = (known after apply)
      + node_locations              = (known after apply)
      + operation                   = (known after apply)
      + project                     = "hhmi"
      + version                     = (known after apply)

      + autoscaling {
          + location_policy = (known after apply)
          + max_node_count  = 100
          + min_node_count  = 0
        }

      + management {
          + auto_repair  = true
          + auto_upgrade = false
        }

      + node_config {
          + disk_size_gb      = (known after apply)
          + disk_type         = "pd-balanced"
          + guest_accelerator = (known after apply)
          + image_type        = (known after apply)
          + labels            = {
              + "hub.jupyter.org/node-purpose" = "user"
              + "k8s.dask.org/node-purpose"    = "scheduler"
            }
          + local_ssd_count   = (known after apply)
          + logging_variant   = "DEFAULT"
          + machine_type      = "n2-highmem-16"
          + metadata          = (known after apply)
          + min_cpu_platform  = (known after apply)
          + oauth_scopes      = [
              + "https://www.googleapis.com/auth/cloud-platform",
            ]
          + preemptible       = false
          + service_account   = (known after apply)
          + spot              = false
          + tags              = []
          + taint             = [
              + {
                  + effect = "NO_SCHEDULE"
                  + key    = "hub.jupyter.org_dedicated"
                  + value  = "user"
                },
            ]

          + workload_metadata_config {
              + mode = "GKE_METADATA"
            }
        }
    }

  # google_container_node_pool.notebook["small"] will be created
  + resource "google_container_node_pool" "notebook" {
      + cluster                     = "hhmi-cluster"
      + id                          = (known after apply)
      + initial_node_count          = 0
      + instance_group_urls         = (known after apply)
      + location                    = "us-west2"
      + managed_instance_group_urls = (known after apply)
      + max_pods_per_node           = (known after apply)
      + name                        = "nb-small"
      + name_prefix                 = (known after apply)
      + node_count                  = (known after apply)
      + node_locations              = (known after apply)
      + operation                   = (known after apply)
      + project                     = "hhmi"
      + version                     = (known after apply)

      + autoscaling {
          + location_policy = (known after apply)
          + max_node_count  = 100
          + min_node_count  = 0
        }

      + management {
          + auto_repair  = true
          + auto_upgrade = false
        }

      + node_config {
          + disk_size_gb      = (known after apply)
          + disk_type         = "pd-balanced"
          + guest_accelerator = (known after apply)
          + image_type        = (known after apply)
          + labels            = {
              + "hub.jupyter.org/node-purpose" = "user"
              + "k8s.dask.org/node-purpose"    = "scheduler"
            }
          + local_ssd_count   = (known after apply)
          + logging_variant   = "DEFAULT"
          + machine_type      = "n2-highmem-4"
          + metadata          = (known after apply)
          + min_cpu_platform  = (known after apply)
          + oauth_scopes      = [
              + "https://www.googleapis.com/auth/cloud-platform",
            ]
          + preemptible       = false
          + service_account   = (known after apply)
          + spot              = false
          + tags              = []
          + taint             = [
              + {
                  + effect = "NO_SCHEDULE"
                  + key    = "hub.jupyter.org_dedicated"
                  + value  = "user"
                },
            ]

          + workload_metadata_config {
              + mode = "GKE_METADATA"
            }
        }
    }

  # google_filestore_instance.homedirs[0] will be created
  + resource "google_filestore_instance" "homedirs" {
      + create_time = (known after apply)
      + etag        = (known after apply)
      + id          = (known after apply)
      + location    = "us-west2"
      + name        = "hhmi-homedirs"
      + project     = "hhmi"
      + tier        = "BASIC_HDD"
      + zone        = (known after apply)

      + file_shares {
          + capacity_gb   = 1024
          + name          = "homes"
          + source_backup = (known after apply)
        }

      + networks {
          + connect_mode      = "DIRECT_PEERING"
          + ip_addresses      = (known after apply)
          + modes             = [
              + "MODE_IPV4",
            ]
          + network           = "default"
          + reserved_ip_range = (known after apply)
        }
    }

  # google_monitoring_alert_policy.disk_space_full_alert will be created
  + resource "google_monitoring_alert_policy" "disk_space_full_alert" {
      + combiner              = "OR"
      + creation_record       = (known after apply)
      + display_name          = "Available disk space < 10% on hhmi"
      + enabled               = true
      + id                    = (known after apply)
      + name                  = (known after apply)
      + notification_channels = (known after apply)
      + project               = "hhmi"

      + conditions {
          + display_name = "Simple Health Check Endpoint"
          + name         = (known after apply)

          + condition_threshold {
              + comparison      = "COMPARISON_LT"
              + duration        = "300s"
              + filter          = <<-EOT
                    resource.type = "filestore_instance"
                    AND metric.type = "file.googleapis.com/nfs/server/free_bytes_percent"
                EOT
              + threshold_value = 10

              + aggregations {
                  + alignment_period   = "300s"
                  + per_series_aligner = "ALIGN_MEAN"
                }
            }
        }
    }

  # google_monitoring_notification_channel.pagerduty_disk_space will be created
  + resource "google_monitoring_notification_channel" "pagerduty_disk_space" {
      + display_name        = "PagerDuty Disk Space Alerts"
      + enabled             = true
      + force_delete        = false
      + id                  = (known after apply)
      + name                = (known after apply)
      + project             = "hhmi"
      + type                = "pagerduty"
      + verification_status = (known after apply)

      + sensitive_labels {
          + service_key = (sensitive value)
        }
    }

  # google_project_iam_custom_role.requestor_pays will be created
  + resource "google_project_iam_custom_role" "requestor_pays" {
      + deleted     = (known after apply)
      + description = "Minimal role for hub users on hhmi to identify as current project"
      + id          = (known after apply)
      + name        = (known after apply)
      + permissions = [
          + "serviceusage.services.use",
        ]
      + project     = "hhmi"
      + role_id     = "hhmi_requestor_pays"
      + stage       = "GA"
      + title       = "Identify as project role for users in hhmi"
    }

  # google_project_iam_member.cd_sa_roles["roles/artifactregistry.writer"] will be created
  + resource "google_project_iam_member" "cd_sa_roles" {
      + etag    = (known after apply)
      + id      = (known after apply)
      + member  = (known after apply)
      + project = "hhmi"
      + role    = "roles/artifactregistry.writer"
    }

  # google_project_iam_member.cd_sa_roles["roles/container.admin"] will be created
  + resource "google_project_iam_member" "cd_sa_roles" {
      + etag    = (known after apply)
      + id      = (known after apply)
      + member  = (known after apply)
      + project = "hhmi"
      + role    = "roles/container.admin"
    }

  # google_project_iam_member.cluster_sa_roles["roles/artifactregistry.reader"] will be created
  + resource "google_project_iam_member" "cluster_sa_roles" {
      + etag    = (known after apply)
      + id      = (known after apply)
      + member  = (known after apply)
      + project = "hhmi"
      + role    = "roles/artifactregistry.reader"
    }

  # google_project_iam_member.cluster_sa_roles["roles/logging.logWriter"] will be created
  + resource "google_project_iam_member" "cluster_sa_roles" {
      + etag    = (known after apply)
      + id      = (known after apply)
      + member  = (known after apply)
      + project = "hhmi"
      + role    = "roles/logging.logWriter"
    }

  # google_project_iam_member.cluster_sa_roles["roles/monitoring.metricWriter"] will be created
  + resource "google_project_iam_member" "cluster_sa_roles" {
      + etag    = (known after apply)
      + id      = (known after apply)
      + member  = (known after apply)
      + project = "hhmi"
      + role    = "roles/monitoring.metricWriter"
    }

  # google_project_iam_member.cluster_sa_roles["roles/monitoring.viewer"] will be created
  + resource "google_project_iam_member" "cluster_sa_roles" {
      + etag    = (known after apply)
      + id      = (known after apply)
      + member  = (known after apply)
      + project = "hhmi"
      + role    = "roles/monitoring.viewer"
    }

  # google_project_iam_member.cluster_sa_roles["roles/stackdriver.resourceMetadata.writer"] will be created
  + resource "google_project_iam_member" "cluster_sa_roles" {
      + etag    = (known after apply)
      + id      = (known after apply)
      + member  = (known after apply)
      + project = "hhmi"
      + role    = "roles/stackdriver.resourceMetadata.writer"
    }

  # google_service_account.cd_sa will be created
  + resource "google_service_account" "cd_sa" {
      + account_id   = "hhmi-cd-sa"
      + disabled     = false
      + display_name = "Continuous Deployment SA for hhmi"
      + email        = (known after apply)
      + id           = (known after apply)
      + member       = (known after apply)
      + name         = (known after apply)
      + project      = "hhmi"
      + unique_id    = (known after apply)
    }

  # google_service_account.cluster_sa will be created
  + resource "google_service_account" "cluster_sa" {
      + account_id   = "hhmi-cluster-sa"
      + disabled     = false
      + display_name = "Service account used by nodes of cluster hhmi"
      + email        = (known after apply)
      + id           = (known after apply)
      + member       = (known after apply)
      + name         = (known after apply)
      + project      = "hhmi"
      + unique_id    = (known after apply)
    }

  # google_service_account_key.cd_sa will be created
  + resource "google_service_account_key" "cd_sa" {
      + id                 = (known after apply)
      + key_algorithm      = "KEY_ALG_RSA_2048"
      + name               = (known after apply)
      + private_key        = (sensitive value)
      + private_key_type   = "TYPE_GOOGLE_CREDENTIALS_FILE"
      + public_key         = (known after apply)
      + public_key_type    = "TYPE_X509_PEM_FILE"
      + service_account_id = (known after apply)
      + valid_after        = (known after apply)
      + valid_before       = (known after apply)
    }

Plan: 20 to add, 0 to change, 0 to destroy.

Changes to Outputs:
  + buckets                             = {}
  + ci_deployer_key                     = (sensitive value)
  + kubernetes_sa_annotations           = {}
  + registry_sa_keys                    = (sensitive value)
  + regular_channel_latest_k8s_versions = {
      + "1."    = "1.27.4-gke.900"
      + "1.24." = "1.24.16-gke.500"
      + "1.25." = "1.25.12-gke.500"
      + "1.26." = "1.26.7-gke.500"
      + "1.27." = "1.27.4-gke.900"
    }

@GeorgianaElena GeorgianaElena merged commit 01b685c into 2i2c-org:master Sep 14, 2023
2 checks passed
@GeorgianaElena GeorgianaElena deleted the hhmi-hub-cluster branch September 14, 2023 07:55