terraform-azurerm-dr-infra

Terraform module to create Azure Cloud infrastructure resources required to run DataRobot.

Usage

module "datarobot_infra" {
  source = "datarobot-oss/dr-infra/azurerm"

  name        = "datarobot"
  domain_name = "yourdomain.com"
  location    = "eastus"

  create_resource_group          = true
  create_network                 = true
  network_address_space          = "10.7.0.0/16"
  existing_public_dns_zone_id    = "/subscriptions/subscription-id/resourceGroups/existing-resource-group-name/providers/Microsoft.Network/dnszones/yourdomain.com"
  create_storage                 = true
  existing_container_registry_id = "/subscriptions/subscription-id/resourceGroups/existing-resource-group-name/providers/Microsoft.ContainerRegistry/registries/existing-acr-name"
  create_kubernetes_cluster      = true
  create_app_identity            = true

  ingress_nginx                          = true
  internet_facing_ingress_lb             = true
  cert_manager                           = true
  cert_manager_letsencrypt_email_address = "youremail@yourdomain.com"
  external_dns                           = true
  nvidia_device_plugin                   = true
  descheduler                            = true

  tags = {
    application = "datarobot"
    environment = "dev"
    managed-by  = "terraform"
  }
}

Examples

  • Complete - Demonstrates all input variables
  • Partial - Demonstrates the use of existing resources
  • Minimal - Demonstrates the minimum set of input variables needed to deploy all infrastructure

Using an example directly from source

  1. Clone the repo
git clone https://github.com/datarobot-oss/terraform-azurerm-dr-infra.git
  2. Change directories into the example that best suits your needs
cd terraform-azurerm-dr-infra/examples/minimal
  3. Modify main.tf as needed
  4. Run terraform commands
terraform init
terraform plan
terraform apply
terraform destroy

Module Descriptions

Resource Group

Toggle

  • create_resource_group to create a new Azure Resource Group
  • existing_resource_group_name to use an existing resource group

Description

Create a new Azure Resource Group to put all created resources in.

Permissions

Contributor

Network

Toggle

  • create_network to create a new Azure Virtual Network
  • existing_vnet_id to use an existing VNet

Description

Create a new Azure Virtual Network (VNet) with one subnet and a NAT gateway with a Public IP attached.
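
As a minimal sketch of the existing-VNet path, the block below passes existing_vnet_id together with existing_kubernetes_nodes_subnet_id, which the Inputs table lists as required when an existing VNet is used. The resource IDs are placeholders, not real resources.

```hcl
module "datarobot_infra" {
  source = "datarobot-oss/dr-infra/azurerm"

  name     = "datarobot"
  location = "eastus"

  # Reuse an existing VNet; other network variables are then ignored.
  # Both IDs are placeholders and must be replaced with real resource IDs.
  existing_vnet_id                    = "/subscriptions/subscription-id/resourceGroups/existing-resource-group-name/providers/Microsoft.Network/virtualNetworks/existing-vnet-name"
  existing_kubernetes_nodes_subnet_id = "/subscriptions/subscription-id/resourceGroups/existing-resource-group-name/providers/Microsoft.Network/virtualNetworks/existing-vnet-name/subnets/existing-subnet-name"
}
```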

Permissions

Network Contributor

DNS

Toggle

  • create_dns_zones to create new Azure DNS zones
  • existing_public_dns_zone_id / existing_private_dns_zone_id to use an existing Azure DNS zone

Description

Create new public and/or private DNS zones with name domain_name.

A public Azure DNS zone is used by external_dns to create records for the DataRobot ingress resources when internet_facing_ingress_lb is true. It is also used for DNS validation when using cert_manager and cert_manager_letsencrypt_clusterissuers.

A private Azure DNS zone is used by external_dns to create records for the DataRobot ingress resources when internet_facing_ingress_lb is false.
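
The sketch below illustrates the private-zone case: an internal load balancer with external_dns writing records to an existing private zone. It assumes Let's Encrypt ClusterIssuers are not wanted, since the Inputs table states that a public zone is required when they are enabled. The zone ID is a placeholder.

```hcl
module "datarobot_infra" {
  source = "datarobot-oss/dr-infra/azurerm"

  name        = "datarobot"
  location    = "eastus"
  domain_name = "yourdomain.com"

  # Use an existing private zone instead of creating new zones (placeholder ID).
  create_dns_zones             = false
  existing_private_dns_zone_id = "/subscriptions/subscription-id/resourceGroups/existing-resource-group-name/providers/Microsoft.Network/privateDnsZones/yourdomain.com"

  # Internal load balancer, so external_dns targets the private zone.
  ingress_nginx              = true
  internet_facing_ingress_lb = false
  external_dns               = true

  # No public zone is supplied, so skip the Let's Encrypt ClusterIssuers.
  cert_manager_letsencrypt_clusterissuers = false
}
```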

Permissions

  • DNS Zone Contributor
  • Private DNS Zone Contributor

Storage

Toggle

  • create_storage to create a new Azure Storage Account and Container
  • existing_storage_account_id to use an existing Azure Storage Account

Description

Create a new Azure Storage Account and Container with public internet access allowed by default and PrivateLink access from within the VNet. Network access to the storage account can be managed via the storage_public_network_access_enabled, storage_network_rules_default_action, storage_public_ip_allow_list, and storage_virtual_network_subnet_ids variables.

The DataRobot application will use this storage account for persistent file storage.
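
One way to tighten storage access, sketched with the variables named above: keep public network access enabled but deny by default and allow a single public range. The CIDR is a documentation placeholder (203.0.113.0/24) and should be replaced with your own range.

```hcl
module "datarobot_infra" {
  source = "datarobot-oss/dr-infra/azurerm"

  name     = "datarobot"
  location = "eastus"

  create_storage = true

  # Deny traffic unless it matches an allow rule.
  storage_public_network_access_enabled = true
  storage_network_rules_default_action  = "Deny"

  # Placeholder range; /31, /32, and RFC 1918 ranges are not allowed here.
  storage_public_ip_allow_list = ["203.0.113.0/24"]
}
```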

Permissions

Storage Account Contributor

Container Registry

Toggle

  • create_container_registry to create a new Azure Container Registry
  • existing_container_registry_id to use an existing Azure Container Registry

Description

Create a new Azure Container Registry with public internet access allowed by default and PrivateLink access from within the VNet. Network access to the ACR can be managed via the container_registry_public_network_access_enabled, container_registry_network_rules_default_action, and container_registry_ip_allow_list variables.

The DataRobot application will use this registry to host custom images created by various services.
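
The registry variables can be combined in the same deny-by-default pattern. This is a sketch only; the CIDR is a documentation placeholder.

```hcl
module "datarobot_infra" {
  source = "datarobot-oss/dr-infra/azurerm"

  name     = "datarobot"
  location = "eastus"

  create_container_registry = true

  # Deny by default and allow one IPv4 range (placeholder value).
  container_registry_public_network_access_enabled = true
  container_registry_network_rules_default_action  = "Deny"
  container_registry_ip_allow_list                 = ["203.0.113.0/24"]
}
```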

Permissions

TBD

Kubernetes

Toggle

  • create_kubernetes_cluster to create a new Azure Kubernetes Service Cluster
  • existing_aks_cluster_name to use an existing AKS cluster

Description

Create a new AKS cluster to host the DataRobot application and any other helm charts installed by this module.

The AKS cluster Kubernetes API endpoint can either be made available over the public internet or privately to the VNet. When kubernetes_cluster_endpoint_public_access is false, the cluster API endpoint is only available from within the VNet via a PrivateLink. When kubernetes_cluster_endpoint_public_access is true, the cluster API endpoint is accessed via the public internet. This access can be restricted to specific IP addresses via the kubernetes_cluster_endpoint_public_access_cidrs variable.
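
A brief sketch of the two endpoint modes described above. The CIDR is a placeholder for the address range that should reach the API server.

```hcl
module "datarobot_infra" {
  source = "datarobot-oss/dr-infra/azurerm"

  name     = "datarobot"
  location = "eastus"

  create_kubernetes_cluster = true

  # Public endpoint restricted to one range (placeholder CIDR).
  kubernetes_cluster_endpoint_public_access       = true
  kubernetes_cluster_endpoint_public_access_cidrs = ["203.0.113.0/24"]

  # For a private-only endpoint, set the flag below to false instead;
  # the API server is then reachable only from within the VNet.
  # kubernetes_cluster_endpoint_public_access = false
}
```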

Two node groups are created:

  • A primary node group intended to host the majority of the DataRobot pods
  • A gpu node group intended to host GPU workload pods
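
Both node groups can be sized through the kubernetes_primary_nodepool_* and kubernetes_gpu_nodepool_* inputs. The counts below are illustrative values, not recommendations; the VM sizes are the defaults listed in the Inputs table.

```hcl
module "datarobot_infra" {
  source = "datarobot-oss/dr-infra/azurerm"

  name     = "datarobot"
  location = "eastus"

  # Primary node group (illustrative counts).
  kubernetes_primary_nodepool_vm_size    = "Standard_D32s_v4"
  kubernetes_primary_nodepool_node_count = 3
  kubernetes_primary_nodepool_min_count  = 3
  kubernetes_primary_nodepool_max_count  = 10

  # GPU node group that can scale down to zero when idle.
  kubernetes_gpu_nodepool_vm_size    = "Standard_NC4as_T4_v3"
  kubernetes_gpu_nodepool_node_count = 0
  kubernetes_gpu_nodepool_min_count  = 0
  kubernetes_gpu_nodepool_max_count  = 2
}
```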

By default, Azure uses 10.0.0.0/16 for Kubernetes services and 10.244.0.0/16 for Kubernetes pods. Ensure these do not conflict with your VNet address space by either specifying a different network_address_space for the VNet created by this module, or by specifying alternate kubernetes_pod_cidr and/or kubernetes_service_cidr as needed.
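
For example, a VNet on 10.0.0.0/16 would overlap the default service range. The sketch below moves the service CIDR to an arbitrary non-overlapping range chosen for illustration and, as the Inputs table requires, keeps kubernetes_dns_service_ip inside that range.

```hcl
module "datarobot_infra" {
  source = "datarobot-oss/dr-infra/azurerm"

  name     = "datarobot"
  location = "eastus"

  # This VNet range overlaps the default service CIDR (10.0.0.0/16).
  create_network        = true
  network_address_space = "10.0.0.0/16"

  # Move services to a range that does not overlap the VNet (illustrative values).
  kubernetes_service_cidr   = "172.16.0.0/16"
  kubernetes_dns_service_ip = "172.16.0.10"

  # The default pod range (10.244.0.0/16) does not overlap 10.0.0.0/16,
  # so kubernetes_pod_cidr is left unset here.
}
```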

Permissions

TBD

Helm Chart - ingress-nginx

Toggle

  • ingress_nginx to install the ingress-nginx helm chart

Description

Uses the terraform-helm-release module to install the ingress-nginx helm chart from the https://kubernetes.github.io/ingress-nginx repo into the ingress-nginx namespace.

The ingress-nginx helm chart will trigger the deployment of an Azure Standard Load Balancer directing traffic to the ingress-nginx-controller Kubernetes services.

Values passed to the helm chart can be overridden by passing a custom values file via the ingress_nginx_values variable as demonstrated in the complete example.
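
A sketch of the override mechanism, using the ingress_nginx_values and ingress_nginx_variables inputs. The template path and the replica_count key are hypothetical; the complete example in the repository shows the layout the authors actually use.

```hcl
module "datarobot_infra" {
  source = "datarobot-oss/dr-infra/azurerm"

  name     = "datarobot"
  location = "eastus"

  ingress_nginx = true

  # Hypothetical path to a values template kept next to this configuration.
  ingress_nginx_values = "${path.module}/templates/ingress_nginx_values.yaml"

  # Hypothetical variable made available inside that template.
  ingress_nginx_variables = {
    replica_count = 2
  }
}
```

The same pattern applies to the other charts through their *_values and *_variables inputs.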

Permissions

Not required

Helm Chart - cert-manager

Toggle

  • cert_manager to install the cert-manager helm chart

Description

Uses the terraform-helm-release module to install the cert-manager helm chart from the https://charts.jetstack.io repo into the cert-manager namespace.

A User Assigned Identity and a Federated Identity Credential are created for the cert-manager service account running in the cert-manager namespace, allowing it to create DNS resources within the specified DNS zone.

cert-manager can be used by the DataRobot application to create and manage various certificates, including the certificate for the application itself.

When cert_manager_letsencrypt_clusterissuers is enabled, letsencrypt-staging and letsencrypt-prod ClusterIssuers will be created which can be used by the datarobot-azure umbrella chart to issue certificates used by the DataRobot application. The default values in that helm chart (as of version 10.2) enable global.ingress.tls.enabled and global.ingress.tls.certmanager and set global.ingress.tls.issuer to letsencrypt-prod, so the letsencrypt-prod ClusterIssuer issues a public ACME certificate that serves as the TLS certificate for the Kubernetes ingress resources.
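
A minimal sketch of enabling the ClusterIssuers against an existing public zone, which cert-manager uses for DNS validation. The email address and zone ID are placeholders.

```hcl
module "datarobot_infra" {
  source = "datarobot-oss/dr-infra/azurerm"

  name        = "datarobot"
  location    = "eastus"
  domain_name = "yourdomain.com"

  # Public zone used for DNS validation (placeholder ID).
  existing_public_dns_zone_id = "/subscriptions/subscription-id/resourceGroups/existing-resource-group-name/providers/Microsoft.Network/dnszones/yourdomain.com"

  cert_manager                            = true
  cert_manager_letsencrypt_clusterissuers = true

  # Placeholder contact address for Let's Encrypt notifications.
  cert_manager_letsencrypt_email_address = "youremail@yourdomain.com"
}
```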

Values passed to the helm chart can be overridden by passing a custom values file via the cert_manager_values variable as demonstrated in the complete example.

Permissions

TBD

Helm Chart - external-dns

Toggle

  • external_dns to install the external-dns helm chart

Description

Uses the terraform-helm-release module to install the external-dns helm chart from the https://charts.bitnami.com/bitnami repo into the external-dns namespace.

A User Assigned Identity and a Federated Identity Credential are created for the external-dns service account running in the external-dns namespace, allowing it to create DNS resources within the specified DNS zone.

external-dns is used to automatically create DNS records for ingress resources in the Kubernetes cluster. When the DataRobot application is installed and the ingress resources are created, external-dns will automatically create a DNS record pointing at the ingress resource.

Values passed to the helm chart can be overridden by passing a custom values file via the external_dns_values variable as demonstrated in the complete example.

Permissions

TBD

Helm Chart - nvidia-device-plugin

Toggle

  • nvidia_device_plugin to install the nvidia-device-plugin helm chart

Description

Uses the terraform-helm-release module to install the nvidia-device-plugin helm chart from the https://nvidia.github.io/k8s-device-plugin repo into the nvidia-device-plugin namespace.

This helm chart is used to expose GPU resources on nodes intended for GPU workloads such as the default gpu node group.

Values passed to the helm chart can be overridden by passing a custom values file via the nvidia_device_plugin_values variable as demonstrated in the complete example.

Permissions

Not required

Helm Chart - descheduler

Toggle

  • descheduler to install the descheduler helm chart

Description

Uses the terraform-helm-release module to install the descheduler helm chart from the https://kubernetes-sigs.github.io/descheduler/ helm repo into the descheduler namespace.

This helm chart allows for automatic rescheduling of pods for optimizing resource consumption.

Permissions

Not required

Comprehensive Required Permissions

TBD

DataRobot versions

| Release | Supported DR Versions |
|---------|-----------------------|
| >= 1.0  | >= 10.0               |

Requirements

| Name      | Version   |
|-----------|-----------|
| terraform | >= 1.3.5  |
| azurerm   | >= 4.3.0  |
| helm      | >= 2.15.0 |
| kubectl   | >= 1.14.0 |
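
These constraints can be mirrored in the calling configuration. The sketch below covers only the azurerm and helm providers from the hashicorp namespace; the kubectl provider is omitted because its registry source is not stated here. The subscription ID is a placeholder (azurerm 4.x expects one, either in the provider block or via the ARM_SUBSCRIPTION_ID environment variable).

```hcl
terraform {
  required_version = ">= 1.3.5"

  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = ">= 4.3.0"
    }
    helm = {
      source  = "hashicorp/helm"
      version = ">= 2.15.0"
    }
  }
}

provider "azurerm" {
  # The features block is mandatory, even when empty.
  features {}

  # Placeholder subscription ID.
  subscription_id = "00000000-0000-0000-0000-000000000000"
}
```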

Providers

| Name    | Version  |
|---------|----------|
| azurerm | >= 4.3.0 |

Modules

| Name                 | Source                         | Version |
|----------------------|--------------------------------|---------|
| app_identity         | ./modules/app-identity         | n/a     |
| cert_manager         | ./modules/cert-manager         | n/a     |
| container_registry   | ./modules/container-registry   | n/a     |
| descheduler          | ./modules/descheduler          | n/a     |
| dns                  | ./modules/dns                  | n/a     |
| external_dns         | ./modules/external-dns         | n/a     |
| ingress_nginx        | ./modules/ingress-nginx        | n/a     |
| kubernetes           | ./modules/kubernetes           | n/a     |
| naming               | Azure/naming/azurerm           | ~> 0.4  |
| network              | ./modules/network              | n/a     |
| nvidia_device_plugin | ./modules/nvidia-device-plugin | n/a     |
| storage              | ./modules/storage              | n/a     |

Resources

| Name                                | Type        |
|-------------------------------------|-------------|
| azurerm_resource_group.this         | resource    |
| azurerm_kubernetes_cluster.existing | data source |
| azurerm_subscription.current        | data source |

Inputs

| Name | Description | Type | Default | Required |
|------|-------------|------|---------|----------|
| cert_manager | Install the cert-manager helm chart. All other cert_manager variables are ignored if this variable is false. | bool | true | no |
| cert_manager_letsencrypt_clusterissuers | Whether to create letsencrypt-prod and letsencrypt-staging ClusterIssuers | bool | true | no |
| cert_manager_letsencrypt_email_address | Email address for the certificate owner. Let's Encrypt will use this to contact you about expiring certificates, and issues related to your account. Only required if cert_manager_letsencrypt_clusterissuers is true. | string | "[email protected]" | no |
| cert_manager_values | Path to templatefile containing custom values for the cert-manager helm chart | string | "" | no |
| cert_manager_variables | Variables passed to the cert_manager_values templatefile | any | {} | no |
| container_registry_ip_allow_list | List of CIDR blocks to allow access to the container registry. Only IPv4 addresses are allowed | list(string) | [] | no |
| container_registry_network_rules_default_action | Specifies the default action of allow or deny when no other rules match | string | "Allow" | no |
| container_registry_public_network_access_enabled | Whether the public network access to the container registry is enabled | bool | true | no |
| create_app_identity | Create a new user assigned identity for the DataRobot application | bool | true | no |
| create_container_registry | Create a new Azure Container Registry. Ignored if an existing_container_registry_id is specified. | bool | true | no |
| create_dns_zones | Create DNS zones for domain_name. Ignored if existing_public_dns_zone_id and existing_private_dns_zone_id are specified. | bool | true | no |
| create_kubernetes_cluster | Create a new Azure Kubernetes Service cluster. All kubernetes and helm chart variables are ignored if this variable is false. | bool | true | no |
| create_network | Create a new Azure Virtual Network. Ignored if an existing_vnet_id is specified. | bool | true | no |
| create_resource_group | Create a new Azure resource group. Ignored if an existing_resource_group_name is specified. | bool | true | no |
| create_storage | Create a new Azure Storage account and container. Ignored if an existing_storage_account_id is specified. | bool | true | no |
| datarobot_namespace | Kubernetes namespace in which the DataRobot application will be installed | string | "dr-app" | no |
| datarobot_service_accounts | Names of the Kubernetes service accounts used by the DataRobot application | set(string) | ["dr", "build-service", "build-service-image-builder", "buzok-account", "dr-lrs-operator", "dynamic-worker", "internal-api-sa", "nbx-notebook-revisions-account", "prediction-server-sa", "tileservergl-sa"] | no |
| descheduler | Install the descheduler helm chart to enable rescheduling of pods. All other descheduler variables are ignored if this variable is false | bool | true | no |
| descheduler_values | Path to templatefile containing custom values for the descheduler helm chart | string | "" | no |
| descheduler_variables | Variables passed to the descheduler templatefile | any | {} | no |
| domain_name | Name of the domain to use for the DataRobot application. If create_dns_zones is true then zones will be created for this domain. It is also used by the cert-manager helm chart for DNS validation and as a domain filter by the external-dns helm chart. | string | "" | no |
| existing_aks_cluster_name | Name of existing AKS cluster to use. When specified, all other kubernetes variables will be ignored. | string | null | no |
| existing_container_registry_id | ID of existing container registry to use | string | "" | no |
| existing_kubernetes_nodes_subnet_id | ID of an existing subnet to use for the AKS node pools. Required when an existing_vnet_id is specified. Ignored if create_network is true and no existing_vnet_id is specified. | string | "" | no |
| existing_private_dns_zone_id | ID of existing private hosted zone to use for private DNS records created by external-dns. This is required when create_dns_zones is false and ingress_nginx is true with internet_facing_ingress_lb false. | string | "" | no |
| existing_public_dns_zone_id | ID of existing public hosted zone to use for public DNS records created by external-dns and public LetsEncrypt certificate validation by cert-manager. This is required when create_dns_zones is false and ingress_nginx and internet_facing_ingress_lb are true or when cert_manager and cert_manager_letsencrypt_clusterissuers are true. | string | "" | no |
| existing_resource_group_name | Name of existing resource group to use | string | "" | no |
| existing_storage_account_id | ID of existing Azure Storage Account to use for DataRobot file storage. When specified, all other storage variables will be ignored. | string | "" | no |
| existing_vnet_id | ID of an existing VNet to use. When specified, other network variables are ignored. | string | "" | no |
| external_dns | Install the external_dns helm chart to create DNS records for ingress resources matching the domain_name variable. All other external_dns variables are ignored if this variable is false. | bool | true | no |
| external_dns_values | Path to templatefile containing custom values for the external-dns helm chart | string | "" | no |
| external_dns_variables | Variables passed to the external_dns_values templatefile | any | {} | no |
| ingress_nginx | Install the ingress-nginx helm chart to use as the ingress controller for the AKS cluster. All other ingress_nginx variables are ignored if this variable is false. | bool | true | no |
| ingress_nginx_values | Path to templatefile containing custom values for the ingress-nginx helm chart | string | "" | no |
| ingress_nginx_variables | Variables passed to the ingress_nginx_values templatefile | any | {} | no |
| internet_facing_ingress_lb | Determines the type of Standard Load Balancer created for AKS ingress. If true, a public Standard Load Balancer will be created. If false, an internal Standard Load Balancer will be created. | bool | true | no |
| kubernetes_cluster_endpoint_public_access | Whether or not the Kubernetes API endpoint should be exposed to the public internet. When false, the cluster endpoint is only available internally to the virtual network. | bool | true | no |
| kubernetes_cluster_endpoint_public_access_cidrs | List of CIDR blocks which can access the Kubernetes API server endpoint | list(string) | [] | no |
| kubernetes_cluster_version | AKS cluster version | string | null | no |
| kubernetes_dns_service_ip | IP address within the Kubernetes service address range that will be used by cluster service discovery (kube-dns) | string | null | no |
| kubernetes_gpu_nodepool_labels | A map of Kubernetes labels to apply to the GPU node pool | map(string) | {"datarobot.com/node-capability": "gpu"} | no |
| kubernetes_gpu_nodepool_max_count | Maximum number of nodes in the GPU node pool | number | 10 | no |
| kubernetes_gpu_nodepool_min_count | Minimum number of nodes in the GPU node pool | number | 0 | no |
| kubernetes_gpu_nodepool_name | Name of the GPU node pool | string | "gpu" | no |
| kubernetes_gpu_nodepool_node_count | Node count of the GPU node pool | number | 0 | no |
| kubernetes_gpu_nodepool_taints | A list of Kubernetes taints to apply to the GPU node pool | list(string) | ["nvidia.com/gpu=true:NoSchedule"] | no |
| kubernetes_gpu_nodepool_vm_size | VM size used for the GPU node pool | string | "Standard_NC4as_T4_v3" | no |
| kubernetes_nodepool_availability_zones | Availability zones to use for the AKS node pools | set(string) | ["1", "2", "3"] | no |
| kubernetes_pod_cidr | The CIDR to use for Kubernetes pod IP addresses | string | null | no |
| kubernetes_primary_nodepool_labels | A map of Kubernetes labels to apply to the primary node pool | map(string) | {"datarobot.com/node-capability": "cpu"} | no |
| kubernetes_primary_nodepool_max_count | Maximum number of nodes in the primary node pool | number | 10 | no |
| kubernetes_primary_nodepool_min_count | Minimum number of nodes in the primary node pool | number | 1 | no |
| kubernetes_primary_nodepool_name | Name of the primary node pool | string | "primary" | no |
| kubernetes_primary_nodepool_node_count | Node count of the primary node pool | number | 1 | no |
| kubernetes_primary_nodepool_taints | A list of Kubernetes taints to apply to the primary node pool | list(string) | [] | no |
| kubernetes_primary_nodepool_vm_size | VM size used for the primary node pool | string | "Standard_D32s_v4" | no |
| kubernetes_service_cidr | The CIDR to use for Kubernetes service IP addresses | string | null | no |
| location | Azure location to create resources in | string | n/a | yes |
| name | Name to use as a prefix for created resources | string | n/a | yes |
| network_address_space | CIDR block to be used for the new VNet. By default, AKS uses 10.0.0.0/16 for services and 10.244.0.0/16 for pods. This should not overlap with the kubernetes_service_cidr or kubernetes_pod_cidr variables. | string | "10.1.0.0/16" | no |
| nvidia_device_plugin | Install the nvidia-device-plugin helm chart to expose node GPU resources to the AKS cluster. All other nvidia_device_plugin variables are ignored if this variable is false. | bool | true | no |
| nvidia_device_plugin_values | Path to templatefile containing custom values for the nvidia-device-plugin helm chart | string | "" | no |
| nvidia_device_plugin_variables | Variables passed to the nvidia_device_plugin_values templatefile | any | {} | no |
| storage_account_replication_type | Storage account data replication type as described in https://learn.microsoft.com/en-us/azure/storage/common/storage-redundancy | string | "ZRS" | no |
| storage_network_rules_default_action | Specifies the default action of the storage firewall to allow or deny when no other rules match | string | "Allow" | no |
| storage_public_ip_allow_list | List of public IP or IP ranges in CIDR Format which are allowed to access the storage account. Only IPv4 addresses are allowed. /31 CIDRs, /32 CIDRs, and Private IP address ranges (as defined in RFC 1918), are not allowed. Ignored if storage_public_network_access_enabled is false. | list(string) | [] | no |
| storage_public_network_access_enabled | Whether the public network access to the storage account is enabled | bool | true | no |
| storage_virtual_network_subnet_ids | List of resource IDs for subnets which are allowed to access the storage account | list(string) | null | no |
| tags | A map of tags to add to all created resources | map(string) | {"managed-by": "terraform"} | no |

Outputs

| Name | Description |
|------|-------------|
| aks_cluster_id | ID of the Azure Kubernetes Service cluster |
| container_registry_admin_password | Admin password of the container registry |
| container_registry_admin_username | Admin username of the container registry |
| container_registry_id | ID of the container registry |
| container_registry_login_server | The URL that can be used to log into the container registry |
| private_zone_id | ID of the private zone |
| public_zone_id | ID of the public zone |
| resource_group_id | The ID of the Resource Group |
| storage_access_key | The primary access key for the storage account |
| storage_account_name | Name of the storage account |
| storage_container_name | Name of the storage container |
| user_assigned_identity_client_id | Client ID of the user assigned identity |
| user_assigned_identity_id | ID of the user assigned identity |
| user_assigned_identity_name | Name of the user assigned identity |
| user_assigned_identity_principal_id | Principal ID of the user assigned identity |
| user_assigned_identity_tenant_id | Tenant ID of the user assigned identity |
| vnet_id | The ID of the VNet |
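
Outputs can be re-exported from the calling configuration. The sketch below assumes the module block is named datarobot_infra, as in the Usage section; the output names on the left are arbitrary. Marking the access key as sensitive is an assumption about how the module declares it, and Terraform requires the flag if the underlying value is sensitive.

```hcl
# Registry endpoint, e.g. for pushing images.
output "acr_login_server" {
  value = module.datarobot_infra.container_registry_login_server
}

# Storage details needed when configuring the DataRobot application.
output "storage_account_name" {
  value = module.datarobot_infra.storage_account_name
}

output "storage_access_key" {
  value     = module.datarobot_infra.storage_access_key
  sensitive = true # keeps the key out of plan and apply output
}
```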
