Terraform module to create Azure Cloud infrastructure resources required to run DataRobot.
## Usage

```hcl
module "datarobot_infra" {
  source = "datarobot-oss/dr-infra/azurerm"

  name        = "datarobot"
  domain_name = "yourdomain.com"
  location    = "eastus"

  create_resource_group          = true
  create_network                 = true
  network_address_space          = "10.7.0.0/16"
  existing_public_dns_zone_id    = "/subscriptions/subscription-id/resourceGroups/existing-resource-group-name/providers/Microsoft.Network/dnszones/yourdomain.com"
  create_storage                 = true
  existing_container_registry_id = "/subscriptions/subscription-id/resourceGroups/existing-resource-group-name/providers/Microsoft.ContainerRegistry/registries/existing-acr-name"
  create_kubernetes_cluster      = true
  create_app_identity            = true

  ingress_nginx                          = true
  internet_facing_ingress_lb             = true
  cert_manager                           = true
  cert_manager_letsencrypt_email_address = "[email protected]"
  external_dns                           = true
  nvidia_device_plugin                   = true
  descheduler                            = true

  tags = {
    application = "datarobot"
    environment = "dev"
    managed-by  = "terraform"
  }
}
```
## Examples

- **Complete** - Demonstrates all input variables
- **Partial** - Demonstrates the use of existing resources
- **Minimal** - Demonstrates the minimum set of input variables needed to deploy all infrastructure
- Clone the repo

  ```bash
  git clone https://github.com/datarobot-oss/terraform-azurerm-dr-infra.git
  ```

- Change directories into the example that best suits your needs

  ```bash
  cd terraform-azurerm-dr-infra/examples/minimal
  ```

- Modify `main.tf` as needed

- Run terraform commands

  ```bash
  terraform init
  terraform plan
  terraform apply
  terraform destroy
  ```
## Resource Group

Use `create_resource_group` to create a new Azure Resource Group or `existing_resource_group_name` to use an existing resource group.

Creates a new Azure Resource Group that all created resources are placed in.

Required permissions: `Contributor`
## Network

Use `create_network` to create a new Azure Virtual Network or `existing_vnet_id` to use an existing VNet.

Creates a new Azure Virtual Network (VNet) with one subnet and a NAT gateway with a Public IP attached.

Required permissions: `Network Contributor`
## DNS

Use `create_dns_zones` to create new Azure DNS zones or `existing_public_dns_zone_id`/`existing_private_dns_zone_id` to use existing Azure DNS zones.

Creates new public and/or private DNS zones named `domain_name`.

A public DNS zone is used by `external-dns` to create records for the DataRobot ingress resources when `internet_facing_ingress_lb` is `true`. It is also used for DNS validation when `cert_manager` and `cert_manager_letsencrypt_clusterissuers` are enabled.

A private DNS zone is used by `external-dns` to create records for the DataRobot ingress resources when `internet_facing_ingress_lb` is `false`.

Required permissions: `DNS Zone Contributor`, `Private DNS Zone Contributor`
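
These DNS inputs can be combined for a fully internal deployment. A minimal sketch, assuming an existing private zone (the zone ID below is a placeholder and the other required module inputs are elided):

```hcl
module "datarobot_infra" {
  source = "datarobot-oss/dr-infra/azurerm"

  # ... other required inputs ...

  # Keep the ingress load balancer internal to the VNet
  internet_facing_ingress_lb = false

  # Skip zone creation and reuse an existing private zone
  create_dns_zones             = false
  existing_private_dns_zone_id = "/subscriptions/subscription-id/resourceGroups/rg-name/providers/Microsoft.Network/privateDnsZones/yourdomain.com"
}
```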
## Storage

Use `create_storage` to create a new Azure Storage Account and Container or `existing_storage_account_id` to use an existing Azure Storage Account.

Creates a new Azure Storage Account and Container with public internet access allowed by default and PrivateLink access from within the VNet. Network access to the storage account can be managed via the `storage_public_network_access_enabled`, `storage_network_rules_default_action`, `storage_public_ip_allow_list`, and `storage_virtual_network_subnet_ids` variables.
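
An illustrative sketch of tightening that default-allow posture (the CIDR and subnet ID below are placeholders, not values from this module):

```hcl
  # Inside the module block: deny by default, then allow a known
  # public CIDR and a specific subnet (both values illustrative)
  storage_public_network_access_enabled = true
  storage_network_rules_default_action  = "Deny"
  storage_public_ip_allow_list          = ["203.0.113.0/24"]
  storage_virtual_network_subnet_ids = [
    "/subscriptions/subscription-id/resourceGroups/rg-name/providers/Microsoft.Network/virtualNetworks/vnet-name/subnets/subnet-name"
  ]
```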
The DataRobot application uses this storage account for persistent file storage.

Required permissions: `Storage Account Contributor`
## Container Registry

Use `create_container_registry` to create a new Azure Container Registry or `existing_container_registry_id` to use an existing Azure Container Registry.

Creates a new Azure Container Registry with public internet access allowed by default and PrivateLink access from within the VNet. Network access to the ACR can be managed via the `container_registry_public_network_access_enabled`, `container_registry_network_rules_default_action`, and `container_registry_ip_allow_list` variables.
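
Two hedged sketches of restricting registry access with these variables (the CIDR below is a placeholder):

```hcl
  # Option 1: rely solely on the PrivateLink endpoint
  container_registry_public_network_access_enabled = false

  # Option 2: keep public access but deny all except listed CIDRs
  # container_registry_network_rules_default_action = "Deny"
  # container_registry_ip_allow_list                = ["203.0.113.0/24"]
```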
The DataRobot application uses this registry to host custom images created by various services.

Required permissions: TBD
## Kubernetes

Use `create_kubernetes_cluster` to create a new Azure Kubernetes Service cluster or `existing_aks_cluster_name` to use an existing AKS cluster.

Creates a new AKS cluster to host the DataRobot application and any other helm charts installed by this module.

The AKS cluster Kubernetes API endpoint can be exposed to the public internet or kept private to the VNet. When `kubernetes_cluster_endpoint_public_access` is `false`, the cluster API endpoint is only reachable from within the VNet via a PrivateLink. When `kubernetes_cluster_endpoint_public_access` is `true`, the cluster API endpoint is reachable over the public internet; this access can be restricted to specific IP addresses via the `kubernetes_cluster_endpoint_public_access_cidrs` variable.
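
For example, the endpoint could stay public but be limited to a corporate egress range, or be made fully private (the CIDR below is illustrative):

```hcl
  # Public API endpoint, but only reachable from one CIDR
  kubernetes_cluster_endpoint_public_access       = true
  kubernetes_cluster_endpoint_public_access_cidrs = ["198.51.100.0/24"]

  # Or: private endpoint, reachable only from within the VNet
  # kubernetes_cluster_endpoint_public_access = false
```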
Two node pools are created:

- A `primary` node pool intended to host the majority of the DataRobot pods
- A `gpu` node pool intended to host GPU workload pods

By default, Azure uses `10.0.0.0/16` for Kubernetes services and `10.244.0.0/16` for Kubernetes pods. Ensure these do not conflict with your VNet address space by either specifying a different `network_address_space` for the VNet created by this module, or by specifying alternate `kubernetes_pod_cidr` and/or `kubernetes_service_cidr` values as needed.
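
A sketch of both approaches (the alternate ranges below are illustrative; note that `kubernetes_dns_service_ip`, if set, must fall within the service CIDR):

```hcl
  # Approach 1: move the VNet away from the AKS defaults
  network_address_space = "10.7.0.0/16"

  # Approach 2: move the cluster ranges instead
  # kubernetes_service_cidr   = "172.16.0.0/16"
  # kubernetes_dns_service_ip = "172.16.0.10"
  # kubernetes_pod_cidr       = "172.17.0.0/16"
```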
Required permissions: TBD
## ingress-nginx

Use `ingress_nginx` to install the `ingress-nginx` helm chart.

Uses the terraform-helm-release module to install the `ingress-nginx` helm chart from the https://kubernetes.github.io/ingress-nginx repo into the `ingress-nginx` namespace.

The `ingress-nginx` helm chart triggers the deployment of an Azure Standard Load Balancer that directs traffic to the `ingress-nginx-controller` Kubernetes service.

Values passed to the helm chart can be overridden by passing a custom values file via the `ingress_nginx_values` variable, as demonstrated in the complete example.
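
A sketch of that pattern (the template path, its contents, and the variable names below are hypothetical, not part of this module):

```hcl
module "datarobot_infra" {
  source = "datarobot-oss/dr-infra/azurerm"

  # ... other required inputs ...

  # Hypothetical templatefile rendered with ingress_nginx_variables
  ingress_nginx_values = "${path.module}/templates/ingress_nginx.tftpl"
  ingress_nginx_variables = {
    lb_subnet_name = "my-ingress-subnet"
  }
}
```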
Required permissions: None
## cert-manager

Use `cert_manager` to install the `cert-manager` helm chart.

Uses the terraform-helm-release module to install the `cert-manager` helm chart from the https://charts.jetstack.io repo into the `cert-manager` namespace.

A User Assigned Identity and Federated Identity Credential are created for the `cert-manager` service account running in the `cert-manager` namespace, allowing it to create DNS resources within the specified DNS zone.

`cert-manager` can be used by the DataRobot application to create and manage various certificates, including the application TLS certificate.

When `cert_manager_letsencrypt_clusterissuers` is enabled, `letsencrypt-staging` and `letsencrypt-prod` ClusterIssuers are created, which can be used by the `datarobot-azure` umbrella chart to issue certificates for the DataRobot application. The default values in that helm chart (as of version 10.2) enable `global.ingress.tls.enabled` and `global.ingress.tls.certmanager` and set `global.ingress.tls.issuer` to `letsencrypt-prod`, which uses the `letsencrypt-prod` ClusterIssuer to issue a public ACME certificate as the TLS certificate for the Kubernetes ingress resources.

Values passed to the helm chart can be overridden by passing a custom values file via the `cert_manager_values` variable, as demonstrated in the complete example.

Required permissions: TBD
## external-dns

Use `external_dns` to install the `external-dns` helm chart.

Uses the terraform-helm-release module to install the `external-dns` helm chart from the https://charts.bitnami.com/bitnami repo into the `external-dns` namespace.

A User Assigned Identity and Federated Identity Credential are created for the `external-dns` service account running in the `external-dns` namespace, allowing it to create DNS records within the specified DNS zone.

`external-dns` is used to automatically create DNS records for ingress resources in the Kubernetes cluster. When the DataRobot application is installed and its ingress resources are created, `external-dns` automatically creates DNS records pointing at them.

Values passed to the helm chart can be overridden by passing a custom values file via the `external_dns_values` variable, as demonstrated in the complete example.

Required permissions: TBD
## nvidia-device-plugin

Use `nvidia_device_plugin` to install the `nvidia-device-plugin` helm chart.

Uses the terraform-helm-release module to install the `nvidia-device-plugin` helm chart from the https://nvidia.github.io/k8s-device-plugin repo into the `nvidia-device-plugin` namespace.

This helm chart exposes GPU resources on nodes intended for GPU workloads, such as the default `gpu` node pool.

Values passed to the helm chart can be overridden by passing a custom values file via the `nvidia_device_plugin_values` variable, as demonstrated in the complete example.

Required permissions: None
## descheduler

Use `descheduler` to install the `descheduler` helm chart.

Uses the terraform-helm-release module to install the `descheduler` helm chart from the https://kubernetes-sigs.github.io/descheduler/ repo into the `descheduler` namespace.

This helm chart enables automatic rescheduling of pods to optimize resource consumption.

Required permissions: None
## DataRobot Versions

Release | Supported DR Versions |
---|---|
>= 1.0 | >= 10.0 |
## Requirements

Name | Version |
---|---|
terraform | >= 1.3.5 |
azurerm | >= 4.3.0 |
helm | >= 2.15.0 |
kubectl | >= 1.14.0 |
## Providers

Name | Version |
---|---|
azurerm | >= 4.3.0 |
## Modules

Name | Source | Version |
---|---|---|
app_identity | ./modules/app-identity | n/a |
cert_manager | ./modules/cert-manager | n/a |
container_registry | ./modules/container-registry | n/a |
descheduler | ./modules/descheduler | n/a |
dns | ./modules/dns | n/a |
external_dns | ./modules/external-dns | n/a |
ingress_nginx | ./modules/ingress-nginx | n/a |
kubernetes | ./modules/kubernetes | n/a |
naming | Azure/naming/azurerm | ~> 0.4 |
network | ./modules/network | n/a |
nvidia_device_plugin | ./modules/nvidia-device-plugin | n/a |
storage | ./modules/storage | n/a |
## Resources

Name | Type |
---|---|
azurerm_resource_group.this | resource |
azurerm_kubernetes_cluster.existing | data source |
azurerm_subscription.current | data source |
## Inputs

Name | Description | Type | Default | Required |
---|---|---|---|---|
cert_manager | Install the cert-manager helm chart. All other cert_manager variables are ignored if this variable is false. | bool | true | no |
cert_manager_letsencrypt_clusterissuers | Whether to create letsencrypt-prod and letsencrypt-staging ClusterIssuers | bool | true | no |
cert_manager_letsencrypt_email_address | Email address for the certificate owner. Let's Encrypt will use this to contact you about expiring certificates, and issues related to your account. Only required if cert_manager_letsencrypt_clusterissuers is true. | string | "[email protected]" | no |
cert_manager_values | Path to templatefile containing custom values for the cert-manager helm chart | string | "" | no |
cert_manager_variables | Variables passed to the cert_manager_values templatefile | any | {} | no |
container_registry_ip_allow_list | List of CIDR blocks to allow access to the container registry. Only IPv4 addresses are allowed | list(string) | [] | no |
container_registry_network_rules_default_action | Specifies the default action of allow or deny when no other rules match | string | "Allow" | no |
container_registry_public_network_access_enabled | Whether the public network access to the container registry is enabled | bool | true | no |
create_app_identity | Create a new user assigned identity for the DataRobot application | bool | true | no |
create_container_registry | Create a new Azure Container Registry. Ignored if an existing_container_registry_id is specified. | bool | true | no |
create_dns_zones | Create DNS zones for domain_name. Ignored if existing_public_dns_zone_id and existing_private_dns_zone_id are specified. | bool | true | no |
create_kubernetes_cluster | Create a new Azure Kubernetes Service cluster. All kubernetes and helm chart variables are ignored if this variable is false. | bool | true | no |
create_network | Create a new Azure Virtual Network. Ignored if an existing_vnet_id is specified. | bool | true | no |
create_resource_group | Create a new Azure resource group. Ignored if an existing_resource_group_name is specified. | bool | true | no |
create_storage | Create a new Azure Storage account and container. Ignored if an existing_storage_account_id is specified. | bool | true | no |
datarobot_namespace | Kubernetes namespace in which the DataRobot application will be installed | string | "dr-app" | no |
datarobot_service_accounts | Names of the Kubernetes service accounts used by the DataRobot application | set(string) | [ | no |
descheduler | Install the descheduler helm chart to enable rescheduling of pods. All other descheduler variables are ignored if this variable is false | bool | true | no |
descheduler_values | Path to templatefile containing custom values for the descheduler helm chart | string | "" | no |
descheduler_variables | Variables passed to the descheduler templatefile | any | {} | no |
domain_name | Name of the domain to use for the DataRobot application. If create_dns_zones is true then zones will be created for this domain. It is also used by the cert-manager helm chart for DNS validation and as a domain filter by the external-dns helm chart. | string | "" | no |
existing_aks_cluster_name | Name of existing AKS cluster to use. When specified, all other kubernetes variables will be ignored. | string | null | no |
existing_container_registry_id | ID of existing container registry to use | string | "" | no |
existing_kubernetes_nodes_subnet_id | ID of an existing subnet to use for the AKS node pools. Required when an existing_network_id is specified. Ignored if create_network is true and no existing_network_id is specified. | string | "" | no |
existing_private_dns_zone_id | ID of existing private hosted zone to use for private DNS records created by external-dns. This is required when create_dns_zones is false and ingress_nginx is true with internet_facing_ingress_lb false. | string | "" | no |
existing_public_dns_zone_id | ID of existing public hosted zone to use for public DNS records created by external-dns and public LetsEncrypt certificate validation by cert-manager. This is required when create_dns_zones is false and ingress_nginx and internet_facing_ingress_lb are true or when cert_manager and cert_manager_letsencrypt_clusterissuers are true. | string | "" | no |
existing_resource_group_name | Name of existing resource group to use | string | "" | no |
existing_storage_account_id | ID of existing Azure Storage Account to use for DataRobot file storage. When specified, all other storage variables will be ignored. | string | "" | no |
existing_vnet_id | ID of an existing VNet to use. When specified, other network variables are ignored. | string | "" | no |
external_dns | Install the external_dns helm chart to create DNS records for ingress resources matching the domain_name variable. All other external_dns variables are ignored if this variable is false. | bool | true | no |
external_dns_values | Path to templatefile containing custom values for the external-dns helm chart | string | "" | no |
external_dns_variables | Variables passed to the external_dns_values templatefile | any | {} | no |
ingress_nginx | Install the ingress-nginx helm chart to use as the ingress controller for the AKS cluster. All other ingress_nginx variables are ignored if this variable is false. | bool | true | no |
ingress_nginx_values | Path to templatefile containing custom values for the ingress-nginx helm chart | string | "" | no |
ingress_nginx_variables | Variables passed to the ingress_nginx_values templatefile | any | {} | no |
internet_facing_ingress_lb | Determines the type of Standard Load Balancer created for AKS ingress. If true, a public Standard Load Balancer will be created. If false, an internal Standard Load Balancer will be created. | bool | true | no |
kubernetes_cluster_endpoint_public_access | Whether or not the Kubernetes API endpoint should be exposed to the public internet. When false, the cluster endpoint is only available internally to the virtual network. | bool | true | no |
kubernetes_cluster_endpoint_public_access_cidrs | List of CIDR blocks which can access the Kubernetes API server endpoint | list(string) | [] | no |
kubernetes_cluster_version | AKS cluster version | string | null | no |
kubernetes_dns_service_ip | IP address within the Kubernetes service address range that will be used by cluster service discovery (kube-dns) | string | null | no |
kubernetes_gpu_nodepool_labels | A map of Kubernetes labels to apply to the GPU node pool | map(string) | { | no |
kubernetes_gpu_nodepool_max_count | Maximum number of nodes in the GPU node pool | number | 10 | no |
kubernetes_gpu_nodepool_min_count | Minimum number of nodes in the GPU node pool | number | 0 | no |
kubernetes_gpu_nodepool_name | Name of the GPU node pool | string | "gpu" | no |
kubernetes_gpu_nodepool_node_count | Node count of the GPU node pool | number | 0 | no |
kubernetes_gpu_nodepool_taints | A list of Kubernetes taints to apply to the GPU node pool | list(string) | [ | no |
kubernetes_gpu_nodepool_vm_size | VM size used for the GPU node pool | string | "Standard_NC4as_T4_v3" | no |
kubernetes_nodepool_availability_zones | Availability zones to use for the AKS node pools | set(string) | [ | no |
kubernetes_pod_cidr | The CIDR to use for Kubernetes pod IP addresses | string | null | no |
kubernetes_primary_nodepool_labels | A map of Kubernetes labels to apply to the primary node pool | map(string) | { | no |
kubernetes_primary_nodepool_max_count | Maximum number of nodes in the primary node pool | number | 10 | no |
kubernetes_primary_nodepool_min_count | Minimum number of nodes in the primary node pool | number | 1 | no |
kubernetes_primary_nodepool_name | Name of the primary node pool | string | "primary" | no |
kubernetes_primary_nodepool_node_count | Node count of the primary node pool | number | 1 | no |
kubernetes_primary_nodepool_taints | A list of Kubernetes taints to apply to the primary node pool | list(string) | [] | no |
kubernetes_primary_nodepool_vm_size | VM size used for the primary node pool | string | "Standard_D32s_v4" | no |
kubernetes_service_cidr | The CIDR to use for Kubernetes service IP addresses | string | null | no |
location | Azure location to create resources in | string | n/a | yes |
name | Name to use as a prefix for created resources | string | n/a | yes |
network_address_space | CIDR block to be used for the new VNet. By default, AKS uses 10.0.0.0/16 for services and 10.244.0.0/16 for pods. This should not overlap with the kubernetes_service_cidr or kubernetes_pod_cidr variables. | string | "10.1.0.0/16" | no |
nvidia_device_plugin | Install the nvidia-device-plugin helm chart to expose node GPU resources to the AKS cluster. All other nvidia_device_plugin variables are ignored if this variable is false. | bool | true | no |
nvidia_device_plugin_values | Path to templatefile containing custom values for the nvidia-device-plugin helm chart | string | "" | no |
nvidia_device_plugin_variables | Variables passed to the nvidia_device_plugin_values templatefile | any | {} | no |
storage_account_replication_type | Storage account data replication type as described in https://learn.microsoft.com/en-us/azure/storage/common/storage-redundancy | string | "ZRS" | no |
storage_network_rules_default_action | Specifies the default action of the storage firewall to allow or deny when no other rules match | string | "Allow" | no |
storage_public_ip_allow_list | List of public IP or IP ranges in CIDR Format which are allowed to access the storage account. Only IPv4 addresses are allowed. /31 CIDRs, /32 CIDRs, and Private IP address ranges (as defined in RFC 1918), are not allowed. Ignored if storage_public_network_access_enabled is false. | list(string) | [] | no |
storage_public_network_access_enabled | Whether the public network access to the storage account is enabled | bool | true | no |
storage_virtual_network_subnet_ids | List of resource IDs for subnets which are allowed to access the storage account | list(string) | null | no |
tags | A map of tags to add to all created resources | map(string) | { | no |
## Outputs

Name | Description |
---|---|
aks_cluster_id | ID of the Azure Kubernetes Service cluster |
container_registry_admin_password | Admin password of the container registry |
container_registry_admin_username | Admin username of the container registry |
container_registry_id | ID of the container registry |
container_registry_login_server | The URL that can be used to log into the container registry |
private_zone_id | ID of the private zone |
public_zone_id | ID of the public zone |
resource_group_id | The ID of the Resource Group |
storage_access_key | The primary access key for the storage account |
storage_account_name | Name of the storage account |
storage_container_name | Name of the storage container |
user_assigned_identity_client_id | Client ID of the user assigned identity |
user_assigned_identity_id | ID of the user assigned identity |
user_assigned_identity_name | Name of the user assigned identity |
user_assigned_identity_principal_id | Principal ID of the user assigned identity |
user_assigned_identity_tenant_id | Tenant ID of the user assigned identity |
vnet_id | The ID of the VNet |