Images

FAQ | Troubleshooting | Glossary

Overview

Google Cloud Platform instances require a source image or source image family to boot from. SchedMD provides public images for Slurm instances; these include an HPC software stack so that they are HPC ready. Alternatively, custom images can be created and used instead.

Supported Operating Systems

slurm-gcp generally supports images built on these OS families:

| Project | Image Family | Arch |
| --- | --- | --- |
| cloud-hpc-image-public | hpc-centos-7 | x86_64 |
| cloud-hpc-image-public | hpc-rocky-linux-8 | x86_64 |
| debian-cloud | debian-11 | x86_64 |
| ubuntu-os-cloud | ubuntu-2004-lts | x86_64 |
| ubuntu-os-cloud | ubuntu-2204-lts-arm64 | ARM64 |

Installed Software for HPC

  • Slurm
    • 23.02.4
  • lmod
  • openmpi
    • v4.1.x
  • cuda
    • Limited to x86_64 only
    • Latest CUDA toolkit and NVIDIA driver
    • NVIDIA 470 and CUDA 11.4.4 installed on hpc-centos-7-k80 variant image for compatibility with K80 GPUs.
  • lustre
    • Only supports x86_64
    • Client version 2.12-2.15 depending on the package available for the image OS.

Public Image

SchedMD releases public images on Google Cloud Platform. These are minimum viable images for deploying Slurm clusters through all methods and configurations.

NOTE: SchedMD generates images using the same process as documented in custom images but without any additional software and only using clean minimal base images for the source image (e.g. ubuntu-os-cloud/ubuntu-2004-lts).

Docker images are also released for TPU nodes.

Published Image Family

| Project | Image Family | Arch | Status |
| --- | --- | --- | --- |
| schedmd-slurm-public | slurm-gcp-6-1-debian-11 | x86_64 | Supported |
| schedmd-slurm-public | slurm-gcp-6-1-hpc-rocky-linux-8 | x86_64 | Supported |
| schedmd-slurm-public | slurm-gcp-6-1-ubuntu-2004-lts | x86_64 | Supported |
| schedmd-slurm-public | slurm-gcp-6-1-ubuntu-2204-lts-arm64 | ARM64 | Supported |
| schedmd-slurm-public | slurm-gcp-6-1-hpc-centos-7-k80 | x86_64 | EOL 2024-05-01 |
| schedmd-slurm-public | slurm-gcp-6-1-hpc-centos-7 | x86_64 | EOL 2024-01-01 |
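
A published image family can be resolved to a concrete image before it is used elsewhere. The following is a minimal Terraform sketch, assuming the Google provider is already configured; the data source label `slurm_debian` and the output name are hypothetical, while the project and family come from the table above.

```hcl
# Look up the newest non-deprecated image in a published family.
# "slurm_debian" is an illustrative label, not a name used by slurm-gcp.
data "google_compute_image" "slurm_debian" {
  project = "schedmd-slurm-public"
  family  = "slurm-gcp-6-1-debian-11"
}

# Expose the resolved image so other configuration can reference it.
output "slurm_image_self_link" {
  value = data.google_compute_image.slurm_debian.self_link
}
```

Referencing the family rather than a fixed image name means a later `terraform plan` may pick up a newer image in the same family.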

Published Docker Image Family

| Project | Image Family | Status |
| --- | --- | --- |
| schedmd-slurm-public | tpu:slurm-gcp-6-1-tf-2.8.0 | Supported |
| schedmd-slurm-public | tpu:slurm-gcp-6-1-tf-2.8.3 | Supported |
| schedmd-slurm-public | tpu:slurm-gcp-6-1-tf-2.9.1 | Supported |
| schedmd-slurm-public | tpu:slurm-gcp-6-1-tf-2.9.3 | Supported |
| schedmd-slurm-public | tpu:slurm-gcp-6-1-tf-2.10.0 | Supported |
| schedmd-slurm-public | tpu:slurm-gcp-6-1-tf-2.10.1 | Supported |
| schedmd-slurm-public | tpu:slurm-gcp-6-1-tf-2.11.0 | Supported |
| schedmd-slurm-public | tpu:slurm-gcp-6-1-tf-2.11.1 | Supported |
| schedmd-slurm-public | tpu:slurm-gcp-6-1-tf-2.12.0 | Supported |
| schedmd-slurm-public | tpu:slurm-gcp-6-1-tf-2.12.1 | Supported |
| schedmd-slurm-public | tpu:slurm-gcp-6-1-tf-2.13.0 | Supported |

Custom Image

To create your own slurm_cluster-compliant images, build a custom Slurm image. Packer and Ansible orchestrate custom image creation.

Custom images can be built from a supported private or public image (e.g. hpc-centos-7, centos-7). Additionally, Ansible roles or scripts can be added to the provisioning process to install custom software and configure the custom Slurm image.

Requirements

Creation

Install software dependencies and build images from configuration.

See slurm-gcp packer project for details.

Customize

Before you build your images with packer, you can modify how the build will happen. Custom packages and other image configurations can be added by a few methods. All methods below may be used together in any combination, if desired.

  • The scripts role runs every script globbed from scripts.d. This method is intended for simple configuration scripts.
  • Image configuration can be extended by specifying extra custom playbooks using the input variable extra_ansible_provisioners. These playbooks will be applied after Slurm installation is complete. For example, the following configuration will run a playbook without any dependencies on extra Ansible Galaxy roles:
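    extra_ansible_provisioners = [
      {
        # Path to the playbook to run after Slurm installation
        playbook_file   = "/home/username/playbooks/custom.yaml"
        # No Ansible Galaxy requirements file is needed for this playbook
        galaxy_file     = null
        # Extra command-line arguments passed to Ansible (verbose output)
        extra_arguments = ["-vv"]
        # null keeps the default user
        user            = null
      },
    ]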
  • The Slurm image can be built on top of an existing image. Configure the pkrvars file with source_image or source_image_family pointing to your image. This method is intended for more complex configurations, such as those produced by a workflow or pipeline.
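
As an illustration of the last method, a pkrvars fragment might look like the sketch below. The file name and the image family `my-base-image-family` are hypothetical, and whether these variables sit at the top level or inside a nested block depends on the version of the packer project, so check its example variable file before copying this.

```hcl
# custom.pkrvars.hcl (hypothetical file name)

# Set one of the two variables and leave the other null.
# source_image pins one exact image; source_image_family follows
# the newest image in a family.
source_image        = null
source_image_family = "my-base-image-family" # hypothetical family name
```

Pointing at a family is convenient when a pipeline regularly publishes new base images, because each Slurm image build then starts from the most recent one.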

Shielded VM Support

Recently published images in project schedmd-slurm-public support Shielded VMs, provided the instance does not use GPUs or mount a Lustre filesystem. Both of these features require kernel modules, which must be signed to be compatible with SecureBoot.

If you need GPUs, our published image family based on ubuntu-os-cloud/ubuntu-2004-lts has signed NVIDIA drivers installed and therefore supports GPUs with SecureBoot and Shielded VMs.
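
For illustration only, the sketch below shows a standalone GPU instance booted from that image family with Secure Boot enabled, written against the Terraform Google provider. It is not how the slurm-gcp modules create nodes; the resource label, instance name, zone, machine type, GPU type, and network are all assumptions chosen for the example.

```hcl
# Standalone example instance; all names and sizes are illustrative.
resource "google_compute_instance" "gpu_secure_boot" {
  name         = "slurm-gpu-secure-boot"
  machine_type = "n1-standard-4"
  zone         = "us-central1-a"

  boot_disk {
    initialize_params {
      # Published family with signed NVIDIA drivers (see text above).
      image = "projects/schedmd-slurm-public/global/images/family/slurm-gcp-6-1-ubuntu-2004-lts"
    }
  }

  network_interface {
    network = "default"
  }

  guest_accelerator {
    type  = "nvidia-tesla-t4"
    count = 1
  }

  # GPU instances cannot live-migrate, so they must terminate on maintenance.
  scheduling {
    on_host_maintenance = "TERMINATE"
  }

  # Shielded VM settings; Secure Boot only loads signed kernel modules.
  shielded_instance_config {
    enable_secure_boot          = true
    enable_vtpm                 = true
    enable_integrity_monitoring = true
  }
}
```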

If you need Lustre or GPUs on a different OS, you can set this up manually with a custom image. Doing so requires:

  • generating a private/public key pair with openssl
  • signing the needed kernel modules
  • including the public key in the UEFI authorized keys db of the image
    • Create the image with gcloud compute images create and the --signature-database-file option.
    • Include the default Microsoft keys as well, because this option overwrites the default key database.
    • Packer does not appear to support this image creation option at this time, so the image creation step must be manual.

More details on this process are beyond the scope of this documentation. See link and/or contact Google for more information.