NVIDIA/cloud-native-stack

NVIDIA Cloud Native Stack

Introduction

NVIDIA Cloud Native Stack (CNS) is a collection of software for running cloud native workloads on NVIDIA GPUs. It is based on Ubuntu/RHEL, Kubernetes, Helm, and the NVIDIA GPU Operator and Network Operator.

Interested in deploying NVIDIA Cloud Native Stack? This repository has install guides for manual installations and Ansible playbooks for automated installations.

Interested in a pre-provisioned NVIDIA Cloud Native Stack environment? NVIDIA LaunchPad provides pre-provisioned environments so that you can quickly get started.

Objective

  • CNS comes as a reference architecture that lists all components that have been tested successfully together. The CNS reference architecture can be used as a specification for production deployments.

  • CNS also comes as installation guides and playbooks that can be used to quickly instantiate a K8s environment with NVIDIA operators. The CNS installation guides and playbooks are intended only for test and PoC environments.

    Note: The K8s layer that the CNS install guides or playbooks deploy is basic (no HA, for instance) and as such cannot be used for production. However, all NVIDIA components in CNS are fully operable in a production environment.

Life Cycle

When a new NVIDIA Cloud Native Stack batch is released, the previous batch enters maintenance support and receives only patch release updates. All earlier batches enter end-of-life (EOL): they are no longer supported and do not receive patch updates.

Note: Upgrades are supported only from the previous batch to the latest batch.

| Batch | Status |
|---|---|
| 24.11.0 | Generally Available |
| 24.8.1 | Maintenance |
| 24.5.0 and lower | EOL |

For more information, refer to Cloud Native Stack Releases.

Component Matrix

Cloud Native Stack Batch 24.11.0 (Release Date: 14 November 2024)

| CNS Version | 14.0 | 13.2 | 12.3 |
|---|---|---|---|
| Platforms | NVIDIA Certified Server (x86 & arm64), DGX Server | NVIDIA Certified Server (x86 & arm64), DGX Server | NVIDIA Certified Server (x86 & arm64), DGX Server |
| Supported OS | Ubuntu 22.04 LTS, RHEL 8.10, DGX OS 6.2 (Ubuntu 22.04 LTS) | Ubuntu 22.04 LTS, RHEL 8.10, DGX OS 6.2 (Ubuntu 22.04 LTS) | Ubuntu 22.04 LTS, RHEL 8.10, DGX OS 6.2 (Ubuntu 22.04 LTS) |
| Containerd | 1.7.23 | 1.7.23 | 1.7.23 |
| NVIDIA Container Toolkit | 1.17.0 | 1.17.0 | 1.17.0 |
| CRI-O | 1.31.2 | 1.30.6 | 1.29.10 |
| Kubernetes | 1.31.2 | 1.30.6 | 1.29.10 |
| CNI (Calico) | 3.28.2 | 3.28.2 | 3.28.2 |
| NVIDIA GPU Operator | 24.9.0 | 24.9.0 | 24.9.0 |
| NVIDIA Network Operator | 24.7.0 | 24.7.0 | 24.7.0 |
| NVIDIA Data Center Driver | 550.127.05 | 550.127.05 | 550.127.05 |
| Helm | 3.16.2 | 3.16.2 | 3.16.2 |

Note: Previous Cloud Native Stack release information can be found here.

NOTE: Cloud Native Stack versions are available on the master branch, but it is recommended to use the branch for the specific version.

Software

NOTE: MicroK8s functionality is currently limited with GPU Operator 24.9.0 because of a known bug. We expect to fix this in an upcoming release.

| CNS Version | 14.0 | 13.2 | 12.3 |
|---|---|---|---|
| MicroK8s | 1.31 | 1.30 | 1.29 |
| KServe | 0.14 (Istio: 1.23.2, Knative: 1.15.7, CertManager: 1.16.1) | 0.14 (Istio: 1.23.2, Knative: 1.15.7, CertManager: 1.16.1) | 0.14 (Istio: 1.23.2, Knative: 1.15.7, CertManager: 1.16.1) |
| LeaderWorkerSet | 0.4.1 | 0.4.1 | 0.4.1 |
| LoadBalancer | MetalLB: 0.14.5 | MetalLB: 0.14.5 | MetalLB: 0.14.5 |
| Storage | NFS: 4.0.18, Local Path: 0.0.30 | NFS: 4.0.18, Local Path: 0.0.30 | NFS: 4.0.18, Local Path: 0.0.30 |
| Monitoring | Prometheus: 25.27.0, Prometheus Adapter: 4.11.0, Elastic: 8.15.3 | Prometheus: 25.27.0, Prometheus Adapter: 4.11.0, Elastic: 8.15.3 | Prometheus: 25.27.0, Prometheus Adapter: 4.11.0, Elastic: 8.15.3 |

Getting Started

Prerequisites

Make sure the following prerequisites are met before installing Cloud Native Stack:

  • The system has direct internet access
  • The operating system is Ubuntu 22.04 or later, or RHEL 8.10
  • The system has adequate internet bandwidth
  • DNS resolution works on the system
  • The system can access the Google repository (for the K8s installation)
  • The system has only one network interface configured with internet access, and its IP address is static
  • UEFI Secure Boot is disabled
  • The root file system has at least 40 GB of capacity
  • The system has 2 CPUs and 4 GB of memory
  • At least one NVIDIA GPU is attached to the system
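
Some of the prerequisites above can be checked from a shell before starting. The following is a minimal sketch for a Linux host, not part of CNS: the function names (`at_least`, `report`, `preflight`) are hypothetical, the thresholds come from the list above and are treated as minimums, and the GPU check assumes `lspci` is installed.

```shell
# Hypothetical preflight helpers; thresholds come from the prerequisites list.

# at_least VALUE MINIMUM: succeeds when VALUE >= MINIMUM (integers only).
at_least() {
  [ "$1" -ge "$2" ]
}

# report LABEL VALUE MINIMUM: prints one OK/FAIL line and returns 1 on failure.
report() {
  if at_least "$2" "$3"; then
    echo "OK: $1 = $2 (minimum $3)"
  else
    echo "FAIL: $1 = $2 (minimum $3)"
    return 1
  fi
}

# preflight: reads values from the local system and checks each one.
preflight() {
  failed=0
  cpus=$(nproc)
  # MemTotal is in kB; round up to GiB because a nominal 4 GB machine
  # reports slightly less than 4 GiB of usable memory.
  mem_gib=$(awk '/^MemTotal:/ {print int(($2 + 1048575) / 1048576)}' /proc/meminfo)
  # Total size of the root file system in GiB (GNU df).
  root_gib=$(df -BG --output=size / | tail -n 1 | tr -dc '0-9')

  report "CPU count" "$cpus" 2 || failed=1
  report "Memory (GiB)" "$mem_gib" 4 || failed=1
  report "Root file system (GiB)" "$root_gib" 40 || failed=1

  # Assumption: lspci is available; otherwise this only prints a warning.
  if command -v lspci >/dev/null 2>&1 && lspci | grep -qi nvidia; then
    echo "OK: NVIDIA GPU detected"
  else
    echo "WARN: no NVIDIA GPU detected via lspci (or lspci is not installed)"
  fi
  return "$failed"
}
```

Calling `preflight` prints one line per check and returns a non-zero status if any threshold is missed. Network, DNS, and Secure Boot checks are left out of this sketch.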

Installation

Run the following commands to clone the NVIDIA Cloud Native Stack repository.

git clone https://github.com/NVIDIA/cloud-native-stack.git
cd cloud-native-stack/playbooks

Update the hosts file in the playbooks directory with the IP addresses, usernames, and passwords of the master node and any worker nodes, as shown below.

nano hosts

[master]
<master-IP> ansible_ssh_user=nvidia ansible_ssh_pass=nvidiapass ansible_sudo_pass=nvidiapass ansible_ssh_common_args='-o StrictHostKeyChecking=no'
[nodes]
<worker-IP> ansible_ssh_user=nvidia ansible_ssh_pass=nvidiapass ansible_sudo_pass=nvidiapass ansible_ssh_common_args='-o StrictHostKeyChecking=no'
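
As an alternative to editing the file by hand, the same layout can be generated from a script. This is only a sketch: `host_line` and `make_inventory` are hypothetical helpers, and `CNS_USER` and `CNS_PASS` are assumed environment variables (not something CNS reads) whose defaults are the placeholder credentials from the example above.

```shell
# host_line IP: prints one inventory entry in the format shown above.
# CNS_USER and CNS_PASS are hypothetical variables used only by this sketch.
host_line() {
  printf "%s ansible_ssh_user=%s ansible_ssh_pass=%s ansible_sudo_pass=%s ansible_ssh_common_args='-o StrictHostKeyChecking=no'\n" \
    "$1" "${CNS_USER:-nvidia}" "${CNS_PASS:-nvidiapass}" "${CNS_PASS:-nvidiapass}"
}

# make_inventory MASTER_IP [WORKER_IP...]: prints a complete hosts file.
# With no worker IPs, the [nodes] section is left empty (single-node setup).
make_inventory() {
  master_ip="$1"
  shift
  echo "[master]"
  host_line "$master_ip"
  echo "[nodes]"
  for worker_ip in "$@"; do
    host_line "$worker_ip"
  done
}
```

For example, `make_inventory 192.0.2.10 192.0.2.11 > hosts` would write one master and one worker entry; the addresses here are documentation placeholders.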

Install NVIDIA Cloud Native Stack by running the command below. "Skipping" in the Ansible output means that the Kubernetes cluster is already up and running.

bash setup.sh install
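
Once the playbook finishes, a quick way to confirm that the cluster is up is to look at node status with `kubectl`. The helper below is a hypothetical sketch, not part of CNS; it assumes `kubectl` is configured on the master node and relies on the standard `kubectl get nodes` column order (NAME, STATUS, ...).

```shell
# not_ready_nodes: reads `kubectl get nodes --no-headers` output on stdin
# and prints the name of every node whose STATUS is not exactly "Ready".
not_ready_nodes() {
  awk '$2 != "Ready" {print $1}'
}

# cluster_ready: succeeds only when kubectl reports every node as Ready.
# Assumption: kubectl is installed and can reach the cluster.
cluster_ready() {
  pending=$(kubectl get nodes --no-headers | not_ready_nodes)
  if [ -n "$pending" ]; then
    echo "Nodes not ready: $pending"
    return 1
  fi
  echo "All nodes are Ready"
}
```

Running `cluster_ready` followed by `kubectl get pods --all-namespaces` gives a first view of whether the operator pods have started.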

For more information about customizing the values, please refer to Installation.

Topologies

  • Cloud Native Stack supports deploying:
    • 1 node with both control plane and worker functionalities
    • 1 control plane node and any number of worker nodes

NOTE: Cloud Native Stack does not support deploying multiple control plane nodes.
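
The single control plane constraint can be checked against the hosts file before running the installer. This is a sketch under assumptions: `count_section_hosts` and `check_topology` are hypothetical helpers, and the file is assumed to use the `[master]` and `[nodes]` sections from the Installation example with one host per line.

```shell
# count_section_hosts FILE SECTION: counts non-empty, non-comment lines
# that appear under [SECTION] in an INI-style inventory file.
count_section_hosts() {
  awk -v section="[$2]" '
    /^\[/ { in_section = ($0 == section); next }
    in_section && NF && $1 !~ /^#/ { count++ }
    END { print count + 0 }
  ' "$1"
}

# check_topology FILE: succeeds only when exactly one control plane node
# is listed; any number of worker nodes (including zero) is accepted.
check_topology() {
  masters=$(count_section_hosts "$1" master)
  workers=$(count_section_hosts "$1" nodes)
  if [ "$masters" -ne 1 ]; then
    echo "unsupported: expected exactly 1 control plane node, found $masters"
    return 1
  fi
  echo "ok: 1 control plane node, $workers worker node(s)"
}
```

Running `check_topology hosts` from the playbooks directory prints a one-line summary and returns a non-zero status when the `[master]` section does not contain exactly one host.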

Troubleshooting

Troubleshoot CNS installation issues

Getting Help or Providing Feedback

Please open an issue on the GitHub project for any questions. Your feedback is appreciated.

Useful Links