Red Hat OpenShift Container Platform is a security-centric, enterprise-grade hardened Kubernetes platform for deploying and managing Kubernetes clusters at scale, developed and supported by Red Hat. It includes enhancements to Kubernetes so that users can easily configure and use GPU resources to accelerate workloads such as deep learning. The GPU Operator manages NVIDIA GPU resources in an OpenShift cluster and automates tasks related to bootstrapping GPU nodes. Because the GPU is a special resource in the cluster, a few components must be installed before application workloads can be deployed onto it: the NVIDIA drivers (to enable CUDA), the Kubernetes device plugin, the container runtime, and others such as automatic node labelling and monitoring.
The NVIDIA GPU Operator uses the Operator Framework within Kubernetes to automate the management of all NVIDIA software components needed to provision GPUs. These components include the NVIDIA drivers (to enable CUDA), the Kubernetes device plugin for GPUs, the NVIDIA Container Toolkit, automatic node labelling using GFD (GPU Feature Discovery), DCGM-based monitoring, and others.
Before following the steps in this guide, ensure that your environment meets the prerequisites for the GPU Operator.
The preferred method to deploy the GPU Operator is to use Helm.
Below is the command to install Helm.
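One common approach, assuming a Linux client with curl available, is the official Helm 3 install script:
curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
chmod 700 get_helm.sh
./get_helm.sh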
To validate the Helm installation, the command below should return a table if Helm is installed successfully.
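A minimal check, assuming a standard Helm 3 client, is helm list, which prints a table of installed releases (empty until a chart is installed):
helm list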
Now, add the Rocket Helm repository for the GPU Operator:
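For illustration, assuming a hypothetical repository URL (substitute the actual URL of your internal chart repository):
helm repo add rocketgpu https://charts.example.com/rocketgpu
helm repo update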
To validate that the repository was added, the command below should list the rocketgpu repo:
helm repo list
The GPU Operator can be installed using the command below, which uses the default configuration:
helm install --wait --generate-name rocketgpu/gpu-operator
The GPU Operator Helm chart offers a number of customizable options that can be configured depending on your environment.
Chart Customization Options:
The following options are available when using the Helm chart and can be passed with --set at install time; the option names follow the NVIDIA GPU Operator chart.
nfd.enabled: Deploys the Node Feature Discovery plugin as a daemonset. Set this variable to false if NFD is already running in the cluster.
operator.defaultRuntime: By default, the operator assumes your Kubernetes deployment is running with docker as its container runtime. Other values are either crio (for CRI-O) or containerd (for containerd).
mig.strategy: Controls the strategy to be used with MIG on supported NVIDIA GPUs. Options are either mixed or single.
psp.enabled: The GPU Operator deploys PodSecurityPolicies if enabled.
driver.enabled: By default, the Operator deploys NVIDIA drivers as a container on the system. Set this value to false when using the Operator on systems with pre-installed drivers.
driver.repository: The images are downloaded from NGC. Specify another image repository when using custom driver images.
driver.version: Version of the NVIDIA datacenter driver supported by the Operator. This depends on the version of the Operator; see the Component Matrix for more information on supported drivers.
driver.rdma.enabled: Controls whether the driver daemonset should build and load the nvidia-peermem kernel module.
toolkit.enabled: By default, the Operator deploys the NVIDIA Container Toolkit (nvidia-docker2 stack) as a container on the system. Set this value to false when using the Operator on systems with pre-installed NVIDIA runtimes.
migManager.enabled: The MIG manager watches for changes to the MIG geometry and applies reconfiguration as needed. By default, the MIG manager only runs on nodes with GPUs that support MIG (e.g., A100).
The above parameters can be used to customize the installation. For example, if NVIDIA drivers are already pre-installed as part of the system image:
helm install --wait --generate-name rocketgpu/gpu-operator --set driver.enabled=false
In another scenario, you may want to install a specific version of the GPU Operator and use a secret to pull images.
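A sketch of such an install, assuming a hypothetical chart version (1.9.0) and a pre-created image pull secret named my-pull-secret (the imagePullSecrets key mirrors the NVIDIA GPU Operator chart and may differ in this internal chart):
helm install --wait --generate-name rocketgpu/gpu-operator --version 1.9.0 --set "operator.imagePullSecrets[0]=my-pull-secret"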
The commands below can be used to verify the successful installation of the NVIDIA GPU Operator.
oc get pods,daemonset -n gpu-operator-resources
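Once all pods are running, the node should advertise nvidia.com/gpu as an allocatable resource; a quick check, assuming at least one GPU node in the cluster:
oc describe node | grep -i "nvidia.com/gpu"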