RocketCE for Power

Installing GPU Operator in Openshift environment for ppc64le

1. Installing GPU Operator in Openshift environment for ppc64le

ROCKETEER

Uvaise Ahamed

Posted 11-12-2021 08:52
Edited by Uvaise Ahamed 05-31-2022 14:19

Introduction

Red Hat OpenShift Container Platform is a security-centric and enterprise-grade hardened Kubernetes platform for deploying and managing Kubernetes clusters at scale, developed and supported by Red Hat. Red Hat OpenShift Container Platform includes enhancements to Kubernetes so users can easily configure and use GPU resources for accelerating workloads such as deep learning.

The GPU Operator manages NVIDIA GPU resources in an OpenShift cluster and automates the tasks involved in bootstrapping GPU nodes. Because the GPU is a special resource in the cluster, several components must be installed before application workloads can be deployed onto it.

The NVIDIA GPU Operator uses the operator framework within Kubernetes to automate the management of all NVIDIA software components needed to provision GPUs. These components include the NVIDIA drivers (to enable CUDA), the Kubernetes device plugin for GPUs, the NVIDIA Container Toolkit, automatic node labelling using GPU Feature Discovery (GFD), DCGM-based monitoring, and others.

Prerequisites

Before following the steps in this guide, ensure that your environment has:

A working OpenShift cluster with a GPU worker node. See OpenShift Container Platform installation for guidance on installing. Refer to Container Platforms for the support matrix of GPU Operator releases and supported container platforms.
Access to the OpenShift cluster as a cluster-admin to perform the necessary steps.
OpenShift CLI (oc) installed.
Red Hat Enterprise Linux (RHEL) 8.x
Appropriate Red Hat subscriptions and entitlements for OpenShift, properly enabled. For details on enabling entitlements, refer to Step 1 and Step 2 on this page.
The Node Feature Discovery (NFD) operator, installed using Step 3 of this page.
An API key to access images from icr.io. If you don't have one, request an API key by emailing rocketce@rocketsoftware.com. For more details, refer to IBM Cloud Docs.

Verification of Prerequisites

1. Verify entitlements

# oc get machineconfig | grep entitlement
50-entitlement-key-pem                                                                                       2.2.0             4d1h
50-entitlement-pem                                                                                           2.2.0             4d1h

# oc get mcp 
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-30627edc9bc847e2f6a7a9755561f65c   True      False      False      3              3                   3                     0                      24d
worker   rendered-worker-8d10dd8f6569ba1505d831e95a6b7d6c   True      False      False      4              4                   4                     0                      24d
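
The pools in the output above are healthy when every row shows UPDATED=True, UPDATING=False and DEGRADED=False. As a minimal sketch, assuming the column order shown above, that check could be scripted as follows (the function name mcp_not_ready is hypothetical):

```shell
# Print the name of every MachineConfigPool that is not fully rolled out.
# Reads `oc get mcp` output on stdin and assumes the column order
# NAME CONFIG UPDATED UPDATING DEGRADED ... shown above.
mcp_not_ready() {
  awk 'NR > 1 && !($3 == "True" && $4 == "False" && $5 == "False") { print $1 }'
}

# Usage (requires cluster access):
#   oc get mcp | mcp_not_ready
# No output means every pool is updated and not degraded.
```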

2. Verify that the Node Feature Discovery (NFD) operator is installed

NFD can be validated in the web console under Operators -> Installed Operators.
Alternatively, run the command below from the CLI.

# oc get deployment -n openshift-operators
NAME                   READY  UP-TO-DATE   AVAILABLE    AGE
nfd-controller-manager 1/1    1            1            2d22h

Deployment steps

The preferred method to deploy the GPU Operator is using helm.

1. Create Project:

Using the OpenShift console, create a project to hold the GPU Operator resources.
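
The project can also be created from the CLI instead of the console. The sketch below assumes you are logged in with oc as a cluster-admin; the function name ensure_project and the example project name are hypothetical:

```shell
# Create the project only if it does not already exist.
ensure_project() {
  project="$1"
  if oc get project "$project" >/dev/null 2>&1; then
    echo "project $project already exists"
  else
    # `oc new-project` also switches the current context to the new project.
    oc new-project "$project"
  fi
}

# Usage (requires cluster access):
#   ensure_project gpu-operator-demo
```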

2. Install Helm:

Run the following command to install Helm:

curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 \
&& chmod 700 get_helm.sh \
&& ./get_helm.sh

To validate the Helm installation, run the command below. If Helm is installed successfully, it returns a table of releases (empty on a fresh cluster).

helm list

3. Add the Rocket Helm repository for the GPU Operator

Add the Rocket Helm repository for the GPU Operator:

helm repo add rocketgpu https://rocketsoftware.github.io/gpu-operator \
&& helm repo update

To validate the repository addition, the command below should list rocketgpu:

helm repo list

4. Create an image pull secret to store the API key credentials

a) Create the secret for the API key obtained from Rocket to access the images in icr.io:

oc --namespace <project_name> create secret docker-registry icrlogin --docker-server=<registry_URL> --docker-username=iamapikey --docker-password=<api_key_value> --docker-email=<docker_email>

b) Verify that the secret was created. The command below lists the secrets in the namespace; the name "icrlogin" must be present.

oc get secrets --namespace <project_name>

c) Store the image pull secret in the Kubernetes service account for the selected project. Every OpenShift project has a Kubernetes service account that is named default. Within the project, you can add the image pull secret to this service account to grant access for pods to pull images from your registry. Deployments that do not specify a service account automatically use the default service account for this OpenShift project.

Check if an image pull secret already exists for your default service account.

oc describe serviceaccount default -n <project_name>

When <none> is displayed in the Image pull secrets entry, no image pull secret exists.

d) Add the image pull secret to your default service account.

Example command to add the image pull secret when no image pull secret is defined.

oc patch -n <project_name> serviceaccount/default -p '{"imagePullSecrets":[{"name": "icrlogin"}]}'

Example command to add the image pull secret when an image pull secret is already defined.

oc patch -n <project_name> serviceaccount/default --type='json' -p='[{"op":"add","path":"/imagePullSecrets/-","value":{"name":"icrlogin"}}]'
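
Which of the two patch commands applies depends on whether the default service account already lists any image pull secrets. The sketch below makes that decision automatically; it assumes the secret is named icrlogin as in step a), and the function name add_pull_secret is hypothetical:

```shell
# Attach a pull secret to serviceaccount/default without dropping existing ones.
add_pull_secret() {
  project="$1"
  secret="${2:-icrlogin}"
  # Prints the names of the existing pull secrets, or nothing if none are set.
  existing="$(oc get serviceaccount default -n "$project" \
    -o jsonpath='{.imagePullSecrets[*].name}')"
  case " $existing " in
    *" $secret "*)
      echo "$secret is already attached to serviceaccount/default"
      ;;
    "  ")
      # No pull secrets yet: a merge patch that sets the list is safe.
      oc patch -n "$project" serviceaccount/default \
        -p "{\"imagePullSecrets\":[{\"name\":\"$secret\"}]}"
      ;;
    *)
      # Pull secrets already present: append with a JSON patch so they are kept.
      oc patch -n "$project" serviceaccount/default --type=json \
        -p "[{\"op\":\"add\",\"path\":\"/imagePullSecrets/-\",\"value\":{\"name\":\"$secret\"}}]"
      ;;
  esac
}

# Usage (requires cluster access):
#   add_pull_secret <project_name>
```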

e) Verify that your image pull secret was added to your default service account.

oc describe serviceaccount default -n <project_name>

Example output

Name: default
Namespace: <namespace_name>
Labels: <none>
Annotations: <none>
Image pull secrets: <image_pull_secret_name>
Mountable secrets: default-token-sh2dx
Tokens: default-token-sh2dx
Events: <none>

If the Image pull secrets entry says <secret> (not found), verify that the image pull secret exists in the same project as your service account by running oc get secrets -n <project_name>.

5. Installation of GPU Operator

Install the GPU Operator with the command below, which uses the default configuration.

helm install --wait --generate-name rocketgpu/gpu-operator -n <project_name>

The GPU Operator Helm chart offers a number of customizable options that can be configured depending on your environment.

Chart Customization Options:

The following options are available when using the Helm chart. These options can be used with --set when installing via Helm.

Parameter	Description	Default
nfd.enabled	Deploys Node Feature Discovery plugin as a daemonset. Set this variable to false if NFD is already running in the cluster.	true
operator.defaultRuntime	By default, the operator assumes your Kubernetes deployment is running with docker as its container runtime. Other values are either crio (for CRI-O) or containerd (for containerd).	docker
mig.strategy	Controls the strategy to be used with MIG on supported NVIDIA GPUs. Options are either mixed or single.	single
psp.enabled	The GPU operator deploys PodSecurityPolicies if enabled.	false
driver.enabled	By default, the Operator deploys NVIDIA drivers as a container on the system. Set this value to false when using the Operator on systems with pre-installed drivers.	true
driver.repository	Repository from which the driver images are downloaded. Specify another image repository when using custom driver images.	icr.io/rocketce
driver.version	Version of the NVIDIA datacenter driver supported by the Operator.	Depends on the version of the Operator. See the Component Matrix for more information on supported drivers.
driver.rdma.enabled	Controls whether the driver daemonset should build and load the nvidia-peermem kernel module.	false
toolkit.enabled	By default, the Operator deploys the NVIDIA Container Toolkit (nvidia-docker2 stack) as a container on the system. Set this value to false when using the Operator on systems with pre-installed NVIDIA runtimes.	true
migManager.enabled	The MIG manager watches for changes to the MIG geometry and applies reconfiguration as needed. By default, the MIG manager only runs on nodes with GPUs that support MIG (for example, A100).	true

The above parameters can be used to customize the installation. For example, if the NVIDIA drivers are already pre-installed as part of the system image:

helm install --wait --generate-name rocketgpu/gpu-operator -n <project_name> --set driver.enabled=false

As another example, to install the GPU Operator with a specific driver version and use a secret to pull the image:

helm install --wait --generate-name \
rocketgpu/gpu-operator \
-n <project_name> \
--set driver.version=$VERSION \
--set driver.imagePullSecrets={$REGISTRY_SECRET_NAME} \
--set driver.licensingConfig.configMapName=licensing-config
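
When several --set flags accumulate, the same options can be kept in a values file and passed to Helm with -f. The sketch below assumes the Rocket chart honours the parameters in the table above. The two settings are illustrative only: NFD is disabled because the NFD operator is already installed as a prerequisite, and crio is chosen on the assumption that the chart should target OpenShift's CRI-O runtime. The function name and file name are hypothetical.

```shell
# Write a Helm values file equivalent to
#   --set nfd.enabled=false --set operator.defaultRuntime=crio
write_gpu_values() {
  cat > "$1" << 'EOF'
nfd:
  enabled: false
operator:
  defaultRuntime: crio
EOF
}

# Usage (requires cluster access):
#   write_gpu_values gpu-values.yaml
#   helm install --wait --generate-name rocketgpu/gpu-operator -n <project_name> -f gpu-values.yaml
```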

6. Verify the successful installation of the NVIDIA GPU Operator

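The command below verifies the installation. All pods should be Running or Completed, and every daemonset should report READY equal to DESIRED (nvidia-mig-manager shows 0 on nodes without MIG-capable GPUs).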

oc get pods,daemonset -n gpu-operator-resources

NAME	READY	STATUS	RESTARTS	AGE
pod/gpu-feature-discovery-vwhnt	1/1	Running	0	6m32s
pod/nvidia-container-toolkit-daemonset-k8x28	1/1	Running	0	6m33s
pod/nvidia-cuda-validator-xr5sz	0/1	Completed	0	90s
pod/nvidia-dcgm-5grvn	1/1	Running	0	6m32s
pod/nvidia-dcgm-exporter-cp8ml	1/1	Running	0	6m32s
pod/nvidia-device-plugin-daemonset-p9dp4	1/1	Running	0	6m32s
pod/nvidia-device-plugin-validator-mrhst	0/1	Completed	0	48s
pod/nvidia-driver-daemonset-pbplc	1/1	Running	0	6m33s
pod/nvidia-node-status-exporter-s2ml2	1/1	Running	0	6m33s
pod/nvidia-operator-validator-44jdf	1/1	Running	0	6m32s

NAME	DESIRED	CURRENT	READY	UP-TO-DATE	AVAILABLE	NODE SELECTOR	AGE
daemonset.apps/gpu-feature-discovery	1	1	1	1	1	nvidia.com/gpu.deploy.gpu-feature-discovery=true	6m32s
daemonset.apps/nvidia-container-toolkit-daemonset	1	1	1	1	1	nvidia.com/gpu.deploy.container-toolkit=true	6m33s
daemonset.apps/nvidia-dcgm	1	1	1	1	1	nvidia.com/gpu.deploy.dcgm=true	6m33s
daemonset.apps/nvidia-dcgm-exporter	1	1	1	1	1	nvidia.com/gpu.deploy.dcgm-exporter=true	6m33s
daemonset.apps/nvidia-device-plugin-daemonset	1	1	1	1	1	nvidia.com/gpu.deploy.device-plugin=true	6m33s
daemonset.apps/nvidia-driver-daemonset	1	1	1	1	1	nvidia.com/gpu.deploy.driver=true	6m33s
daemonset.apps/nvidia-mig-manager	0	0	0	0	0	nvidia.com/gpu.deploy.mig-manager=true	6m32s
daemonset.apps/nvidia-node-status-exporter	1	1	1	1	1	nvidia.com/gpu.deploy.node-status-exporter=true	6m34s
daemonset.apps/nvidia-operator-validator	1	1	1	1	1	nvidia.com/gpu.deploy.operator-validator=true	6m33s
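
Scanning the pod table by eye gets tedious on larger clusters. As a rough sketch, assuming the NAME READY STATUS column order shown above, the check can be reduced to listing only the pods that need attention (the function name pods_not_healthy is hypothetical):

```shell
# Print every pod that is neither Running nor Completed, with its status.
# Reads `oc get pods` output on stdin (columns: NAME READY STATUS ...).
pods_not_healthy() {
  awk 'NR > 1 && $3 != "Running" && $3 != "Completed" { print $1, $3 }'
}

# Usage (requires cluster access):
#   oc get pods -n gpu-operator-resources | pods_not_healthy
# No output means every pod is Running or Completed.
```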

7. Running Sample GPU Applications using a sample pod

Create the sample pod below to validate that the GPU Operator was installed successfully.

cat << EOF | oc apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-operator-test
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vector-add
    image: "quay.io/mgiessing/cuda-sample:vectoradd-cuda11.2.1-ubi8"
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
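
To see whether the sample pod actually ran on the GPU, its phase and logs can be inspected. The sketch below assumes the pod name gpu-operator-test from the manifest above and that the current project is the one the pod was created in; what the sample image prints is not documented here, so the function only reports the phase and shows the logs. The function name is hypothetical.

```shell
# Show the phase and logs of the test pod; succeed only if the pod has
# completed. With restartPolicy OnFailure, the phase becomes Succeeded
# once the container exits with code 0.
show_test_pod_result() {
  pod="${1:-gpu-operator-test}"
  phase="$(oc get pod "$pod" -o jsonpath='{.status.phase}')"
  echo "phase: $phase"
  oc logs "$pod"
  [ "$phase" = "Succeeded" ]
}

# Usage (requires cluster access):
#   show_test_pod_result && echo "GPU test pod completed"
```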

8. Running Sample GPU Applications using a sample deployment

Apply a sample deployment file and check the logs once the pod is in the Running state.

[root@cp4d-2-bastion sample_ppc64le]# cat cuda-demo1.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cuda-demo
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cuda-demo
  template:
    metadata:
      labels:
        app: cuda-demo
    spec:
      containers:
      - name: cuda-demo
        image: nvidia/cuda-ppc64le:11.4.0-runtime
        command: ["/bin/sh", "-c"]
        args: ["nvidia-smi && tail -f /dev/null"]
        resources:
          limits:
            nvidia.com/gpu: 1
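
The apply-and-check sequence described above might look like the following sketch. It assumes the manifest is saved as cuda-demo1.yaml and that the deployment is named cuda-demo as in the file; the function name and the 300-second timeout are arbitrary choices for illustration.

```shell
# Apply the deployment, wait for it to roll out, then print the container
# logs, which should contain the nvidia-smi output from the container's args.
run_cuda_demo() {
  manifest="${1:-cuda-demo1.yaml}"
  oc apply -f "$manifest" &&
  oc rollout status deployment/cuda-demo --timeout=300s &&
  oc logs deployment/cuda-demo
}

# Usage (requires cluster access):
#   run_cuda_demo cuda-demo1.yaml
```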

9. Uninstall

To uninstall the operator:

helm delete -n <project_name> $(helm list -n <project_name> | grep gpu-operator | awk '{print $1}')

All the operator pods should now be deleted:

oc get pods -n gpu-operator-resources
No resources found.

Also, check whether the CRD created during the operator install has been removed:

oc get crds | grep -i clusterpolicies.nvidia.com

If it is still listed, delete it:

oc delete crd clusterpolicies.nvidia.com

Note:
After uninstalling the GPU Operator, the NVIDIA driver modules might still be loaded. Either reboot the node or unload them using the following command:

sudo rmmod nvidia_modeset nvidia_uvm nvidia

------------------------------
Uvaise Ahamed
Rocket Internal - All Brands
------------------------------

2. RE: Installing GPU Operator in Openshift environment for ppc64le

ROCKETEER

Suganya Koteeswaran

Posted 12-01-2021 07:06

What are the supported versions of OpenShift?

------------------------------
Suganya Koteeswaran
Rocket Internal - All Brands
------------------------------

3. RE: Installing GPU Operator in Openshift environment for ppc64le

ROCKETEER

Suyog Jadhav

Posted 12-03-2021 07:58

The NVIDIA GPU Operator for the ppc64le platform is currently supported on OpenShift 4.8.

------------------------------
Suyog Jadhav
Rocket Internal - All Brands
------------------------------

4. RE: Installing GPU Operator in Openshift environment for ppc64le

ROCKETEER

Yeshwanth Kumar

Posted 12-06-2021 01:58

Can this GPU Operator be used on an x86 machine?

------------------------------
Yeshwanth Kumar
Software Engineer
Rocket Forum Shared Account
------------------------------

Original Message

Original Message:
Sent: 11-12-2021 08:43
From: Uvaise Ahamed
Subject: Installing GPU Operator in Openshift environment for ppc64le

Introduction

Prerequisites

Before following the steps in this guide, ensure that your environment has:

A working OpenShift cluster up and running with a GPU worker node. See OpenShift Container Platform installation for guidance on installing. Refer to Container Platforms for the support matrix of the GPU Operator releases and the supported container platforms for more information.
Access to the OpenShift cluster as a cluster-admin to perform the necessary steps.
OpenShift CLI (oc) installed.
RedHat Enterprise Linux (RHEL) 8.X
Ensure that the appropriate Red Hat subscriptions and entitlements for OpenShift are properly enabled.
API Key to access images from icr.io. If you dont have one, request api key to access icr.io to pull images at following email address rocketce@rocketsoftware.com. For more details refer IBM Cloud Docs .

Deployment steps

The preferred method to deploy the GPU Operator is using helm.

1. Install Helm:

Below is the command to install helm

curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 \

&& chmod 700 get_helm.sh \

&& ./get_helm.sh

To validate helm installation, below command should be returning a table if helm is installed successfully

helm list

2. Add Rocket Helm repository for Gpu Operator

Now, add the Rocket Helm repository for Gpu Operator:

helm repo add rocketgpu https://rocketsoftware.github.io/gpu-operator \
&& helm repo update

To validate repo addition, below command should list the added rocketgpu

helm repo list

3. Create an image pull secret to store the API key credentials

a) Create a secret for the API key obtained from Rocket to access the images in icr.io:

oc --namespace <project> create secret docker-registry icrlogin --docker-server=<registry_URL> --docker-username=iamapikey --docker-password=<api_key_value> --docker-email=<docker_email>

b) Verify that the secret was created. The command below lists the secrets in the namespace; the name "icrlogin" must be present.

oc get secrets --namespace <project>

c) Check whether an image pull secret already exists for your default service account.

oc describe serviceaccount default -n <project_name>

When <none> is displayed in the Image pull secrets entry, no image pull secret exists.

d) Add the image pull secret to your default service account.

Example command to add the image pull secret when no image pull secret is defined.

oc patch -n <project_name> serviceaccount/default -p '{"imagePullSecrets":[{"name": "icrlogin"}]}'

Example command to add the image pull secret when an image pull secret is already defined.

oc patch -n <project_name> serviceaccount/default --type='json' -p='[{"op":"add","path":"/imagePullSecrets/-","value":{"name":"icrlogin"}}]'

e) Verify that your image pull secret was added to your default service account.

oc describe serviceaccount default -n <project_name>

Example output

Name: default

Namespace: <namespace_name>

Labels: <none>

Annotations: <none>

Image pull secrets: <image_pull_secret_name>

Mountable secrets: default-token-sh2dx

Tokens: default-token-sh2dx

Events: <none>

If the Image pull secrets entry says <secret> (not found), verify that the image pull secret exists in the same project as your service account by running oc get secrets -n <project_name>.
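
The choice between the two patch commands in step d) can be scripted. The sketch below is an illustration only: `add_pull_secret` is a hypothetical function name, and it assumes that `oc get ... -o jsonpath='{.imagePullSecrets}'` prints nothing when the service account has no image pull secrets defined.

```shell
# add_pull_secret PROJECT SECRET
# Hypothetical helper: adds SECRET to the default service account's
# imagePullSecrets, choosing the patch style from step d).
add_pull_secret() {
    project=$1
    secret=$2
    # Assumption: empty output means no imagePullSecrets are defined yet.
    existing=$(oc get serviceaccount default -n "$project" -o jsonpath='{.imagePullSecrets}')
    if [ -z "$existing" ]; then
        # No list yet: create it with a merge patch.
        oc patch -n "$project" serviceaccount/default \
            -p "{\"imagePullSecrets\":[{\"name\":\"$secret\"}]}"
    else
        # List exists: append to it with a JSON patch.
        oc patch -n "$project" serviceaccount/default --type=json \
            -p "[{\"op\":\"add\",\"path\":\"/imagePullSecrets/-\",\"value\":{\"name\":\"$secret\"}}]"
    fi
}

# Usage:
#   add_pull_secret <project_name> icrlogin
```

Appending with a JSON patch keeps any pull secrets that are already attached to the service account, whereas the merge patch would replace the whole list.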

4. Installation of GPU Operator

Install the GPU Operator with the command below, which uses the default configuration.

helm install --wait --generate-name rocketgpu/gpu-operator

The GPU Operator Helm chart offers a number of customizable options that can be configured depending on your environment.

Chart Customization Options:

The following options are available when using the Helm chart. These options can be used with --set when installing via Helm.

Parameter	Description	Default
nfd.enabled	Deploys Node Feature Discovery plugin as a daemonset. Set this variable to false if NFD is already running in the cluster.	true
operator.defaultRuntime	By default, the operator assumes your Kubernetes deployment is running with docker as its container runtime. Other values are either crio (for CRI-O) or containerd (for containerd).	docker
mig.strategy	Controls the strategy to be used with MIG on supported NVIDIA GPUs. Options are either mixed or single.	single
psp.enabled	The GPU operator deploys PodSecurityPolicies if enabled.	false
driver.enabled	By default, the Operator deploys NVIDIA drivers as a container on the system. Set this value to false when using the Operator on systems with pre-installed drivers.	true
driver.repository	The images are downloaded from NGC. Specify another image repository when using custom driver images.	nvcr.io/nvidia
driver.version	Version of the NVIDIA datacenter driver supported by the Operator.	Depends on the version of the Operator. See the Component Matrix for more information on supported drivers.
driver.rdma.enabled	Controls whether the driver daemonset should build and load the nvidia-peermem kernel module.	false
toolkit.enabled	By default, the Operator deploys the NVIDIA Container Toolkit (nvidia-docker2 stack) as a container on the system. Set this value to false when using the Operator on systems with pre-installed NVIDIA runtimes.	true
migManager.enabled	The MIG manager watches for changes to the MIG geometry and applies reconfiguration as needed. By default, the MIG manager only runs on nodes with GPUs that support MIG (e.g. A100).	true

The above parameters can be used to customize the installation. For example, if NVIDIA drivers are already pre-installed as part of the system image:

helm install --wait --generate-name rocketgpu/gpu-operator --set driver.enabled=false

As another example, to install the GPU Operator with a specific driver version and use a secret to pull the images:

helm install --wait --generate-name \
rocketgpu/gpu-operator \
--set driver.version=$VERSION \
--set driver.imagePullSecrets={$REGISTRY_SECRET_NAME} \
--set driver.licensingConfig.configMapName=licensing-config
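
When the same options are reused across installs, they can be wrapped in a small function. This is a sketch under stated assumptions, not the documented procedure: `install_gpu_operator` is a hypothetical name, and setting operator.defaultRuntime to crio assumes the cluster runs CRI-O (as OpenShift 4 does) and that the Rocket chart honors the parameter as listed in the table above.

```shell
# install_gpu_operator [extra helm flags...]
# Hypothetical wrapper around the install command from this step.
# Assumption: the cluster uses CRI-O, so the runtime default is overridden.
install_gpu_operator() {
    helm install --wait --generate-name rocketgpu/gpu-operator \
        --set operator.defaultRuntime=crio \
        "$@"
}

# Usage:
#   install_gpu_operator --set driver.enabled=false
```

Any additional --set flags passed to the function are appended after the fixed ones, so they take effect alongside the runtime setting.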

5. Verify the successful installation of the NVIDIA GPU Operator

Run the command below to verify that the NVIDIA GPU Operator was installed successfully. In the example output, every pod is either Running or Completed.

oc get pods,daemonset -n gpu-operator-resources

NAME	READY	STATUS	RESTARTS	AGE
pod/gpu-feature-discovery-vwhnt	1/1	Running	0	6m32s
pod/nvidia-container-toolkit-daemonset-k8x28	1/1	Running	0	6m33s
pod/nvidia-cuda-validator-xr5sz	0/1	Completed	0	90s
pod/nvidia-dcgm-5grvn	1/1	Running	0	6m32s
pod/nvidia-dcgm-exporter-cp8ml	1/1	Running	0	6m32s
pod/nvidia-device-plugin-daemonset-p9dp4	1/1	Running	0	6m32s
pod/nvidia-device-plugin-validator-mrhst	0/1	Completed	0	48s
pod/nvidia-driver-daemonset-pbplc	1/1	Running	0	6m33s
pod/nvidia-node-status-exporter-s2ml2	1/1	Running	0	6m33s
pod/nvidia-operator-validator-44jdf	1/1	Running	0	6m32s

NAME	DESIRED	CURRENT	READY	UP-TO-DATE	AVAILABLE	NODE SELECTOR	AGE
daemonset.apps/gpu-feature-discovery	1	1	1	1	1	nvidia.com/gpu.deploy.gpu-feature-discovery=true	6m32s
daemonset.apps/nvidia-container-toolkit-daemonset	1	1	1	1	1	nvidia.com/gpu.deploy.container-toolkit=true	6m33s
daemonset.apps/nvidia-dcgm	1	1	1	1	1	nvidia.com/gpu.deploy.dcgm=true	6m33s
daemonset.apps/nvidia-dcgm-exporter	1	1	1	1	1	nvidia.com/gpu.deploy.dcgm-exporter=true	6m33s
daemonset.apps/nvidia-device-plugin-daemonset	1	1	1	1	1	nvidia.com/gpu.deploy.device-plugin=true	6m33s
daemonset.apps/nvidia-driver-daemonset	1	1	1	1	1	nvidia.com/gpu.deploy.driver=true	6m33s
daemonset.apps/nvidia-mig-manager	0	0	0	0	0	nvidia.com/gpu.deploy.mig-manager=true	6m32s
daemonset.apps/nvidia-node-status-exporter	1	1	1	1	1	nvidia.com/gpu.deploy.node-status-exporter=true	6m34s
daemonset.apps/nvidia-operator-validator	1	1	1	1	1	nvidia.com/gpu.deploy.operator-validator=true	6m33s
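
Scanning the pod list by eye gets tedious on larger clusters. The following sketch filters the listing for pods that are neither Running nor Completed; `unhealthy_pods` is a hypothetical helper name, and it assumes the default column layout of `oc get pods` (name, ready, status, restarts, age).

```shell
# unhealthy_pods reads `oc get pods --no-headers` output on stdin and prints
# the name and status of every pod that is neither Running nor Completed.
# Hypothetical helper; assumes STATUS is the third column.
unhealthy_pods() {
    awk '$3 != "Running" && $3 != "Completed" { print $1, $3 }'
}

# Usage:
#   oc get pods -n gpu-operator-resources --no-headers | unhealthy_pods
```

Empty output means every pod in the namespace is in one of the two expected states.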

------------------------------
Uvaise Ahamed
Rocket Internal - All Brands
------------------------------

5. RE: Installing GPU Operator in Openshift environment for ppc64le

ROCKETEER

Uvaise Ahamed

Posted 12-09-2021 02:39

The GPU Operator distributed via RocketCE is intended exclusively for ppc64le machines. For x86 machines, please follow https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/getting-started.html

------------------------------
Uvaise Ahamed
Rocket Internal - All Brands
------------------------------

Original Message

Original Message:
Sent: 12-06-2021 01:58
From: Yeshwanth Kumar
Subject: Installing GPU Operator in Openshift environment for ppc64le

Can this GPU Operator be used on x86 machines?

------------------------------
Yeshwanth Kumar
Software Engineer
Rocket Forum Shared Account

6. RE: Installing GPU Operator in Openshift environment for ppc64le

ROCKETEER

Yeshwanth Kumar

Posted 12-06-2021 01:59

Can we use Kubernetes instead of OpenShift?

------------------------------
Yeshwanth Kumar
Software Engineer
Rocket Forum Shared Account
------------------------------

7. RE: Installing GPU Operator in Openshift environment for ppc64le

ROCKETEER

Uvaise Ahamed

Posted 12-09-2021 02:41

We support the GPU Operator on OpenShift. You can try installing it on Kubernetes by following the Kubernetes-specific documentation at https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/getting-started.html

------------------------------
Uvaise Ahamed
Rocket Internal - All Brands
------------------------------

RocketCE for Power

Installing GPU Operator in Openshift environment for ppc64le

Uvaise Ahamed11-12-2021 08:52

Suganya Koteeswaran12-01-2021 07:06

Suyog Jadhav12-03-2021 07:58

Yeshwanth Kumar12-06-2021 01:58

Uvaise Ahamed12-09-2021 02:39

Yeshwanth Kumar12-06-2021 01:59

Uvaise Ahamed12-09-2021 02:41
