RocketCE for Power


Releasing GPU Operator version 1.9.1 for ppc64le in OpenShift


    ROCKETEER
    Posted 06-03-2022 05:58

    Introduction


    Red Hat OpenShift Container Platform is a security-focused, enterprise-grade hardened Kubernetes platform for deploying and managing clusters at scale, developed and supported by Red Hat. It includes enhancements to Kubernetes so users can easily configure and use GPU resources to accelerate workloads such as deep learning.

    The GPU Operator manages NVIDIA GPU resources in an OpenShift cluster and automates tasks related to bootstrapping GPU nodes. Since the GPU is a special resource in the cluster, a few components must be installed before application workloads can be deployed onto the GPU. These components include the NVIDIA drivers (to enable CUDA), the Kubernetes device plugin, the container runtime, and others such as automatic node labelling and monitoring.

    The NVIDIA GPU Operator uses the operator framework within Kubernetes to automate the management of all NVIDIA software components needed to provision GPUs. These components include the NVIDIA drivers (to enable CUDA), the Kubernetes device plugin for GPUs, the NVIDIA Container Toolkit, automatic node labelling using GPU Feature Discovery (GFD), DCGM-based monitoring, and others.

    Version 1.9.1 of the GPU Operator for ppc64le machines has been released for use in OpenShift cluster environments. It brings new features and some bug fixes, which are listed below; the installation steps follow the features and fixes sections.

    This document describes the new features, improvements, fixed issues, and known issues for the NVIDIA GPU Operator for ppc64le from version 1.7.0 to 1.9.1.

    See the Component Matrix for a list of components included in each release.

    Version 1.9.1 -

    Improvements
    • Improved logic in the driver container for waiting on MOFED driver readiness. This ensures that nvidia-peermem is built and installed correctly.
     Fixed issues
    • Allow the driver container to fall back to using cluster entitlements on Red Hat OpenShift on build failures. This issue exposed itself when using the GPU Operator with some Red Hat OpenShift 4.8.z versions and Red Hat OpenShift 4.9.8. GPU Operator 1.9+ with Red Hat OpenShift 4.9.9+ does not require entitlements.
    • Fixed an issue where DCGM-Exporter did not work correctly with the separate DCGM host engine running in the standalone DCGM pod. The default behavior was changed to use the DCGM host engine embedded in DCGM-Exporter; the standalone DCGM pod is no longer launched by default but can be enabled for use with DGX A100.
    • Updated to the latest Go vendor packages to address known CVEs.
    • Fixed an issue to allow the GPU Operator to work with the CRI-O runtime on Kubernetes.
    • Mount the correct source path for Mellanox OFED 5.x drivers to enable GPUDirect RDMA.

    Version 1.9.0 -

    Features:
    • Support for NVIDIA Data Center GPU Driver version 470.82.01.
    • Support for DGX A100 with DGX OS 5.1+.
    • Support for preinstalled GPU Driver with MIG Manager.
    • Removed the dependency on maintaining active Red Hat OpenShift entitlements to build the GPU driver. Introduced entitlement-free driver builds starting with Red Hat OpenShift 4.9.9.
    • Support for GPUDirect RDMA with preinstalled Mellanox OFED drivers.
    • Support for GPU Operator and operand upgrades on Red Hat OpenShift using the Operator Lifecycle Manager (OLM).
    • Support for NVIDIA Virtual Compute Server 13.1 (vGPU).
    Improvements
    • Automatic detection of the default container runtime used in the cluster; the operator.defaultRuntime parameter is deprecated.
    • GPU Operator and its operands are installed into a single user specified namespace.
    • A loaded Nouveau driver is automatically detected and unloaded as part of the GPU Operator install.
    • Added an option to mount a ConfigMap of self-signed certificates into the driver container. Enables SSL connections to private package repositories.
    Fixed issues
    • Fixed an issue when DCGM Exporter was in CrashLoopBackOff as it could not connect to the DCGM port on the same node.
    Limitations
    • GPUDirect RDMA is only supported with R470 drivers on Ubuntu 20.04 LTS and is not supported on other distributions (e.g. CoreOS, CentOS etc.)
    • The GPU Operator supports GPUDirect RDMA only in conjunction with the Network Operator. The Mellanox OFED drivers can be installed by the Network Operator or pre-installed on the host.
    • Upgrades from v1.8.x to v1.9.x are not supported because GPU Operator 1.9 installs the GPU Operator and its operands into a single namespace, whereas previous GPU Operator versions installed them into different namespaces. Upgrading to GPU Operator 1.9 therefore requires uninstalling pre-1.9 GPU Operator versions before installing GPU Operator 1.9.
    • Collection of GPU metrics in MIG mode is not supported with 470+ drivers.
    • The GPU Operator requires all MIG related configurations to be executed by MIG Manager. Enabling/Disabling MIG and other MIG related configurations directly on the host is discouraged.
    • Fabric Manager (required for NVSwitch based systems) with CentOS 7 is not supported.

    Version 1.8.2 & 1.8.1 -

    Improvements:
    • Added support for user-defined MIG partition configuration via a ConfigMap
    Fixed Issues:
    • Fixed an issue with using the NVIDIA License System in NVIDIA AI Enterprise deployments.
    • Fixed an issue where the driver daemonset was spuriously updated on Red Hat OpenShift, causing repeated restarts in proxy environments.
    • The MIG Manager version was bumped to v0.1.3 to fix an issue when checking whether a GPU was in MIG mode or not. Previously, it would always check for MIG mode directly over the PCIe bus instead of using NVML. Now it checks with NVML when it can, only falling back to the PCIe bus when NVML is not available. Please refer to the Release notes for a complete list of fixed issues.
    • Container Toolkit bumped to version v1.7.1 to fix an issue when using A100 80GB.

    Version 1.8.0 -

    Features:
    • NVIDIA Data Center GPU Driver version 470.57.02 (previously 460.73.01 in 1.7.0).
    • Added support for NVSwitch systems such as HGX A100. The driver container detects the presence of NVSwitches in the system and automatically deploys the Fabric Manager for setting up the NVSwitch fabric.
    • The driver container now builds and loads the nvidia-peermem kernel module when GPUDirect RDMA is enabled and Mellanox devices are present in the system
    • Added support for upgrades of the GPU Operator components. A new k8s-driver-manager component handles upgrades of the NVIDIA drivers on nodes in the cluster.
    • Added a nodeStatusExporter component that exports operator and node metrics in a Prometheus format.
    • NVIDIA DCGM is now deployed as a component of the GPU Operator
    Improvements:
    • Reduced the size of the ClusterPolicy CRD by removing duplicates and redundant fields.
    • The GPU Operator now supports detection of the virtual PCIe topology of the system and makes the topology available to vGPU drivers via a configuration file. The driver container starts the nvidia-topologyd daemon in vGPU configurations.
    • Added support for specifying the 'RuntimeClass' variable via Helm.
    • Added nvidia-container-toolkit images to support CentOS 7 and CentOS 8.
    • nvidia-container-toolkit now supports configuring containerd correctly for K3s.
    • Added new debug options (logging, verbosity levels) for nvidia-container-toolkit
    Limitations:
    • GPUDirect RDMA is only supported with R470 drivers on Ubuntu 20.04 LTS and is not supported on other distributions (e.g. CoreOS, CentOS etc.)
    • The operator supports building and loading of nvidia-peermem only in conjunction with the Network Operator. Use with pre-installed MOFED drivers on the host is not supported. This capability will be added in a future release.
    • Support for DGX A100 with GPU Operator 1.8 will be available in an upcoming patch release.
    • This version of the GPU Operator does not work well on Red Hat OpenShift when a cluster-wide proxy is configured and causes constant restarts of the driver container. This will be fixed in an upcoming patch release, v1.8.2.

    Installation Steps:

    Prerequisites

    Before following the steps in this guide, ensure that your environment has:

    • A working OpenShift cluster up and running with a GPU worker node. See OpenShift Container Platform installation for guidance on installing. Refer to Container Platforms for the support matrix of the GPU Operator releases and the supported container platforms for more information.
    • Access to the OpenShift cluster as a cluster-admin to perform the necessary steps.
    • OpenShift CLI (oc) installed.
    • Red Hat Enterprise Linux (RHEL) 8.x.
    • Ensure that the appropriate Red Hat subscriptions and entitlements for OpenShift are properly enabled. For more details on enabling entitlements, please refer to Step 1 and Step 2 on this page.
    • Install the Node Feature Discovery (NFD) operator using Step 3 of this page.
    • An API key to access images from icr.io. If you don't have one, request an API key for pulling images from icr.io by emailing rocketce@rocketsoftware.com. For more details, refer to the IBM Cloud Docs.

    Verification of Prerequisites

    1. Verify entitlements

    # oc get machineconfig | grep entitlement
    50-entitlement-key-pem                                                                                       2.2.0             4d1h
    50-entitlement-pem                                                                                           2.2.0             4d1h

    # oc get mcp 
    NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
    master   rendered-master-30627edc9bc847e2f6a7a9755561f65c   True      False      False      3              3                   3                     0                      24d
    worker   rendered-worker-8d10dd8f6569ba1505d831e95a6b7d6c   True      False      False      4              4                   4                     0                      24d

    2. Verify Node Feature Discovery (NFD) operator is installed

    NFD can be validated in the web console under Operators -> Installed Operators, or by running the command below from the CLI.

    # oc get deployment -n openshift-operators
    NAME                     READY   UP-TO-DATE   AVAILABLE   AGE
    nfd-controller-manager   1/1     1            1           2d22h
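
    As an additional check, NFD should have labeled the GPU worker node with the NVIDIA PCI vendor ID (10de); the following command should list the GPU node:

    oc get nodes -l feature.node.kubernetes.io/pci-10de.present=true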

    Deployment steps

    The preferred method to deploy the GPU Operator is using Helm.

    1. Create Project: 

              Create a project using the OpenShift console; it will hold the resources created by the GPU Operator. The project can also be created from the CLI, as shown below.
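
    A minimal CLI equivalent, assuming you are logged in as cluster-admin (substitute your own project name for the placeholder):

    oc new-project <project_name>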

    2. Install Helm: 

              Below is the command to install Helm:

     curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 \
     && chmod 700 get_helm.sh \
     && ./get_helm.sh

                 To validate the Helm installation, the command below should return a table of releases if Helm was installed successfully:

    helm list 

     3. Add the Rocket Helm repository for the GPU Operator

                   Add the repository and refresh the local index:

    helm repo add rocketgpu https://rocketsoftware.github.io/gpu-operator \ 
    && helm repo update


                   To validate the repository addition, the command below should list the added rocketgpu repo:

    helm repo list 

    4. Create an image pull secret to store the API key credentials

    a) Create a secret with the API key obtained from Rocket to access the images in icr.io:
    oc --namespace <project> create secret docker-registry icrlogin --docker-server=<registry_URL> --docker-username=iamapikey --docker-password=<api_key_value> --docker-email=<docker_email>
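
    For example, with a project named gpu-operator and the icr.io registry (the e-mail value here is only illustrative; keep your own API key in place of the placeholder):

    oc --namespace gpu-operator create secret docker-registry icrlogin --docker-server=icr.io --docker-username=iamapikey --docker-password=<api_key_value> --docker-email=user@example.com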


    b) Verify the secret creation. The command below lists the secrets in the namespace; the name "icrlogin" must be present.
    oc get secrets --namespace <project>


    c) Store the image pull secret in the Kubernetes service account for the selected project. Every OpenShift project has a Kubernetes service account that is named default. Within the project, you can add the image pull secret to this service account to grant access for pods to pull images from your registry. Deployments that do not specify a service account automatically use the default service account for this OpenShift project.

    Check if an image pull secret already exists for your default service account.
    oc describe serviceaccount default -n <project_name>

    When <none> is displayed in the Image pull secrets entry, no image pull secret exists.

    d) Add the image pull secret to your default service account.

    Example command to add the image pull secret when no image pull secret is defined.
    oc patch -n <project_name> serviceaccount/default -p '{"imagePullSecrets":[{"name": "icrlogin"}]}'

    Example command to add the image pull secret when an image pull secret is already defined.
    oc patch -n <project_name> serviceaccount/default --type='json' -p='[{"op":"add","path":"/imagePullSecrets/-","value":{"name":"icrlogin"}}]'


    e) Verify that your image pull secret was added to your default service account.
    oc describe serviceaccount default -n <project_name>

    Example output

    Name: default
    Namespace: <namespace_name>
    Labels: <none>
    Annotations: <none>
    Image pull secrets: <image_pull_secret_name>
    Mountable secrets: default-token-sh2dx
    Tokens: default-token-sh2dx
    Events: <none>

    If the Image pull secrets entry says <secret> (not found), verify that the image pull secret exists in the same project as your service account by running oc get secrets -n <project_name>.

    5. Installation of GPU Operator

                    The GPU Operator can be installed using the command below, which uses the default configuration.

    helm install --wait --generate-name rocketgpu/gpu-operator -n <project_name> 

    The GPU Operator Helm chart offers a number of customizable options that can be configured depending on your environment. 

     Chart Customization Options: 

    The following options are available when using the Helm chart. These options can be used with --set when installing via Helm. 

    • nfd.enabled (default: true): Deploys the Node Feature Discovery plugin as a daemonset. Set this variable to false if NFD is already running in the cluster.
    • operator.defaultRuntime (default: docker): By default, the operator assumes your Kubernetes deployment is running with docker as its container runtime. Other values are either crio (for CRI-O) or containerd (for containerd).
    • mig.strategy (default: single): Controls the strategy to be used with MIG on supported NVIDIA GPUs. Options are either mixed or single.
    • psp.enabled (default: false): The GPU Operator deploys PodSecurityPolicies if enabled.
    • driver.enabled (default: true): By default, the Operator deploys NVIDIA drivers as a container on the system. Set this value to false when using the Operator on systems with pre-installed drivers.
    • driver.repository (default: icr.io/rocketce): The images are downloaded from NGC. Specify another image repository when using custom driver images.
    • driver.version (default: depends on the version of the Operator; see the Component Matrix for supported drivers): Version of the NVIDIA datacenter driver supported by the Operator.
    • driver.rdma.enabled (default: false): Controls whether the driver daemonset should build and load the nvidia-peermem kernel module.
    • toolkit.enabled (default: true): By default, the Operator deploys the NVIDIA Container Toolkit (nvidia-docker2 stack) as a container on the system. Set this value to false when using the Operator on systems with pre-installed NVIDIA runtimes.
    • migManager.enabled (default: true): The MIG manager watches for changes to the MIG geometry and applies reconfiguration as needed. By default, the MIG manager only runs on nodes with GPUs that support MIG (e.g. A100).

     
     The above parameters can be used to customize the installation. For example, if NVIDIA drivers are already pre-installed as part of the system image:

    helm install --wait --generate-name rocketgpu/gpu-operator --set driver.enabled=false 

      In another scenario, to install the GPU Operator with a specific driver version and use a secret to pull the driver images:

    helm install --wait --generate-name \
        rocketgpu/gpu-operator \
        --set driver.version=$VERSION \
        --set driver.imagePullSecrets={$REGISTRY_SECRET_NAME} \
        --set driver.licensingConfig.configMapName=licensing-config
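
    Once the install command completes, a quick sanity check (assuming the same <project_name> used above) is to confirm that the Helm release and the ClusterPolicy resource exist:

    helm list -n <project_name>
    oc get clusterpolicies.nvidia.com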

     

    6. Verify the successful installation of the NVIDIA GPU Operator

    The commands below describe various ways to verify the successful installation of the NVIDIA GPU Operator.

    oc get pods,daemonset -n gpu-operator-resources

     

    NAME                                           READY   STATUS      RESTARTS   AGE
    pod/gpu-feature-discovery-vwhnt                1/1     Running     0          6m32s
    pod/nvidia-container-toolkit-daemonset-k8x28   1/1     Running     0          6m33s
    pod/nvidia-cuda-validator-xr5sz                0/1     Completed   0          90s
    pod/nvidia-dcgm-5grvn                          1/1     Running     0          6m32s
    pod/nvidia-dcgm-exporter-cp8ml                 1/1     Running     0          6m32s
    pod/nvidia-device-plugin-daemonset-p9dp4       1/1     Running     0          6m32s
    pod/nvidia-device-plugin-validator-mrhst       0/1     Completed   0          48s
    pod/nvidia-driver-daemonset-pbplc              1/1     Running     0          6m33s
    pod/nvidia-node-status-exporter-s2ml2          1/1     Running     0          6m33s
    pod/nvidia-operator-validator-44jdf            1/1     Running     0          6m32s

    NAME                                                 DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                      AGE
    daemonset.apps/gpu-feature-discovery                 1         1         1       1            1           nvidia.com/gpu.deploy.gpu-feature-discovery=true   6m32s
    daemonset.apps/nvidia-container-toolkit-daemonset    1         1         1       1            1           nvidia.com/gpu.deploy.container-toolkit=true       6m33s
    daemonset.apps/nvidia-dcgm                           1         1         1       1            1           nvidia.com/gpu.deploy.dcgm=true                    6m33s
    daemonset.apps/nvidia-dcgm-exporter                  1         1         1       1            1           nvidia.com/gpu.deploy.dcgm-exporter=true           6m33s
    daemonset.apps/nvidia-device-plugin-daemonset        1         1         1       1            1           nvidia.com/gpu.deploy.device-plugin=true           6m33s
    daemonset.apps/nvidia-driver-daemonset               1         1         1       1            1           nvidia.com/gpu.deploy.driver=true                  6m33s
    daemonset.apps/nvidia-mig-manager                    0         0         0       0            0           nvidia.com/gpu.deploy.mig-manager=true             6m32s
    daemonset.apps/nvidia-node-status-exporter           1         1         1       1            1           nvidia.com/gpu.deploy.node-status-exporter=true    6m34s
    daemonset.apps/nvidia-operator-validator             1         1         1       1            1           nvidia.com/gpu.deploy.operator-validator=true      6m33s
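
    As an additional check, nvidia-smi can be run inside the driver daemonset pod; the pod name below is taken from the sample output above, so substitute the name from your own cluster:

    oc exec -n gpu-operator-resources nvidia-driver-daemonset-pbplc -- nvidia-smi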


    7. Running Sample GPU Applications using a sample pod

    Create a sample pod as shown below to validate the successful installation of the GPU Operator.

    cat << EOF | kubectl apply -f -
    apiVersion: v1
    kind: Pod
    metadata:
      name: gpu-operator-test
    spec:
      restartPolicy: OnFailure
      containers:
      - name: cuda-vector-add
        image: "quay.io/mgiessing/cuda-sample:vectoradd-cuda11.2.1-ubi8"
        resources:
          limits:
            nvidia.com/gpu: 1
    EOF
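
    Once the pod reaches the Completed state, its log should show the vector addition sample passing:

    oc get pod gpu-operator-test
    oc logs gpu-operator-test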


    8. Running Sample GPU Applications using a sample deployment

    Alternatively, apply a sample deployment file and verify the logs after the pod is in the Running state; the apply and verification commands are shown after the YAML below.

    [root@cp4d-2-bastion sample_ppc64le]# cat cuda-demo1.yaml
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: cuda-demo
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: cuda-demo
      template:
        metadata:
          labels:
            app: cuda-demo
        spec:
          containers:
            - name: cuda-demo
              image: nvidia/cuda-ppc64le:11.4.0-runtime
              command: ["/bin/sh", "-c"]
              args: ["nvidia-smi && tail -f /dev/null"]
              resources:
                limits:
                  nvidia.com/gpu: 1
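
    To apply the deployment and check the nvidia-smi output in the container log (the namespace flag assumes the project created earlier):

    oc apply -f cuda-demo1.yaml -n <project_name>
    oc get pods -n <project_name> -l app=cuda-demo
    oc logs deployment/cuda-demo -n <project_name>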



    9. Uninstall

    To uninstall the operator:

    helm delete -n gpu-operator $(helm list -n gpu-operator | grep gpu-operator | awk '{print $1}')

    You should now see all the pods being deleted:

    oc get pods -n gpu-operator-resources
    No resources found.

    Also, ensure that the ClusterPolicy CRD created during the operator install has been removed:

    oc get crds | grep -i clusterpolicies.nvidia.com

    If it is still present, delete it:

    oc delete crd clusterpolicies.nvidia.com

    Note:
    After uninstalling the GPU Operator, the NVIDIA driver kernel modules might still be loaded. Either reboot the node or unload them using the following command:

    sudo rmmod nvidia_modeset nvidia_uvm nvidia
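
    To confirm the modules are no longer loaded, the following should return no output:

    lsmod | grep nvidia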




    ------------------------------
    Uvaise Ahamed
    Rocket Internal - All Brands
    ------------------------------