RocketCE for Power

  Test WML CE in OpenShift

    Posted 09-16-2021 13:12
    Edited by Tim Hill 11-04-2021 13:23

    WML CE in OpenShift

    Follow these steps to deploy IBM Watson® Machine Learning Community Edition (WML CE) with Distributed Deep Learning (DDL) directly into your enterprise private cloud with Red Hat OpenShift 3.x, using either TCP or InfiniBand communication between the worker nodes.

    Before you begin

    Install the OpenShift (oc) and Helm CLIs so that you can deploy the application from the command line. After installing the CLIs, add the IBM® Helm chart repository.

    Refer to the wmlce-openshift README for more details.
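
    For example, a minimal sketch of adding a chart repository with the Helm 2 CLI used throughout this post (the repository name and URL below are placeholders; use the repository given in the wmlce-openshift README):

      # Add the IBM Helm chart repository (name and URL are placeholders)
      helm repo add ibm-charts <IBM_Helm_chart_repository_URL>
      helm repo update
      # Confirm that the WML CE chart is now visible (Helm 2 search syntax)
      helm search wmlce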

    Deploying WML CE DDL with TCP cross node communication

    1. Create container SSH keys as a Kubernetes secret.
      mkdir -p .tmp
      yes | ssh-keygen -N "" -f .tmp/id_rsa
      oc create secret generic sshkeys-secret --from-file=id_rsa=.tmp/id_rsa --from-file=id_rsa.pub=.tmp/id_rsa.pub
    2. Deploy the WML CE OpenShift Helm Chart with DDL enabled:
      helm install --name ddl-instance --set license=accept Path_of_Chart --set resources.gpu=8 --set paiDistributed.mode=true \
      --set paiDistributed.sshKeySecret=sshkeys-secret
      --name: The deployment name.
      --set resources.gpu: The total number of requested GPUs.
      --set paiDistributed.mode: Enables Distributed Deep Learning.
      --set paiDistributed.sshKeySecret: The name of the Kubernetes secret that contains the SSH keys.
    3. Verify that the pods were created and wait until they are in a running and ready state. One pod is created per worker node, and DDL deployments always take all the GPUs of a node. Run kubectl describe pod pod_name to get more information about a pod (a short inspection sketch follows this procedure).
      oc get pod -l app=ddl-instance-ibm-wmlce
      NAME                         READY     STATUS    RESTARTS   AGE
      ddl-instance-ibm-wmlce-0   1/1       Running   0          30s
      ddl-instance-ibm-wmlce-1   1/1       Running   0          30s
    4. Train the model with DDL:
      ddlrun --mpiarg '-mca btl_tcp_if_include eth0 -x NCCL_SOCKET_IFNAME=eth0' --tcp --hostfile /wmlce/config/hostfile \
      python $CONDA_PREFIX/tf_cnn_benchmarks/tf_cnn_benchmarks.py --model resnet50 --batch_size 64 --variable_update=ddl
      --mpiarg: The network interface to use for MPI and NCCL. In this example, eth0 connects the different nodes.
      --hostfile: Use the auto-generated host file available inside the pod.
      The run output should display the IBM Corp. DDL banner and, for this model, the total images/sec.
      I 20:42:52.209 12173 12173 DDL:29 ] [MPI:0 ] ==== IBM Corp. DDL 1.1.0 + (MPI 3.1) ====
      ...
      ----------------------------------------------------------------
      total images/sec: 2284.62
      ----------------------------------------------------------------
    5. Delete your deployment.
      helm delete ddl-instance --purge
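
    A short inspection sketch for this deployment (pod names follow the ddl-instance-ibm-wmlce-N pattern shown above):

      # Describe the first worker pod to see events, GPU requests, and scheduling details
      oc describe pod ddl-instance-ibm-wmlce-0
      # Check the container log for startup errors
      oc logs ddl-instance-ibm-wmlce-0
      # Open a shell in the pod, for example to run the ddlrun command from step 4
      oc exec -it ddl-instance-ibm-wmlce-0 bash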

    Using the host network on a container

    The Helm Chart provides an option to use the host network for communication, which can improve performance. The potential disadvantage of this option is that all of the host network interfaces are visible inside the container. Because the host SSH daemon uses port 22, a different SSH port must be selected when this option is used. The following steps show an example of deploying and running with the host network:
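
    Because the pods share the host network in this mode, the SSH port passed to the chart must be free on every worker host. A quick check, assuming the ss utility is available on the hosts and port 2200 is the port you plan to use:

      # On each worker host: confirm that nothing is already listening on the chosen SSH port
      ss -ltn | grep ':2200' || echo "port 2200 is free"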

    1. Deploy the WML CE Helm Chart:
      helm install --name ddl-instance --set license=accept Path_of_Chart --set resources.gpu=8 \
      --set paiDistributed.mode=true --set paiDistributed.sshKeySecret=sshkeys-secret --set paiDistributed.useHostNetwork=true \
      --set paiDistributed.sshPort=2200
    2. After getting a shell to the first pod, train the model with DDL:
      oc exec -it ddl-instance-ibm-wmlce-0 bash
      ddlrun --mpiarg '-mca btl_tcp_if_exclude lo,docker0,docker_gwbridge' --tcp --hostfile /wmlce/config/hostfile python \
      $CONDA_PREFIX/tf_cnn_benchmarks/tf_cnn_benchmarks.py --model resnet50 --batch_size 64 --variable_update=ddl
      Note: If non-routable network interfaces are present on the machines, use --mpiarg to specify the network interface for communication, where routable_interface is the interface to use (see the sketch after this note):
      --mpiarg '-x NCCL_SOCKET_IFNAME=routable_interface'
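
    Because all of the host interfaces are visible inside the pod in this mode, one way to find a routable interface to pass to NCCL is to list them from inside the pod. This is a minimal sketch, assuming the iproute2 tools are present in the container:

      # List every interface visible inside the pod, with its addresses and state
      ip -brief addr show
      # Pick an interface on the network that connects the worker nodes, then pass it to ddlrun:
      # ddlrun --mpiarg '-x NCCL_SOCKET_IFNAME=<routable_interface>' --tcp --hostfile /wmlce/config/hostfile ...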

    Deploying WML CE DDL with InfiniBand cross node communication

    1. Increase the ulimit settings for containers by adding this option to the Docker daemon startup options on all the worker nodes:
      --default-ulimit memlock=-1
    2. Create container SSH keys as a Kubernetes secret:
      mkdir -p .tmp
      yes | ssh-keygen -N "" -f .tmp/id_rsa
      oc create secret generic sshkeys-secret --from-file=id_rsa=.tmp/id_rsa --from-file=id_rsa.pub=.tmp/id_rsa.pub
    3. Deploy the InfiniBand device plugin:
    4. Install the latest Mellanox OFED (MOFED) user-space drivers in a WML CE Docker container:
      1. Download the latest MOFED archive into the container.
      2. Install the needed packages, decompress the archive, and run the installer:
        sudo apt-get update; sudo apt-get install -y lsb-release perl
        tar -xzvf MLNX_OFED_LINUX-*
        MLNX_OFED_LINUX-*-ppc64le/mlnxofedinstall --user-space-only --without-fw-update --all -q
    5. Create a Docker image from this container and store it in a registry that is accessible by all the worker nodes (see the sketch after this procedure).
    6. Deploy the WML CE Helm Chart with InfiniBand communication:
      helm install --name ddl-instance --set license=accept Path_of_Chart \
      --set resources.gpu=8 --set paiDistributed.mode=true --set paiDistributed.sshKeySecret=sshkeys-secret \
      --set paiDistributed.useInfiniBand=true --set image.repository=my_docker_repo --set image.tag=wmlce-mofed
      --name: The deployment name.
      --set resources.gpu: The total number of requested GPUs.
      --set paiDistributed.mode: Enables Distributed Deep Learning.
      --set paiDistributed.sshKeySecret: The name of the Kubernetes secret that contains the SSH keys.
      --set paiDistributed.useInfiniBand: Use InfiniBand for communication.
      --set image.repository: The repository that contains the WML CE image with MOFED installed.
      --set image.tag: The tag of the WML CE image with MOFED installed.
    7. Check that the pods were created and wait until they are in a running and ready state. One pod per worker node is created. DDL deployments always take all the GPUs of a node. Run kubectl describe pod pod_name to get more information about a pod.
      oc get pod -l app=ddl-instance-ibm-wmlce
      NAME                         READY     STATUS    RESTARTS   AGE
      ddl-instance-ibm-wmlce-0   1/1       Running   0          30s
      ddl-instance-ibm-wmlce-1   1/1       Running   0          30s
    8. Get a shell to the first pod and run the activation script. This example uses the TensorFlow framework with the high-performance benchmarks (tf_cnn_benchmarks):
      oc exec -it ddl-instance-ibm-wmlce-0 bash
    9. Train the model with DDL by using InfiniBand:
      ddlrun -m bn --hostfile /wmlce/config/hostfile python $CONDA_PREFIX/tf_cnn_benchmarks/tf_cnn_benchmarks.py \
      --model resnet50 --batch_size 64 --variable_update=ddl
      --hostfile: Use the auto-generated host file available inside the pod.
      The run output should display the IBM Corp. DDL banner and, for this model, the total images/sec.
      I 20:42:52.209 12173 12173 DDL:29 ] [MPI:0 ] ==== IBM Corp. DDL 1.1.0 + (MPI 3.1) ====
      ...
      ----------------------------------------------------------------
      total images/sec: 2284.62
      ----------------------------------------------------------------
    10. Delete your deployment.
      helm delete ddl-instance --purge
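
    A minimal sketch of step 5, assuming the container from step 4 is named wmlce-mofed-build (a hypothetical name) and that my_docker_repo and wmlce-mofed are the image.repository and image.tag values used in step 6:

      # Commit the container that has the MOFED user-space drivers installed into an image
      docker commit wmlce-mofed-build my_docker_repo:wmlce-mofed
      # Push the image to a registry that every worker node can pull from
      docker push my_docker_repo:wmlce-mofed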


    ------------------------------
    Dave Andrews
    Head of Customer Engagement
    Rocket Software
    South Salem NY United States
    ------------------------------