WML CE in OpenShift
Follow these steps to deploy IBM Watson® Machine Learning Community Edition (WML CE) Distributed Deep Learning (DDL) directly into your enterprise private cloud with Red Hat OpenShift 3.x, using TCP or InfiniBand communication between the worker nodes.
Before you begin
Install the OpenShift and Helm CLI to deploy your application from the command line. After installing the CLI, add the IBM® Helm Chart repository.
Refer to the wmlce-openshift README for more details.
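For example, once the oc and helm command-line tools are installed, the repository can be added like this (the repository URL is the commonly published public location for IBM charts, and the chart name is inferred from the pod names later in this article; confirm both against the wmlce-openshift README):

    # Log in to the OpenShift 3.x cluster (Helm 2-era tooling).
    oc login https://<openshift-master>:8443

    # Add the public IBM Helm chart repository and refresh the local index.
    helm repo add ibm-charts https://raw.githubusercontent.com/IBM/charts/master/repo/stable/
    helm repo update

    # Confirm that the WML CE chart is visible (Helm 2 search syntax).
    helm search ibm-wmlce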
Deploying WML CE DDL with TCP cross-node communication
- Create container SSH keys as a Kubernetes secret.
- Deploy the WML CE OpenShift Helm Chart with DDL enabled:
- name
- The deployment name.
- set resources.gpu
- The total number of requested GPUs.
- set paiDistributed.mode
- Enable Distributed Deep Learning.
- set paiDistributed.sshKeySecret
- Name of the Kubernetes secret containing the SSH keys.
- Verify that the pods were created and wait until they are in a running and ready state. One pod per worker node is created. DDL deployments always take all the GPUs of a node. Run kubectl describe pod pod_name to get more information about a pod.
  NAME                       READY   STATUS    RESTARTS   AGE
  ddl-instance-ibm-wmlce-0   1/1     Running   0          30s
  ddl-instance-ibm-wmlce-1   1/1     Running   0          30s
- Train the model with DDL (a command sketch follows this procedure):
- mpiarg
- The network interface to use for MPI and NCCL. In this example, eth0 connects the different nodes.
- hostfile
- Use the auto generated host file available inside the pod.
The run output should display the IBM Corp. DDL banner and, for this model, the total images/sec:
  I 20:42:52.209 12173 12173 DDL:29 ] [MPI:0 ] ==== IBM Corp. DDL 1.1.0 + (MPI 3.1) ====
  ...
  ----------------------------------------------------------------
  total images/sec: 2284.62
  ----------------------------------------------------------------
- Delete your deployment.
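For reference, here is a minimal command-line sketch of the TCP procedure above, assuming Helm 2 syntax, a release named ddl-instance, a secret named ssh-secret, and the ibm-charts/ibm-wmlce chart path. The chart values (resources.gpu, paiDistributed.mode, paiDistributed.sshKeySecret) and the ddlrun options (hostfile, mpiarg) come from the descriptions above; the specific values, the hostfile path, and the benchmark command are examples only.

    # 1. Create container SSH keys and store them as a Kubernetes secret
    #    (key and secret names are examples).
    ssh-keygen -t rsa -f ./id_rsa -N ''
    kubectl create secret generic ssh-secret \
        --from-file=id_rsa=./id_rsa \
        --from-file=id_rsa.pub=./id_rsa.pub

    # 2. Deploy the WML CE Helm chart with DDL enabled.
    helm install ibm-charts/ibm-wmlce \
        --name ddl-instance \
        --set resources.gpu=4 \
        --set paiDistributed.mode=true \
        --set paiDistributed.sshKeySecret=ssh-secret

    # 3. Wait until one pod per worker node is running and ready.
    kubectl get pods

    # 4. From a shell in the first pod, train over TCP. eth0 is the cross-node
    #    interface in this example; the hostfile path stands for the
    #    auto-generated file inside the pod.
    kubectl exec -it ddl-instance-ibm-wmlce-0 -- bash
    ddlrun --hostfile <auto-generated-hostfile> \
        --mpiarg "-x NCCL_SOCKET_IFNAME=eth0" \
        python tf_cnn_benchmarks.py --model resnet50 --batch_size 64

    # 5. Delete the deployment when you are done (Helm 2).
    helm delete --purge ddl-instance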
Using the host network in a container
The Helm Chart provides the option to use the host network for communication to get better performance. The potential disadvantage of this option is that all the host network interfaces will be visible inside the container. The host SSH uses port 22, so a different port must be selected when using this option. The following steps are an example of deploying and running with the host network:
- Deploy the WML CE Helm Chart:
- After getting a shell to the first pod, train the model with DDL:
Note: If non-routable network interfaces are present on the machines, use mpiarg to specify the network interface for communication, where routable_interface is the interface to use (see the sketch after these steps):
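A sketch of the host-network variant, under the same assumptions as the TCP sketch. This article does not list the chart value names that enable the host network and change the SSH port, so paiDistributed.useHostNetwork and paiDistributed.sshPort below are placeholders; check the chart's values.yaml or the wmlce-openshift README for the real names.

    # Deploy with the host network and a non-default SSH port
    # (the last two value names are placeholders, not confirmed chart values).
    helm install ibm-charts/ibm-wmlce \
        --name ddl-instance \
        --set resources.gpu=4 \
        --set paiDistributed.mode=true \
        --set paiDistributed.sshKeySecret=ssh-secret \
        --set paiDistributed.useHostNetwork=true \
        --set paiDistributed.sshPort=2200

    # From a shell in the first pod, pin MPI and NCCL to a routable interface,
    # because every host interface is now visible inside the container.
    ddlrun --hostfile <auto-generated-hostfile> \
        --mpiarg "--mca btl_tcp_if_include <routable_interface> -x NCCL_SOCKET_IFNAME=<routable_interface>" \
        python tf_cnn_benchmarks.py --model resnet50 --batch_size 64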
Deploying WML CE DDL with InfiniBand cross-node communication
- Increase the ulimit settings for the Docker daemon by adding this option to the Docker daemon configuration on all the worker nodes:
- Create container SSH keys as a Kubernetes secret:
- Deploy the InfiniBand device plugin:
- Install the latest Mellanox OFED (MOFED) user-space drivers in a WML CE Docker container:
- Download the latest MOFED into the container.
- Install the needed packages, decompress the archive, and run the installer:
- Create a Docker image from this container and store it in a registry accessible by all the worker nodes.
- Deploy the WML CE Helm Chart with InfiniBand communication:
- name
- The deployment name.
- set resources.gpu
- The total number of requested GPUs.
- set paiDistributed.mode
- Enable Distributed Deep Learning.
- set paiDistributed.sshKeySecret
- Name of the Kubernetes secret containing the SSH keys.
- set paiDistributed.useInfiniBand
- Use InfiniBand for communication.
- set image.repository
- Repository containing WML CE image with MOFED installed.
- set image.tag
- Tag of the WML CE image with MOFED installed.
- Check that the pods were created and wait until they are in a running and ready state. One pod per worker node is created. DDL deployments always take all the GPUs of a node. Run kubectl describe pod pod_name to get more information about a pod.
  NAME                        READY   STATUS    RESTARTS   AGE
  ddl-instance-ibm-wml-ce-0   1/1     Running   0          30s
  ddl-instance-ibm-wml-ce-1   1/1     Running   0          30s
- Get a shell to the first pod and run the activation script. This example uses the TensorFlow framework with the High-Performance Benchmarks (a command sketch follows this procedure):
- Train the model with DDL using InfiniBand.
- hostfile
- Use the auto generated host file available inside the pod.
The run output should display the IBM Corp. DDL banner and, for this model, the total images/sec:
  I 20:42:52.209 12173 12173 DDL:29 ] [MPI:0 ] ==== IBM Corp. DDL 1.1.0 + (MPI 3.1) ====
  ...
  ----------------------------------------------------------------
  total images/sec: 2284.62
  ----------------------------------------------------------------
- Delete your deployment.
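Finally, a hedged sketch of the InfiniBand procedure above. The memlock ulimit, the device-plugin manifest, the MOFED installer options, and the image names are typical examples rather than confirmed values; the chart values (paiDistributed.useInfiniBand, image.repository, image.tag) come from the descriptions above.

    # 1. On every worker node, raise the default memlock ulimit for containers,
    #    for example in /etc/docker/daemon.json, then restart Docker:
    #      {
    #        "default-ulimits": {
    #          "memlock": { "Name": "memlock", "Hard": -1, "Soft": -1 }
    #        }
    #      }
    sudo systemctl restart docker

    # 2. Create the SSH key secret exactly as in the TCP procedure.

    # 3. Deploy the InfiniBand device plugin (manifest path is a placeholder
    #    for whichever plugin your cluster uses).
    kubectl create -f <infiniband-device-plugin>.yaml

    # 4. Build a WML CE image with the MOFED user space installed. Inside a
    #    running WML CE container, download the MOFED archive from the
    #    Mellanox site, decompress it, and run the installer, for example:
    #      tar -xzf MLNX_OFED_LINUX-<version>.tgz
    #      cd MLNX_OFED_LINUX-<version>
    #      ./mlnxofedinstall --user-space-only --without-fw-update --force
    #    Then commit the container and push it to a registry that all worker
    #    nodes can reach.
    docker commit <container-id> <registry>/<namespace>/wmlce-mofed:v1
    docker push <registry>/<namespace>/wmlce-mofed:v1

    # 5. Deploy the chart with InfiniBand communication enabled.
    helm install ibm-charts/ibm-wmlce \
        --name ddl-instance \
        --set resources.gpu=4 \
        --set paiDistributed.mode=true \
        --set paiDistributed.sshKeySecret=ssh-secret \
        --set paiDistributed.useInfiniBand=true \
        --set image.repository=<registry>/<namespace>/wmlce-mofed \
        --set image.tag=v1

    # 6. From a shell in the first pod, run the activation script and train.
    ddlrun --hostfile <auto-generated-hostfile> \
        python tf_cnn_benchmarks.py --model resnet50 --batch_size 64

    # 7. Delete the deployment when you are done.
    helm delete --purge ddl-instance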