WML CE in OpenShift
Follow these steps to deploy IBM Watson® Machine Learning Community Edition (WML CE) with Distributed Deep Learning (DDL) directly into your enterprise private cloud with Red Hat OpenShift 3.x, using either TCP or InfiniBand communication between the worker nodes.
Before you begin
Install the OpenShift and Helm CLIs to deploy your application from the command line. After installing the CLIs, add the IBM® Helm chart repository.
Refer to the wmlce-openshift README for more details.
Deploying WML CE DDL with TCP cross node communication
- Create container SSH keys as a Kubernetes secret.
mkdir -p .tmp
yes | ssh-keygen -N "" -f .tmp/id_rsa
oc create secret generic sshkeys-secret --from-file=id_rsa=.tmp/id_rsa --from-file=id_rsa.pub=.tmp/id_rsa.pub
- Deploy the WML CE OpenShift Helm Chart with DDL enabled:
helm install --name ddl-instance --set license=accept Path_of_Chart --set resources.gpu=8 \
  --set paiDistributed.mode=true --set paiDistributed.sshKeySecret=sshkeys-secret
- --name: The deployment name.
- --set resources.gpu: The total number of requested GPUs.
- --set paiDistributed.mode: Enables Distributed Deep Learning.
- --set paiDistributed.sshKeySecret: The name of the Kubernetes secret that contains the SSH keys.
- Verify that the pods were created and wait until they are in a running and ready state. One pod is created per worker node; DDL deployments always take all the GPUs of a node. To get more information about a pod, run:
kubectl describe pod pod_name
oc get pod -l app=ddl-instance-ibm-wmlce
NAME READY STATUS RESTARTS AGE
ddl-instance-ibm-wmlce-0 1/1 Running 0 30s
ddl-instance-ibm-wmlce-1 1/1 Running 0 30s
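Because DDL takes every GPU on a node, the pod count follows from the GPU request. A minimal sketch of the arithmetic, assuming 4-GPU worker nodes (the node size here is an assumption, not something the chart reports):

```shell
# Pods created = ceiling(total requested GPUs / GPUs per node).
# gpus_per_node=4 is an assumed node size; resources.gpu=8 matches the deploy above.
total_gpus=8
gpus_per_node=4
pods=$(( (total_gpus + gpus_per_node - 1) / gpus_per_node ))
echo "pods: $pods"   # prints "pods: 2"
```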
- Train the model with DDL:
ddlrun --mpiarg '-mca btl_tcp_if_include eth0 -x NCCL_SOCKET_IFNAME=eth0' --tcp --hostfile /wmlce/config/hostfile \
  python $CONDA_PREFIX/tf_cnn_benchmarks/tf_cnn_benchmarks.py --model resnet50 --batch_size 64 --variable_update=ddl
- --mpiarg: The network interface to use for MPI and NCCL. In this example, eth0 connects the different nodes.
- --hostfile: Use the auto-generated host file available inside the pod.
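The exact contents of /wmlce/config/hostfile are generated by the chart; the layout sketched below (one "host slots=N" line per pod, the usual Open MPI hostfile format) is an assumption for illustration, with host names mirroring the pod names and assumed slot counts:

```shell
# Illustrative hostfile with two worker pods and 4 slots (GPUs) each.
# Host names and slot counts are assumptions for this sketch.
cat > /tmp/hostfile <<'EOF'
ddl-instance-ibm-wmlce-0 slots=4
ddl-instance-ibm-wmlce-1 slots=4
EOF
# One line per worker pod:
wc -l < /tmp/hostfile    # prints 2
```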
The run output should display the IBM Corp. DDL banner and, for this model, the total images/sec.
I 20:42:52.209 12173 12173 DDL:29 ] [MPI:0 ] ==== IBM Corp. DDL 1.1.0 + (MPI 3.1) ====
...
----------------------------------------------------------------
total images/sec: 2284.62
----------------------------------------------------------------
- Delete your deployment.
helm delete ddl-instance --purge
Using the host network on a container
The Helm Chart provides the option to use the host network for communication to get better performance. The potential disadvantage of this option is that all the host network interfaces will be visible inside the container. The host SSH uses port 22, so a different port must be selected when using this option. The following steps are an example of deploying and running with the host network:
- Deploy the WML CE Helm Chart:
helm install --name ddl-instance --set license=accept Path_of_Chart --set resources.gpu=8 \
  --set paiDistributed.mode=true --set paiDistributed.sshKeySecret=sshkeys-secret \
  --set paiDistributed.useHostNetwork=true --set paiDistributed.sshPort=2200
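With the host network, the pod shares the node's network namespace, so the container sshd cannot bind port 22 (the node's own sshd holds it). A bash-only sketch for probing whether a candidate port is already bound on the local host (2200 matches the value above):

```shell
port=2200
# /dev/tcp/HOST/PORT is a bash redirection path: opening it succeeds
# only if something is listening on that port.
if (exec 3<>/dev/tcp/127.0.0.1/"$port") 2>/dev/null; then
  exec 3>&- 3<&-
  echo "port $port is in use"
else
  echo "port $port is free"
fi
```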
- After getting a shell to the first pod, train the model with DDL:
oc exec -it ddl-instance-ibm-wmlce-0 bash
ddlrun --mpiarg '-mca btl_tcp_if_exclude lo,docker0,docker_gwbridge' --tcp --hostfile /wmlce/config/hostfile \
  python $CONDA_PREFIX/tf_cnn_benchmarks/tf_cnn_benchmarks.py --model resnet50 --batch_size 64 --variable_update=ddl
Note: If non-routable network interfaces are present on the machines, use --mpiarg to specify the network interface for communication, where routable_interface is the interface to use:
--mpiarg '-x NCCL_SOCKET_IFNAME=routable_interface'
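To see which interface names exist on a node before choosing one, listing /sys/class/net avoids depending on ip or ifconfig being installed; the names you get back are whatever interfaces your nodes actually have:

```shell
# Every network interface on a Linux host appears as an entry here.
# "lo" (loopback) is one you would exclude, as in the ddlrun example above.
ls /sys/class/net
```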
Deploying WML CE DDL with InfiniBand cross node communication
- Increase the ulimit settings for the Docker daemon by adding this option to the Docker daemon configuration on all the worker nodes:
--default-ulimit memlock=-1
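One way to make the memlock setting persistent is the Docker daemon configuration file. This sketch assumes the conventional /etc/docker/daemon.json location but writes a local copy so it is safe to try; merge the fragment into any existing daemon.json rather than overwriting it:

```shell
# default-ulimits is the daemon.json counterpart of the --default-ulimit flag;
# -1 means unlimited locked memory, which RDMA over InfiniBand needs.
cat > daemon.json <<'EOF'
{
  "default-ulimits": {
    "memlock": { "Name": "memlock", "Hard": -1, "Soft": -1 }
  }
}
EOF
cat daemon.json
```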
- Create container SSH keys as a Kubernetes secret:
mkdir -p .tmp
yes | ssh-keygen -N "" -f .tmp/id_rsa
oc create secret generic sshkeys-secret --from-file=id_rsa=.tmp/id_rsa --from-file=id_rsa.pub=.tmp/id_rsa.pub
- Deploy the InfiniBand device plugin:
- Install the latest Mellanox OFED (MOFED) user-space driver in a WML CE Docker container:
- Download the latest MOFED into the container.
- Install the needed packages, decompress the archive, and run the installer:
sudo apt-get update; sudo apt-get install -y lsb-release perl
tar -xzvf MLNX_OFED_LINUX-*
MLNX_OFED_LINUX-*-ppc64le/mlnxofedinstall --user-space-only --without-fw-update --all -q
- Create a Docker image from this container and store it in a registry accessible by all the worker nodes.
- Deploy the WML CE Helm Chart with InfiniBand communication:
helm install --name ddl-instance --set license=accept Path_of_Chart --set resources.gpu=8 \
  --set paiDistributed.mode=true --set paiDistributed.sshKeySecret=sshkeys-secret \
  --set paiDistributed.useInfiniBand=true --set image.repository=my_docker_repo --set image.tag=wmlce-mofed
- --name: The deployment name.
- --set resources.gpu: The total number of requested GPUs.
- --set paiDistributed.mode: Enables Distributed Deep Learning.
- --set paiDistributed.sshKeySecret: The name of the Kubernetes secret that contains the SSH keys.
- --set paiDistributed.useInfiniBand: Use InfiniBand for communication.
- --set image.repository: The repository that contains the WML CE image with MOFED installed.
- --set image.tag: The tag of the WML CE image with MOFED installed.
- Check that the pods were created and wait until they are in a running and ready state. One pod is created per worker node; DDL deployments always take all the GPUs of a node. To get more information about a pod, run:
kubectl describe pod pod_name
oc get pod -l app=ddl-instance-ibm-wmlce
NAME READY STATUS RESTARTS AGE
ddl-instance-ibm-wmlce-0 1/1 Running 0 30s
ddl-instance-ibm-wmlce-1 1/1 Running 0 30s
- Get a shell to the first pod and run the activation script. This example uses the TensorFlow High-Performance benchmarks (tf_cnn_benchmarks):
oc exec -it ddl-instance-ibm-wmlce-0 bash
- Train the model with DDL using InfiniBand.
ddlrun -m bn --hostfile /wmlce/config/hostfile python $CONDA_PREFIX/tf_cnn_benchmarks/tf_cnn_benchmarks.py \
  --model resnet50 --batch_size 64 --variable_update=ddl
- --hostfile: Use the auto-generated host file available inside the pod.
The run output should display the IBM Corp. DDL banner and, for this model, the total images/sec.
I 20:42:52.209 12173 12173 DDL:29 ] [MPI:0 ] ==== IBM Corp. DDL 1.1.0 + (MPI 3.1) ====
...
----------------------------------------------------------------
total images/sec: 2284.62
----------------------------------------------------------------
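If you save the run output to a log file, the throughput line is easy to pull out; the log snippet below mirrors the sample output above:

```shell
# Write a minimal log containing the throughput line from the sample output.
cat > /tmp/ddl.log <<'EOF'
----------------------------------------------------------------
total images/sec: 2284.62
----------------------------------------------------------------
EOF
# Print only the number after "total images/sec: ".
awk -F': ' '/total images\/sec/ {print $2}' /tmp/ddl.log   # prints 2284.62
```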
- Delete your deployment.
helm delete ddl-instance --purge