WML CE in OpenShift
Follow these steps to deploy IBM Watson® Machine Learning Community Edition (WML CE) Distributed Deep Learning (DDL) directly into your enterprise private cloud with Red Hat OpenShift 3.x, using TCP or InfiniBand communication between the worker nodes.
Before you begin
Install the OpenShift and Helm CLI to deploy your application from the command line. After installing the CLI, add the IBM® Helm Chart repository.
Refer to the wmlce-openshift README for more details.
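For example, once the oc and helm command-line tools are installed, the repository can be added like this (the repository URL is the commonly published public location for IBM charts, and the chart name is inferred from the pod names later in this article; confirm both against the wmlce-openshift README):

    # Log in to the OpenShift 3.x cluster (Helm 2-era tooling).
    oc login https://<openshift-master>:8443

    # Add the public IBM Helm chart repository and refresh the local index.
    helm repo add ibm-charts https://raw.githubusercontent.com/IBM/charts/master/repo/stable/
    helm repo update

    # Confirm that the WML CE chart is visible (Helm 2 search syntax).
    helm search ibm-wmlce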
Deploying WML CE DDL with TCP cross-node communication
- Create container SSH keys as a Kubernetes secret.
- Deploy the WML CE OpenShift Helm Chart with DDL enabled:
- name
- The deployment name.
- set resources.gpu
- The total number of requested GPUs.
- set paiDistributed.mode
- Enable Distributed Deep Learning.
- set paiDistributed.sshKeySecret
- Name of the Kubernetes secret containing the SSH keys.
- Verify that the pods were created and wait until they are in a running and ready state. One pod per worker node is created. DDL deployments always take all the GPUs of a node. Run kubectl describe pod pod_name to get more information about a pod.
  NAME                       READY   STATUS    RESTARTS   AGE
  ddl-instance-ibm-wmlce-0   1/1     Running   0          30s
  ddl-instance-ibm-wmlce-1   1/1     Running   0          30s
- Train the model with DDL (a command sketch follows this procedure):
- mpiarg
- The network interface to use for MPI and NCCL. In this example, eth0 connects the different nodes.
- hostfile
- Use the auto generated host file available inside the pod.
The run output should display the IBM Corp. DDL banner and, for this model, the total images/sec:
  I 20:42:52.209 12173 12173 DDL:29 ] [MPI:0 ] ==== IBM Corp. DDL 1.1.0 + (MPI 3.1) ====
  ...
  ----------------------------------------------------------------
  total images/sec: 2284.62
  ----------------------------------------------------------------
- Delete your deployment.
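For reference, here is a minimal command-line sketch of the TCP procedure above, assuming Helm 2 syntax, a release named ddl-instance, a secret named ssh-secret, and the ibm-charts/ibm-wmlce chart path. The chart values (resources.gpu, paiDistributed.mode, paiDistributed.sshKeySecret) and the ddlrun options (hostfile, mpiarg) come from the descriptions above; the specific values, the hostfile path, and the benchmark command are examples only.

    # 1. Create container SSH keys and store them as a Kubernetes secret
    #    (key and secret names are examples).
    ssh-keygen -t rsa -f ./id_rsa -N ''
    kubectl create secret generic ssh-secret \
        --from-file=id_rsa=./id_rsa \
        --from-file=id_rsa.pub=./id_rsa.pub

    # 2. Deploy the WML CE Helm chart with DDL enabled.
    helm install ibm-charts/ibm-wmlce \
        --name ddl-instance \
        --set resources.gpu=4 \
        --set paiDistributed.mode=true \
        --set paiDistributed.sshKeySecret=ssh-secret

    # 3. Wait until one pod per worker node is running and ready.
    kubectl get pods

    # 4. From a shell in the first pod, train over TCP. eth0 is the cross-node
    #    interface in this example; the hostfile path stands for the
    #    auto-generated file inside the pod.
    kubectl exec -it ddl-instance-ibm-wmlce-0 -- bash
    ddlrun --hostfile <auto-generated-hostfile> \
        --mpiarg "-x NCCL_SOCKET_IFNAME=eth0" \
        python tf_cnn_benchmarks.py --model resnet50 --batch_size 64

    # 5. Delete the deployment when you are done (Helm 2).
    helm delete --purge ddl-instance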
Using the host network in a container
The Helm Chart provides the option to use the host network for communication to get better performance. The potential disadvantage of this option is that all the host network interfaces will be visible inside the container. The host SSH uses port 22, so a different port must be selected when using this option. The following steps are an example of deploying and running with the host network:
- Deploy the WML CE Helm Chart:
- After getting a shell to the first pod, train the model with DDL:
Note: If non-routable network interfaces are present on the machines, use mpiarg to specify the network interface for communication, where routable_interface is the interface to use (see the sketch after these steps):
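A sketch of the host-network variant, under the same assumptions as the TCP sketch. This article does not list the chart value names that enable the host network and change the SSH port, so paiDistributed.useHostNetwork and paiDistributed.sshPort below are placeholders; check the chart's values.yaml or the wmlce-openshift README for the real names.

    # Deploy with the host network and a non-default SSH port
    # (the last two value names are placeholders, not confirmed chart values).
    helm install ibm-charts/ibm-wmlce \
        --name ddl-instance \
        --set resources.gpu=4 \
        --set paiDistributed.mode=true \
        --set paiDistributed.sshKeySecret=ssh-secret \
        --set paiDistributed.useHostNetwork=true \
        --set paiDistributed.sshPort=2200

    # From a shell in the first pod, pin MPI and NCCL to a routable interface,
    # because every host interface is now visible inside the container.
    ddlrun --hostfile <auto-generated-hostfile> \
        --mpiarg "--mca btl_tcp_if_include <routable_interface> -x NCCL_SOCKET_IFNAME=<routable_interface>" \
        python tf_cnn_benchmarks.py --model resnet50 --batch_size 64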
Deploying WML CE DDL with InfiniBand cross-node communication
- Increase the ulimit settings for the Docker daemon by adding this option to the Docker daemon configuration on all the worker nodes:
- Create container SSH keys as a Kubernetes secret:
- Deploy the InfiniBand device plugin:
- Install the latest Mellanox OFED (MOFED) user-space drivers in a WML CE Docker container:
- Download the latest MOFED into the container.
- Install the needed packages, decompress the archive, and run the installer:
- Create a Docker image from this container and store it in a registry accessible by all the worker nodes.
- Deploy the WML CE Helm Chart with InfiniBand communication:
- name
- The deployment name.
- set resources.gpu
- The total number of requested GPUs.
- set paiDistributed.mode
- Enable Distributed Deep Learning.
- set paiDistributed.sshKeySecret
- Name of the Kubernetes secret containing the SSH keys.
- set paiDistributed.useInfiniBand
- Use InfiniBand for communication.
- set image.repository
- Repository containing WML CE image with MOFED installed.
- set image.tag
- Tag of the WML CE image with MOFED installed.
- Check that the pods were created and wait until they are in a running and ready state. One pod per worker node is created. DDL deployments always take all the GPUs of a node. Run kubectl describe pod pod_name to get more information about a pod.
  NAME                        READY   STATUS    RESTARTS   AGE
  ddl-instance-ibm-wml-ce-0   1/1     Running   0          30s
  ddl-instance-ibm-wml-ce-1   1/1     Running   0          30s
- Get a shell to the first pod and run the activation script. This example uses the TensorFlow framework with the High-Performance Benchmarks (a command sketch follows this procedure):
- Train the model with DDL using InfiniBand.
- hostfile
- Use the auto generated host file available inside the pod.
The run output should display the IBM Corp. DDL banner and, for this model, the total images/sec:
  I 20:42:52.209 12173 12173 DDL:29 ] [MPI:0 ] ==== IBM Corp. DDL 1.1.0 + (MPI 3.1) ====
  ...
  ----------------------------------------------------------------
  total images/sec: 2284.62
  ----------------------------------------------------------------
- Delete your deployment.
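Finally, a hedged sketch of the InfiniBand procedure above. The memlock ulimit, the device-plugin manifest, the MOFED installer options, and the image names are typical examples rather than confirmed values; the chart values (paiDistributed.useInfiniBand, image.repository, image.tag) come from the descriptions above.

    # 1. On every worker node, raise the default memlock ulimit for containers,
    #    for example in /etc/docker/daemon.json, then restart Docker:
    #      {
    #        "default-ulimits": {
    #          "memlock": { "Name": "memlock", "Hard": -1, "Soft": -1 }
    #        }
    #      }
    sudo systemctl restart docker

    # 2. Create the SSH key secret exactly as in the TCP procedure.

    # 3. Deploy the InfiniBand device plugin (manifest path is a placeholder
    #    for whichever plugin your cluster uses).
    kubectl create -f <infiniband-device-plugin>.yaml

    # 4. Build a WML CE image with the MOFED user space installed. Inside a
    #    running WML CE container, download the MOFED archive from the
    #    Mellanox site, decompress it, and run the installer, for example:
    #      tar -xzf MLNX_OFED_LINUX-<version>.tgz
    #      cd MLNX_OFED_LINUX-<version>
    #      ./mlnxofedinstall --user-space-only --without-fw-update --force
    #    Then commit the container and push it to a registry that all worker
    #    nodes can reach.
    docker commit <container-id> <registry>/<namespace>/wmlce-mofed:v1
    docker push <registry>/<namespace>/wmlce-mofed:v1

    # 5. Deploy the chart with InfiniBand communication enabled.
    helm install ibm-charts/ibm-wmlce \
        --name ddl-instance \
        --set resources.gpu=4 \
        --set paiDistributed.mode=true \
        --set paiDistributed.sshKeySecret=ssh-secret \
        --set paiDistributed.useInfiniBand=true \
        --set image.repository=<registry>/<namespace>/wmlce-mofed \
        --set image.tag=v1

    # 6. From a shell in the first pod, run the activation script and train.
    ddlrun --hostfile <auto-generated-hostfile> \
        python tf_cnn_benchmarks.py --model resnet50 --batch_size 64

    # 7. Delete the deployment when you are done.
    helm delete --purge ddl-instance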