The Duchess of Windsor famously said that you could not be too rich or too thin. And whether or not that is correct, a similar observation is definitely true when trying to match deep learning applications and compute resources: you cannot have enough horsepower.
Intractable problems in fields as diverse as finance, security, medical research, resource exploration, self-driving vehicles, and defense are being solved today by “training” a complex neural network how to behave rather than programming a more traditional computer to take explicit steps. And even though the discipline is still relatively young, the results have been astonishing.
The training process required to take a Deep Learning model and turn it into a computerized savant is extremely resource-intensive. The basic building blocks for the necessary operations are GPUs, and though they are already powerful – and getting more so all the time – the kinds of applications identified will take whatever you can throw at them and ask for more.
In order to achieve the necessary horsepower, the GPUs need to be used in parallel, and there lies the rub. The easiest way to bring more GPUs to bear is to simply add them to the system. This scale-up approach has some real-world limitations. You can only get a certain number of these powerful beasts into a single system within reasonable physical, electrical, and power dissipation constraints. If you want more than that – and you do – you need to scale-out.
This means that you need to provide multiple nodes in a cluster, and for this to be useful the GPUs need to be shared among the nodes. Problem solved, right? It can be – but this approach brings with it a new set of challenges.
The basic challenge is that just combining a whole bunch of compute nodes into a large cluster – and making them work together seamlessly – is not simple. In fact, if it is not done properly, the performance could become worse as you increase the number of GPUs, and the cost could become unattractive.
Mellanox has partnered with One Convergence to solve the problems associated with efficiently scaling on-prem or bare metal cloud Deep Learning systems.
Mellanox supplies end-to-end Ethernet solutions that exceed the most demanding criteria and leave the competition in the dust. For example, we can easily see the performance advantages with TensorFlow over a Mellanox 100GbE network versus a 10GbE network, both taking advantage of RDMA in the chart below
RDMA over Converged Ethernet (RoCE) is a standard protocol which enables RDMA’s efficient data transfer over Ethernet networks allowing transport offload with hardware RDMA engine implementation, and superior performance. While distributed TensorFlow takes full advantage of RDMA to eliminate processing bottlenecks, even with large-scale images the Mellanox 100GbE network delivers the expected performance and exceptional scalability from the 32 NVIDIA Tesla P100 GPUs. For both 25GbE and 100GbE, it’s evident that those who are still using 10GbE are falling short of any return on investment they might have thought they were achieving.
Beyond the performance advantages, the economic benefits of running AI workloads over Mellanox 25/50/100GbE are substantial. Spectrum switches and ConnectX network adapters deliver unbeatable performance at an even more unbeatable price point, yielding an outstanding ROI. With flexible port counts and cable options allowing up to 64 fully redundant 10/25/50/100 GbE ports in a 1U rack space, Mellanox end-to-end Ethernet solutions are a game changer for state-of-the-art data centers that wish to maximize the value of their data.
The performance and cost of the system are the ticket in the door, but an attractive on-prem Deep Learning platform needs to go further; much further. You need to:
Many of these challenges have been overcome for organizations that are able to offload their compute needs to the public cloud. But for those companies that cannot take this path for regulatory, competitive, security, bandwidth or cost reasons, there has not been a satisfactory solution.
In order to achieve the goals highlighted above, One Convergence has created a full stack Deep Learning-as-a-service application called DKube that addresses these challenges for on-prem and bare metal cloud users.
DKube provides a variety of valuable and integrated capabilities:
At KubeCon + CloudNativeCon conference to be held in Seattle from Dec 10-13, 2018, you can see a demo of One Convergence DKube system using Mellanox high speed network adapters that enable Deep Learning-as-a-Service for on-prem platforms or bare metal deep learning cloud.
Reproducibility is the foundational principle of the scientific method. If an experiment cannot be repeated, it is assumed to be faulty. DKube, an end-to-end Kubeflow-based MLOps platform, offers complete reproducibility into an integrated workflow. Without the ability to trace and repeat your work, it is not science
How to deploy a preferred version of a model into production or to first push it to a model catalog for another gatekeeper to test and select a preferred model before pushing into production. The model catalog capability is unique to DKube and minimizes accidental escape to production of a model that may not be ready yet.
How to set-up and execute Kubeflow pipelines through DKube's JuypterLab notebook and how to manage or automate the pipeline runs through the UI. This is one of the hottest features of Kubeflow allowing you to set-up a multi-stage pipeline of data prep, feature engineering, training, and production deployment based on time triggers or other event triggers.