Cloud-based computing has changed the face of IT for the better. It has enabled organizations to make use of high-performance resources without requiring large IT groups. And it has enabled a supply of production-ready applications to companies who might not otherwise be able to access them. This is especially true in the world of Deep Learning, where high performance, scalability, and flexibility are critical.
But what if your organization can’t make use of the public cloud?
All of those benefits would slip through your grasp. At least they would, unless there was a way to offer them with on-premises equipment.
The challenges of developing and deploying your Deep Learning applications on-premises are well known to organizations that have attempted it. Simply getting the platform into an operational state is daunting enough, given the number and complexity of the software components that need to interoperate.
And once you have pieced together that puzzle, the challenges of maintenance and system management raise their head. Furthermore, the platform must provide the performance benefits that are the primary goal of the exercise. Otherwise, what is the point?
Finally, once the software actually performs as expected in the happy path, there is a major uphill battle in making the solution predictable and robust, handling the corner cases, and anticipating the unexpected user behaviors that could break the system. All this is critical to an enterprise-grade solution.
Who wants to do all that? The primary goal of a data scientist is to address messy and difficult real world problems, not to become an IT expert.
NVIDIA®, Lenovo™, Mellanox®, and One Convergence have partnered to bring cloud-native capability to the on-premises Deep Learning community. The companies provide fast, robust scale-out systems that offer the best of both worlds: simplicity and performance. And at an outstanding set of cost points.
Let’s explore in more detail the goals of this partnership, the challenges that need to be overcome, and the resulting solution.
If you ask a dozen different organizations who specialize in Deep Learning what their ideal environment would look like, you will get a surprisingly similar answer. They all suffer from the same obstacles getting their systems to production.
There are different ways to add resources to your system, each with advantages and challenges. The simplest approach, referred to as vertical scale-up, consists of a single highly specialized node, where additional resources are added to the box as needed. This works for many high-performance, compute-intensive applications, but does not always have the highest utilization for Deep Learning.
From a performance perspective, scale-up is the gold standard with the highest throughput, and it offers the most straightforward management footprint.
However, since any resource limit in Deep Learning is bad news, a fully scale-up implementation is seldom enough. A better approach is to expand the resources in a horizontal scale-out manner. Scale-out is implemented by adding nodes to a cluster, and treating them as a single entity from a resource and management perspective. This has a number of advantages:
Of course, the best approach is to allow both scale-up and scale-out, since each has its place. A hybrid approach must make the entire cluster management seamless, regardless of where the resources reside.
Given the clear advantages of horizontal scale-out, you might ask why everyone doesn’t do it that way. Because it’s hard. It’s relatively straightforward to hook up the hardware, but beyond that nothing comes easy, especially in a production environment, when - literally - time is money.
In order to address the requirements outlined, and to overcome the challenges inherent in on-premises computing, Mellanox®, Lenovo™, One Convergence, and NVIDIA® are offering a coordinated, production-ready Deep Learning environment.
The fundamental underpinning of the solution is a set of platforms - hardware and software - that are guaranteed to operate in a production setting. These platforms offer state-of-the-art components that are assembled and pre-tested to meet customer goals.
Lenovo™ is a leader in high performance, affordable data center platforms. The Deep Learning system described in this paper features a ThinkSystem™ SD530 rack server, which forms a solid, scalable foundation for the solution.
NVIDIA® technology has been a driving force in expanding the capabilities of Deep Learning. The Tesla V100 is at the heart of this revolution in artificial intelligence.
Mellanox® has been a leader in high performance connectivity since the company’s founding. The Intelligent ConnectX®-5 adapter cards offer the performance and features needed for on-premises Deep Learning.
The One Convergence DKube application builds on - and brings together - the hardware subsystems and drivers that form the infrastructure.
In order to demonstrate the results of the work that has been described, it is important that the solution - especially leadership performance - can be attained on an actual system built as per these discussions.
Image classification is a typical use case for Deep Learning. Applications for image classification include security, healthcare, self-driving cars, and text recognition. It is very difficult for traditional software algorithms to handle, since each image of the entity is likely to be different; shifted, distorted, misaligned, etc. Deep neural networks are well suited to handling the types of distortions that brittle algorithms struggle with.
However, the nature of these massive neural networks strains traditional compute processors and requires an accelerated computing platform capable of achieving hundreds of teraFLOPS of computation. NVIDIA® V100 GPUs enable rapid training of deep learning models, solving in hours - or minutes - what traditional processors have in the past taken days or weeks to solve.
In order to help data scientists compare different platforms for image classification, a set of benchmarks has been developed that take standard models and measure the training performance.
The benchmark that was used for the measurements in this paper is tf-cnn, a common TensorFlow benchmark. The model is ResNet-50 and the training uses synthetic data.
The system used for the test has the following specifications
The benchmark results demonstrate that the performance scales as more GPUs are added through additional nodes, and that it scales linearly when RDMA is enabled.
This article has outlined the primary reasons for using a scale-out or hybrid approach, rather than simple scale-up, where performance requirements go beyond what can be achieved in a single box. Although a scale-out system is more difficult to deploy and manage, the complexity can be hidden by close cooperation among the vendors, and by a software solution that makes the best use of that cooperation.
Lenovo™, NVIDIA®, Mellanox®, and One Convergence have formed a partnership that addresses the issues inherent in scale-out deployment, and offer a simple, flexible, high-performance set of products that are tested to work together in an enterprise environment.
The companies have demonstrated that the resulting system offers linear performance scaling.
You can find out more about these solutions by visiting www.dkube.io
Reproducibility is the foundational principle of the scientific method. If an experiment cannot be repeated, it is assumed to be faulty. DKube, an end-to-end Kubeflow-based MLOps platform, offers complete reproducibility into an integrated workflow. Without the ability to trace and repeat your work, it is not science
The next generation of enterprise applications will increasingly be AI/ML models applied to accelerate existing processes or solve new problems such as accelerating drug discovery and development in life sciences. Kubeflow is an open source reference architecture for AI/ML platform initiated by Google and contributed by several IT platform infrastructure leaders in the industry such as IBM, Redhat, Cisco, Dell, AWS for on-prem and hybrid deployment of AI/ML.