Machine learning offers the promise of revolutionizing AI. But to fulfill that promise, it needs to move from something only a few experts can use to a mainstream discipline. Until recently, only those experts could navigate the complex patchwork of tools and processes required to build and deploy an ML system.
MLOps is an innovative approach that integrates the full ML workflow into an easy-to-use package, so that research-domain experts can do their work without becoming intimately familiar with all the underlying components. MLOps lets developers release models to production quickly, using DevOps-like processes and ML automation for reproducibility.
MLOps is not just for large organizations, either. Any enterprise that works with more than a few trivial models will benefit from optimizing its ML investment in both people and resources. As your organization scales, less integrated approaches become expensive very quickly.
This article argues that the open source Kubeflow framework is the best foundation for an MLOps platform. It integrates a set of powerful components backed by strong, diverse community support. And with just a few enhancements, it can offer the best of both worlds: compatibility with a popular, community-based standard and the capabilities that enterprise customers demand. Finally, we explain how you can benefit from this combination of advantages today.
No structure can be stronger than its foundation, and the most powerful and extensible foundation available today is Kubeflow.
Kubeflow is a Kubernetes-based, open-source framework that integrates the key components necessary to develop and deploy complex machine learning models. It has a number of characteristics that make it ideal as the primary building block for an enterprise MLOps system.
Kubeflow is not built as a unified platform. Instead, it consists of a collection of components, packaged together through manifests. This makes it flexible and easy to customize. A sophisticated team can choose the specific components that apply to a particular workflow, thus providing a balance of flexibility and simplicity.
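To illustrate this composability, a team's deployment can be described with a Kustomize overlay that lists only the components it wants. The component paths below are hypothetical, modeled loosely on the layout of the kubeflow/manifests repository; a real install must use the paths and pinned versions from the actual release being deployed:

```yaml
# kustomization.yaml -- illustrative sketch only; the resource paths are
# hypothetical and must be taken from the kubeflow/manifests release in use.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  # The components this particular workflow needs:
  - apps/pipeline/upstream/env/platform-agnostic     # pipelines
  - apps/jupyter/notebook-controller/upstream/base   # notebook servers
  - apps/katib/upstream/installs/katib-with-kubeflow # hyperparameter tuning
  # Components this team does not need (e.g. model serving or a particular
  # training operator) are simply left out of the list.
```

The overlay is then rendered and applied with standard tooling, for example `kustomize build . | kubectl apply -f -`, which is what makes it practical to pick exactly the components a given workflow requires.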
Kubeflow is standards-based. Where a powerful and popular standard exists, the Kubeflow framework includes it as part of its manifests.
Kubeflow goes beyond just pulling together existing tools. The diverse set of cloud and enterprise vendors and users in the Kubeflow community offers innovative applications that enhance the workflow.
One aspect of Kubeflow that is not always understood is that it is a reference architecture rather than a product. It is not push-button; you need considerable expertise to get it running and to keep it running. Nor is it meant to be a complete MLOps platform on its own. As powerful and flexible as Kubeflow is, the project expects partners and vendors to provide supported, enterprise-ready solutions on top of it.
Enterprise customers have certain expectations for both capabilities and quality.
To learn more about DKube, please visit www.dkube.io.
The next generation of enterprise applications will increasingly be AI/ML models applied to accelerate existing processes or to solve new problems, such as speeding drug discovery and development in the life sciences. Kubeflow is an open source reference architecture for AI/ML platforms, initiated by Google and contributed to by several IT infrastructure leaders such as IBM, Red Hat, Cisco, Dell, and AWS, supporting on-prem and hybrid deployments of AI/ML.
DKube and VMware are currently working closely to bring Kubeflow- and MLflow-based MLOps to enterprises that need on-prem and multi-cloud implementations of their AI projects.
Over the last decade, enterprises have made heavy investments in High Performance Computing (HPC) to solve complex scientific problems, using Slurm to schedule massively parallel jobs on large clusters of compute nodes with accelerated hardware. AI/ML uses similar hardware for deep learning model training, so enterprises are looking for solutions that support AI/ML model development on top of their existing HPC infrastructure. A recent trend in AI/ML is to use agile MLOps methodologies to productionize models quickly. Marrying the two, MLOps-driven AI/ML development and HPC/Slurm clusters, will lead to much faster adoption of both. This article elaborates on how to combine two popular open source frameworks, Slurm and Kubeflow, to run AI/ML workloads at scale on HPC clusters.
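In practice, the bridge between an MLOps workflow and a Slurm cluster usually comes down to generating and submitting Slurm batch scripts from a pipeline step. The sketch below is purely illustrative, not code from either project: the helper name, resource defaults, and training command are all assumptions, and a real pipeline would hand the rendered script to `sbatch` on the cluster.

```python
# Sketch: rendering a Slurm batch script for a GPU training job that an
# MLOps pipeline step could submit via `sbatch`. Helper name and defaults
# are hypothetical; only the #SBATCH directives are standard Slurm syntax.

def make_slurm_script(job_name, command, gpus=1, time_limit="02:00:00"):
    """Render a minimal Slurm batch script for a single training command."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --gres=gpu:{gpus}",   # request accelerator hardware
        f"#SBATCH --time={time_limit}", # wall-clock limit for the job
        "#SBATCH --output=%x-%j.out",   # log file named after job name and id
        command,
    ]
    return "\n".join(lines) + "\n"

# Example: a 4-GPU training run (the script itself, `srun`, and train.py
# are placeholders for whatever the real workload launches).
script = make_slurm_script("resnet-train",
                           "srun python train.py --epochs 10", gpus=4)
print(script)
```

A pipeline component would write this script to the cluster and invoke `sbatch script.sh` (for example over SSH), then poll the job state, which keeps the scheduling itself entirely in Slurm's hands.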