Over the last decade, enterprises have invested heavily in High Performance Computing (HPC) to solve complex scientific problems. They have used Slurm to schedule these massively parallel jobs on large clusters of compute nodes with accelerated hardware. AI/ML uses similar hardware for deep learning model training, and enterprises are looking for solutions that support AI/ML model development on top of their existing HPC infrastructure. A recent trend in AI/ML is to use agile MLOps methodologies to productionize models quickly. Marrying the two - AI/ML development using MLOps with HPC/Slurm clusters - should speed the adoption of AI/ML on existing HPC infrastructure. This article elaborates on how to combine two popular open-source frameworks, Slurm and Kubeflow, to run AI/ML workloads at scale on HPC clusters.
High Performance Computing is used by specialized engineering and scientific applications. HPC workloads require a system that can perform extremely complex operations on massive datasets. A typical system contains a large number of compute nodes and a storage subsystem connected via an extremely fast network. The number of compute nodes can range from tens to tens of thousands, depending on the complexity and scale of the problem being solved. These compute nodes generally have compute accelerators such as GPUs.
Slurm is a very popular open-source platform that allows compute jobs to be scheduled on large Linux clusters. The platform is highly scalable and resilient. It is typically used on HPC clusters to distribute workloads and solve complex scientific problems. In the past, programs on compute nodes ran as Linux processes, but more recently they have started to move to Linux containers. Singularity has become a popular way to run containers on a Slurm cluster, and it provides tools to convert Docker containers to Singularity containers.
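To make the container workflow concrete, here is a minimal Python sketch of the two steps described above: converting a Docker image to a Singularity image, and wrapping a containerized command in a Slurm batch script. The helper names, the image name, the file names, and the resource values are all hypothetical and chosen only for illustration. The sketch only builds the command and the script text; it does not run anything on a cluster.

```python
import shlex
from typing import List


def singularity_build_cmd(docker_image: str, sif_path: str) -> List[str]:
    """Build the command that converts a Docker image to a Singularity image file."""
    # `singularity build <target> docker://<image>` pulls and converts the image.
    return ["singularity", "build", sif_path, f"docker://{docker_image}"]


def render_batch_script(
    job_name: str,
    sif_path: str,
    command: List[str],
    nodes: int = 1,
    gpus_per_node: int = 1,
    time_limit: str = "01:00:00",
) -> str:
    """Render a Slurm batch script that runs `command` inside the container."""
    quoted = " ".join(shlex.quote(part) for part in command)
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --nodes={nodes}",
        f"#SBATCH --gres=gpu:{gpus_per_node}",  # GPUs requested per node
        f"#SBATCH --time={time_limit}",
        "",
        # --nv exposes the host NVIDIA GPUs inside the container.
        f"srun singularity exec --nv {shlex.quote(sif_path)} {quoted}",
    ]
    return "\n".join(lines) + "\n"
```

For example, `singularity_build_cmd("pytorch/pytorch:latest", "pytorch.sif")` yields the conversion command, and `render_batch_script("train", "pytorch.sif", ["python", "train.py"])` yields a script that could be handed to `sbatch`. Actual resource directives depend on how the target cluster is configured.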
Enterprises and research labs looking to solve these complex scientific problems have invested hundreds of millions of dollars in building Slurm-based HPC infrastructures and related software.
Enterprises in various industries are using AI-based deep learning methodologies popularized by hyperscalers to solve a range of problems, including autonomous driving, drug discovery, and process automation. Deep learning requires an infrastructure that is similar to that of HPC - GPU-accelerated compute nodes and large storage interconnected through a fast network.
Kubernetes has become the de facto platform to run AI/ML workloads at scale. Open-source AI/ML platforms such as Kubeflow, and other commercial offerings, are generally built on top of Kubernetes. Increasingly, these platforms use MLOps methodologies to productionize models quickly.
Enterprises using HPC for traditional scientific algorithm development are also expanding rapidly into using AI/ML and deep learning to solve business and product problems. Other than the similarity of the hardware infrastructure (GPU-accelerated and networked compute nodes connected to large storage) the domains are distinctly different in toolsets, management, orchestration and development frameworks.
Enterprises running HPC and AI workloads would benefit significantly from using a common infrastructure for both workloads, particularly when they have invested millions of dollars into the HPC infrastructure. It would be highly beneficial if one could get the benefits of MLOps via a Kubernetes-based AI/ML platform and combine that with HPC/Slurm to get scale and resiliency.
One could approach this in two ways:
1) Integrate via a Slurm/Kubernetes operator
This method tightly couples the Slurm cluster to the Kubernetes cluster and makes the Slurm cluster look like an extension of the Kubernetes nodes.
Advantages:
i) Tight coupling of Kubernetes to Slurm, so most Kubernetes tools can be used
ii) Any Kubernetes workload can be scheduled
Disadvantages:
i) Difficult to support an on-demand usage model
ii) Slurm must support Kubernetes semantics, and the mapping may not be clean
iii) Slurm scale and resiliency may not be supported
iv) Administration is done from Kubernetes
The following project uses this approach - https://github.com/sylabs/slurm-operator
2) Integrate Slurm with an MLOps controller using a controller plugin
This method uses a hub (MLOps controller) and spoke (HPC/Slurm clusters) model in which the clusters are loosely connected. It allows enterprises to get the best of both worlds: they can run traditional HPC workloads and deep learning AI/ML workloads while continuing to use their existing infrastructure.
Advantages:
i) Loose integration and hence a much simpler model
ii) Runs on-demand, compute-intensive AI/ML workloads, such as model training, at scale
iii) Independent clusters, with independent administration and usage domains
iv) Continues to support a traditional HPC/Slurm environment
v) Less intrusive
Disadvantages:
i) Limited to AI/ML
ii) Limited to job-related activity such as automated model training
It is our belief that the second option is more feasible for real-world implementations. As highlighted previously in this article, the tools, frameworks, and workflows are different between the two worlds, and they are being developed and maintained by different organizations and users, with different goals and methodologies. Keeping the different approaches in sync is difficult, and there are going to be compatibility issues that will be impractical to bridge.
Allowing each domain to develop independently, with communication enabled through plug-ins, is a much better approach. In addition to allowing each domain to progress as its own requirements dictate, it also allows compatibility to be ensured with a relatively thin layer of software.
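The idea of a thin plug-in layer can be sketched in a few lines of Python. This is an illustrative sketch only, not the design of any particular product: the `Job`, `Backend`, `KubernetesBackend`, `SlurmPlugin`, and `Hub` names are hypothetical, and the backends return placeholder strings instead of talking to a real cluster. The point is that the hub only needs to know a small `submit` interface, while each spoke keeps its own semantics behind it.

```python
from dataclasses import dataclass
from typing import Dict, Protocol


@dataclass
class Job:
    name: str
    target: str  # which spoke should run the job, e.g. "kubernetes" or "slurm"
    spec: str    # backend-specific description, e.g. a path to a Slurm script


class Backend(Protocol):
    """The thin interface every plug-in implements."""

    def submit(self, job: Job) -> str:
        ...


class KubernetesBackend:
    def submit(self, job: Job) -> str:
        # Placeholder: a real backend would create a Kubernetes workload here.
        return f"k8s:{job.name}"


class SlurmPlugin:
    def submit(self, job: Job) -> str:
        # Placeholder: a real plug-in would hand job.spec to the Slurm cluster.
        return f"slurm:{job.name}"


class Hub:
    """Routes each job to the backend registered for its target."""

    def __init__(self, backends: Dict[str, Backend]) -> None:
        self._backends = dict(backends)

    def dispatch(self, job: Job) -> str:
        backend = self._backends.get(job.target)
        if backend is None:
            raise ValueError(f"no backend registered for target {job.target!r}")
        return backend.submit(job)
```

With a hub built as `Hub({"kubernetes": KubernetesBackend(), "slurm": SlurmPlugin()})`, a job whose target is `"slurm"` is routed to the Slurm plug-in, and neither side needs to understand the other's internals. Supporting another kind of cluster means registering one more backend.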
DKube is a commercial MLOps offering that is built on top of best-of-breed open-source AI/ML platforms such as Kubeflow & MLflow. It integrates with best-in-class AI components such as PyTorch, TensorFlow, scikit-learn, JupyterLab, RStudio, and many more.
DKube implements a hub and spoke model to integrate HPC/Slurm. The hub runs the MLOps management and control plane and the associated Kubernetes workloads. Slurm is the spoke, and it is integrated into DKube through a Slurm plug-in. The plug-in communicates with Slurm/SchedMD and schedules the Slurm jobs. All the semantics of Slurm are retained, and the job description is supplied through Slurm scripts written by the user. These jobs can be a Kubeflow pipeline stage, a Kubeflow AutoML/Katib job, or just a DKube job. Each job can be any traditional Slurm job and can use the full capabilities of Slurm.
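As a rough illustration of what scheduling a user-provided Slurm script can look like, the Python sketch below submits a script with `sbatch` and records the resulting job ID. It is not DKube's plug-in code. It assumes the `sbatch` command is reachable from wherever the code runs (for example on a login node), and the `SlurmJob`, `parse_sbatch_output`, and `submit` names are hypothetical. The `--parsable` flag makes `sbatch` print only the job ID, optionally followed by a semicolon and the cluster name.

```python
import subprocess
from dataclasses import dataclass
from typing import Optional


@dataclass
class SlurmJob:
    job_id: str
    cluster: Optional[str]  # set only when sbatch reports a cluster name


def parse_sbatch_output(output: str) -> SlurmJob:
    """Parse `sbatch --parsable` output: 'jobid' or 'jobid;cluster'."""
    job_id, _, cluster = output.strip().partition(";")
    if not job_id.isdigit():
        raise ValueError(f"unexpected sbatch output: {output!r}")
    return SlurmJob(job_id=job_id, cluster=cluster or None)


def submit(script_path: str, runner=subprocess.run) -> SlurmJob:
    """Submit the user's Slurm script unchanged and return the new job."""
    # `runner` is injectable so the function can be exercised without a cluster.
    result = runner(
        ["sbatch", "--parsable", script_path],
        capture_output=True,
        text=True,
        check=True,  # raise if sbatch exits with a non-zero status
    )
    return parse_sbatch_output(result.stdout)
```

Because the script is passed through untouched, every `#SBATCH` directive the user wrote still applies, which matches the goal of retaining Slurm semantics. The returned job ID could then be used to poll status with standard Slurm tools.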
In this implementation, the DKube hub and the Slurm spoke are in different administrative domains, loosely connected via a DKube plug-in running on the DKube cluster. The plug-in understands the Slurm semantics. DKube implements the complete MLOps workflow and runs the associated AI/ML workloads on Kubernetes, while the HPC/Slurm cluster runs the traditional HPC workloads. When a DKube job needs to run on HPC/Slurm, DKube communicates with the Slurm cluster via the plug-in.
The following workflow is used for an AI/ML model executed on an HPC/Slurm cluster:
In order to learn more about DKube, please visit www.dkube.io