
Certain classes of machine learning research -- such as life sciences, drug discovery, autonomous driving, and oil exploration -- require computational capability beyond what a standard server can normally provide. Training at this scale benefits significantly from a dedicated High Performance Computing (HPC) platform.
Until now, the obstacle has been that the MLOps workflow, which is based on Kubernetes, uses different applications, frameworks, tools, workflows, and administration than HPC systems, which are often based on Slurm.
[Figure: MLOps on HPC/Slurm with Kubeflow]
DKube™ removes this obstacle by allowing you to submit your ML jobs directly to a Slurm-based HPC system, without compromising DKube's Kubeflow foundation or its broad MLOps capabilities. This combines the advantages of both types of platforms and enables use cases that would not otherwise be feasible.
The program code and datasets do not need to be modified. All required translation is handled automatically, and the remote execution supports all of the powerful features of the DKube MLOps platform. This includes:
- Integration into the end-to-end DKube MLOps workflow
- Local Kubernetes storage for MLOps metadata
- KFServing for production serving
- Access to MLflow metric collection, display, & comparison
- Lineage of every run & model for reproducibility, enhancement, governance, & audit
- Separate management of Kubernetes & HPC domains
- On-demand use of HPC training only when required
- Automatic versioning of data and models
- Hyperparameter tuning at job granularity
- Support for Kubeflow Pipelines with HPC jobs spawned from a pipeline step
- Full MLOps model workflow
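
To make the idea of automatic translation more concrete, the following is a minimal sketch of how a Kubernetes-style training job description could be rendered as a Slurm batch script. It is not DKube's actual mechanism or API: the `TrainingJob` class, its fields, and the `to_sbatch_script` function are hypothetical names introduced only for illustration, and the use of Singularity as the container runtime on the HPC cluster is an assumption. Only the `#SBATCH` directives shown (`--job-name`, `--nodes`, `--time`, `--gres`) are standard Slurm options.

```python
from dataclasses import dataclass


@dataclass
class TrainingJob:
    """Hypothetical, simplified description of an ML training job."""
    name: str
    image: str                    # container image holding the unmodified program code
    command: str                  # training entry point, run as-is
    nodes: int = 1
    gpus_per_node: int = 0
    time_limit: str = "01:00:00"  # HH:MM:SS


def to_sbatch_script(job: TrainingJob) -> str:
    """Render the job as a Slurm batch script (illustrative only)."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job.name}",
        f"#SBATCH --nodes={job.nodes}",
        f"#SBATCH --time={job.time_limit}",
    ]
    # Request GPUs only when the job asks for them.
    if job.gpus_per_node > 0:
        lines.append(f"#SBATCH --gres=gpu:{job.gpus_per_node}")
    # Assumption: the cluster runs containers with Singularity;
    # --nv exposes the NVIDIA GPUs inside the container.
    gpu_flag = "--nv " if job.gpus_per_node > 0 else ""
    lines.append(
        f"srun singularity exec {gpu_flag}docker://{job.image} {job.command}"
    )
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    job = TrainingJob(
        name="example-train",
        image="example/trainer:latest",  # placeholder image name
        command="python train.py",
        nodes=2,
        gpus_per_node=4,
    )
    print(to_sbatch_script(job))
```

In this sketch the training command and container image pass through unchanged, which mirrors the point above that program code and datasets do not need to be modified; only the scheduling wrapper differs between the Kubernetes and Slurm domains.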