Many organizations have HPC clusters with large compute and GPU resource pools. Tapping those resources for AI/ML workloads can be cumbersome: researchers, students, and individual employees end up hand-building plug-ins and stitching together open source libraries and tools, duplicating cost and effort while reducing collaboration.
Moreover, AI/ML models often need the traceability, lineage, and governance required by regulatory or safety bodies in an industry or country. Commercial MLOps platforms provide those capabilities, but they were not built to take advantage of HPC compute and GPU resources.
With DKube you can offload your data pre-processing or AI training jobs to a vSphere-based Slurm cluster, either as individual jobs/runs or as part of pipelines. Full traceability, lineage, and logging of the work performed is maintained in a SQL database. Multiple HPC clusters can be attached while the control plane of the DKube MLOps platform runs on a Kubernetes cluster such as VMware Tanzu, providing you with all the core innovations of Kubeflow and MLflow.
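To give a sense of what an offloaded job looks like on the Slurm side, here is a minimal sketch of a batch script for a training run. This is a generic Slurm job definition, not DKube's actual generated script; the partition name, GPU count, module, and `train.py` entry point are illustrative assumptions.

```shell
#!/bin/bash
# Hypothetical Slurm batch script for a training job offloaded to an
# attached HPC cluster. Resource values below are placeholders.
#SBATCH --job-name=train-model     # job name shown in squeue
#SBATCH --partition=gpu            # assumed GPU partition name
#SBATCH --gres=gpu:1               # request one GPU
#SBATCH --time=02:00:00            # wall-clock limit
#SBATCH --output=train-%j.log      # per-job log file (%j = job ID)

module load python/3.9             # site-provided toolchain (assumption)
srun python train.py --epochs 10   # hypothetical training entry point
```

A script like this would be submitted with `sbatch train.sh`, and its stdout/stderr captured in the `--output` log, which is the kind of per-job logging DKube surfaces back in its lineage records.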
Please click here to receive a link to the recording in your email inbox.
How to set up your first project in DKube by connecting your code and data repositories. Learn which kinds of code and data sources are available by default in DKube.
Learn how to integrate DKube with HPC/LSF clusters: configure the initial setup, schedule pre-processing or training jobs (including Kubeflow pipeline jobs), and analyze results with MLflow-based model comparison metrics.
The next generation of enterprise applications will increasingly be AI/ML models applied to accelerate existing processes or to solve new problems, such as speeding drug discovery and development in life sciences. Kubeflow is an open source reference architecture for AI/ML platforms, initiated by Google with contributions from several IT infrastructure leaders such as IBM, Red Hat, Cisco, Dell, and AWS, supporting on-prem and hybrid deployment of AI/ML.