
DKube uses an innovative hub-and-spoke architecture to integrate the remote Slurm cluster into the MLOps workflow. The hub and the cluster communicate through simple plug-ins. This has the following advantages:
- Loose integration allows the two domains (MLOps and Slurm) to keep their own tools, disciplines, administration, and workflows
- It is non-intrusive to the HPC system
- ML workloads can be run on the compute-intensive HPC system on demand
The primary activity happens on the hub, a Kubeflow-based framework that runs containers on Kubernetes. The hub handles:
- The management of the system
- The data sources
- Metadata storage
- Job management
- Automation
- Model management
The HPC/Slurm cluster is the spoke in the architecture, and there can be multiple Slurm clusters in the system. The Slurm cluster:
- Executes the job using Singularity
- Communicates with the DKube hub
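To make the spoke's role concrete, here is a minimal sketch of how a job could be rendered as a Slurm batch script that runs inside a Singularity image. It is not DKube's plug-in code. The function name `render_sbatch_script`, its parameters, and the image and command names are hypothetical. Only the `#SBATCH` directives and the `singularity exec` invocation are standard Slurm and Singularity usage.

```python
import shlex


def render_sbatch_script(job_name, image, command, gpus=0, time_limit="01:00:00"):
    """Return the text of a Slurm batch script that runs `command` in a Singularity image.

    Hypothetical helper for illustration; DKube's actual plug-in may work differently.
    """
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --time={time_limit}",
        f"#SBATCH --output={job_name}-%j.out",  # %j expands to the Slurm job ID
    ]
    if gpus > 0:
        lines.append(f"#SBATCH --gres=gpu:{gpus}")

    # --nv exposes the host's NVIDIA GPUs to the container; only added for GPU jobs.
    nv_flag = ["--nv"] if gpus > 0 else []
    run = ["singularity", "exec", *nv_flag, image, *command]
    lines.append(shlex.join(run))  # quote arguments safely for the shell
    return "\n".join(lines) + "\n"


script = render_sbatch_script(
    "train-mnist",                                   # hypothetical job name
    "train.sif",                                     # hypothetical Singularity image
    ["python", "train.py", "--epochs", "5"],         # hypothetical training command
    gpus=1,
)
print(script)
```

On the cluster, a script like this would typically be submitted with `sbatch`, after which Slurm schedules it like any other HPC job. That is consistent with the non-intrusive integration described above.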
Adding a remote HPC/Slurm cluster to the DKube Kubernetes hub is quick and straightforward. The information required to access the cluster, including the credentials, is entered from the DKube UI. This creates a link between the clusters so that they can be viewed as a single MLOps entity.
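The link between the hub and its spokes can be pictured as a small registry. The sketch below is purely illustrative: the class names `SlurmClusterLink` and `Hub`, the field names, and the example host are assumptions, not DKube's data model or API. In DKube itself this information is entered through the UI.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SlurmClusterLink:
    """Access details for one remote Slurm cluster (hypothetical fields)."""
    name: str
    host: str
    username: str
    # Reference to the credential; excluded from repr so it does not leak into logs.
    ssh_key_path: str = field(repr=False)


@dataclass
class Hub:
    """The hub keeps one link per spoke, so several Slurm clusters can coexist."""
    clusters: dict = field(default_factory=dict)

    def add_cluster(self, link):
        if link.name in self.clusters:
            raise ValueError(f"cluster {link.name!r} is already linked")
        self.clusters[link.name] = link

    def targets(self):
        # Jobs can run on the hub's own Kubernetes cluster or on any linked spoke.
        return ["kubernetes", *sorted(self.clusters)]


hub = Hub()
link = SlurmClusterLink("hpc-a", "login.example.org", "mlops", "~/.ssh/id_ed25519")
hub.add_cluster(link)
print(hub.targets())
```

Keeping each spoke as a separate record is one way to let the hub treat every linked cluster as just another execution target.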