Want to learn how to monitor your models in production? The DKube platform integrates model monitoring into the overall system with DKube Monitor. It includes everything engineers and executives need to see how well their models are achieving business goals, and it facilitates a smooth workflow to improve them when necessary.
The next generation of enterprise applications will increasingly be AI/ML models applied to accelerate existing processes or solve new problems, such as speeding up drug discovery and development in the life sciences. Kubeflow is an open-source reference architecture for an AI/ML platform, initiated by Google and contributed to by several IT platform and infrastructure leaders such as IBM, Red Hat, Cisco, Dell, and AWS, for on-prem and hybrid deployment of AI/ML.
Over the last decade, enterprises have invested heavily in High Performance Computing (HPC) to solve complex scientific problems. They have used Slurm to schedule massively parallel jobs on large clusters of compute nodes with accelerated hardware. AI/ML uses similar hardware for deep learning model training, and enterprises are looking for solutions that support AI/ML model development on top of their existing HPC infrastructure. A recent trend in AI/ML is to use agile MLOps methodologies to productionize models quickly. Marrying the two, MLOps-based AI/ML development and HPC/Slurm clusters, should speed the adoption of AI/ML on infrastructure that enterprises already own. This article elaborates on how to combine two popular open-source frameworks, Slurm and Kubeflow, to run AI/ML workloads at scale on HPC clusters.