How to launch hyperparameter tuning runs using the Katib-based hyperparameter tuning available in Kubeflow, and pick the best model from the multiple training runs as the winner based on pre-set criteria.
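The "winner" selection boils down to comparing each trial's objective metric against the pre-set goal. A minimal, illustrative sketch of that logic in plain Python (Katib performs this comparison itself; the trial names, parameters, and metric values below are made-up examples, not Katib's API):

```python
def pick_best_trial(trials, metric="accuracy", goal="maximize"):
    """Return the trial whose objective metric best satisfies the pre-set goal."""
    key = lambda t: t["metrics"][metric]
    return max(trials, key=key) if goal == "maximize" else min(trials, key=key)

# Hypothetical results from three completed tuning runs.
completed_trials = [
    {"name": "trial-1", "params": {"lr": 0.1},   "metrics": {"accuracy": 0.89}},
    {"name": "trial-2", "params": {"lr": 0.01},  "metrics": {"accuracy": 0.93}},
    {"name": "trial-3", "params": {"lr": 0.001}, "metrics": {"accuracy": 0.91}},
]

best = pick_best_trial(completed_trials)
print(best["name"])  # trial-2
```

In Katib, the objective metric name and the maximize/minimize goal are declared up front in the Experiment spec, and the controller surfaces the best trial once the runs complete.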
Over the last decade, enterprises have invested heavily in High Performance Computing (HPC) to solve complex scientific problems, using Slurm to schedule massively parallel jobs on large clusters of compute nodes with accelerated hardware. Deep learning model training relies on similar hardware, so enterprises are looking for solutions that enable AI/ML model development on top of their existing HPC infrastructure. A recent trend in AI/ML is to use agile MLOps methodologies to move models into production quickly. Marrying the two - MLOps-driven AI/ML development on HPC/Slurm clusters - lets enterprises adopt both far faster. This article elaborates on how to combine two popular open-source frameworks, Slurm and Kubeflow, to run AI/ML workloads at scale on HPC clusters.
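One way to picture the bridge between the two worlds: a Kubeflow pipeline step can compose a standard Slurm batch script and hand it to the cluster's `sbatch` command. A minimal sketch, assuming hypothetical job, partition, and training-command names (the `#SBATCH` directives shown are standard Slurm options):

```python
def build_sbatch_script(job_name, partition, gpus, command):
    """Compose a Slurm batch script with standard #SBATCH directives."""
    return "\n".join([
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",     # job label in the Slurm queue
        f"#SBATCH --partition={partition}",   # target node partition
        f"#SBATCH --gres=gpu:{gpus}",         # request GPU resources
        command,                              # the training command itself
    ])

# Hypothetical training job; a real pipeline step would submit this script
# to the HPC cluster, e.g. by piping it to `sbatch` on a login node.
script = build_sbatch_script("train-resnet", "gpu", 4, "python train.py --epochs 10")
print(script)
```

The rest of the article builds on this idea: Kubeflow handles the MLOps workflow while Slurm remains the scheduler that actually places the work on the HPC nodes.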
How to set up a Jupyter Notebook or RStudio IDE to import or write your program code, including Kubeflow pipeline DSL. Learn about the supported ML frameworks and how to import a custom image into your notebook.
Want to learn how to monitor your models in production? The DKube platform integrates model monitoring into the overall system through DKube Monitor. It includes everything engineers and executives need to see how well your models are achieving their business goals, and facilitates a smooth workflow to improve them when necessary.
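To make the idea of production monitoring concrete, here is a minimal sketch of one check such a monitor might run: the Population Stability Index (PSI) between a feature's training-time and production distributions. The bucket fractions below are made-up illustrative data, and this is a generic drift metric, not DKube Monitor's specific implementation:

```python
import math

def psi(expected, actual):
    """PSI over matching histogram buckets (each list of fractions sums to 1)."""
    return sum((a - e) * math.log(a / e) for e, a in zip(expected, actual))

baseline   = [0.25, 0.25, 0.25, 0.25]  # feature histogram at training time
production = [0.30, 0.28, 0.22, 0.20]  # same histogram observed in production

score = psi(baseline, production)
# Common rule of thumb: PSI < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 major shift
print(round(score, 4))
```

A monitoring system runs checks like this continuously and raises an alert when the score crosses a pre-set threshold, prompting retraining or investigation.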