Reproducibility is the foundational principle of the scientific method. If an experiment cannot be repeated, it is assumed to be faulty. DKube, an end-to-end Kubeflow-based MLOps platform, builds complete reproducibility into an integrated workflow. Without the ability to trace and repeat your work, it is not science.
The ability to track and reproduce your data is critical throughout the ML/DL process.
Reproducibility has some important aspects:
The overall workflow for developing a model can be summarized by the following general phases:
These phases can be combined in different ways depending upon the size and formality of the organization, but the basic approach is similar for most data science projects.
The ML Engineer phase is where reproducibility is the most valuable. The basic training code has been developed, and the entire environment needs to be optimized to address inference on real data.
With all those variables in play, the number of training runs and models can become large, and it is somewhere between challenging and impossible to analyze what option caused which outcome without some assistance from the platform.
DKube, an end-to-end MLOps platform, provides this assistance automatically, and it is fully integrated into your workflow. DKube is based on Kubeflow, a standards-based platform that brings together best-in-class frameworks and systems. DKube extends this baseline to provide an integrated and supported DL/ML platform.
The first step in bringing order to this chaos is to use versioning when creating new models. When a training run is executed, the output can be either a new version of an existing model, or it can be an entirely new model.
Versioning groups models that share a common heritage and differ in only a limited number of inputs. For example, you might want to see how different hyperparameters impact your selected metrics. Keeping those runs as versions of the same model lets you compare their metrics and determine the best fit.
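The idea of versioning can be sketched in a few lines of Python. This is only an illustration of the concept, not DKube's API: the `ModelRegistry` class, its methods, and the model name `digit-classifier` are all hypothetical. A run recorded under an existing model name becomes a new version, a run under a new name starts a new model, and the versions of one model can then be compared on a chosen metric.

```python
from dataclasses import dataclass, field


@dataclass
class ModelVersion:
    version: int
    hyperparameters: dict
    metrics: dict


@dataclass
class ModelRegistry:
    """Hypothetical in-memory registry, used only to illustrate versioning."""
    models: dict = field(default_factory=dict)

    def record_run(self, model_name, hyperparameters, metrics):
        # A known name gets the next version number; an unknown name
        # starts an entirely new model at version 1.
        versions = self.models.setdefault(model_name, [])
        entry = ModelVersion(len(versions) + 1, dict(hyperparameters), dict(metrics))
        versions.append(entry)
        return entry

    def best_version(self, model_name, metric, higher_is_better=True):
        # Compare all versions of one model on a single metric.
        pick = max if higher_is_better else min
        return pick(self.models[model_name], key=lambda v: v.metrics[metric])


registry = ModelRegistry()
registry.record_run("digit-classifier", {"learning_rate": 0.1}, {"accuracy": 0.91})
registry.record_run("digit-classifier", {"learning_rate": 0.01}, {"accuracy": 0.95})

best = registry.best_version("digit-classifier", "accuracy")
print(best.version, best.hyperparameters)  # 2 {'learning_rate': 0.01}
```

Because both runs share a model name, they differ only in the hyperparameter under study, which is what makes the metric comparison meaningful.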
In this example based on DKube, the model lineage is shown after a training run. The input code and datasets are provided, along with any additional hyperparameters, and the training run is identified.
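To make the notion of lineage concrete, here is a minimal sketch of what a lineage record might hold. The `LineageRecord` class and every field name and value in it are assumptions made for illustration; DKube captures this information for you, and its internal format is not shown here.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class LineageRecord:
    """Hypothetical record tying a model version to the inputs that produced it."""
    model: str
    model_version: int
    run_id: str           # the training run that produced the model
    code_commit: str      # the exact revision of the training code
    dataset: str
    dataset_version: int
    hyperparameters: dict


record = LineageRecord(
    model="digit-classifier",
    model_version=2,
    run_id="run-0042",
    code_commit="3f9c2ab",
    dataset="mnist",
    dataset_version=1,
    hyperparameters={"learning_rate": 0.01, "epochs": 10},
)

# Storing the record next to the model keeps the answer to
# "what produced this model?" available long after the run has finished.
serialized = json.dumps(asdict(record), sort_keys=True)
restored = LineageRecord(**json.loads(serialized))
print(restored == record)
```

The point of the sketch is that every input is pinned to a specific version, so the run can be repeated exactly.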
Navigating to the associated code, dataset, or run is accomplished by selecting it directly from the lineage box. From this screen, you can:
Creating a new training run with different data or hyperparameters is straightforward. You can access the run from the lineage screen and clone it right from there.
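In DKube, cloning is done from the UI. Conceptually, a clone is a copy of the run's inputs with a few values overridden, which could be sketched as follows. The `TrainingRun` class and `clone_run` function are hypothetical names used only for this illustration.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import Optional

_run_ids = count(1)  # simple source of unique run identifiers


@dataclass(frozen=True)
class TrainingRun:
    code_commit: str
    dataset: str
    hyperparameters: dict
    run_id: int = field(default_factory=lambda: next(_run_ids))
    cloned_from: Optional[int] = None


def clone_run(run, dataset=None, **hyperparameter_overrides):
    """Copy a run's inputs, changing only the values passed in."""
    return TrainingRun(
        code_commit=run.code_commit,
        dataset=dataset or run.dataset,
        # Later keys win, so overrides replace the original values.
        hyperparameters={**run.hyperparameters, **hyperparameter_overrides},
        cloned_from=run.run_id,
    )


base = TrainingRun(
    code_commit="3f9c2ab",
    dataset="mnist:v1",
    hyperparameters={"learning_rate": 0.01, "epochs": 10},
)
variant = clone_run(base, learning_rate=0.001)
print(variant.hyperparameters, variant.cloned_from)
```

Keeping a `cloned_from` reference preserves the link between the variant and the run it was derived from, so the change between the two stays traceable.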
By tracing back the lineage to the program or dataset, you can also see where else that code or dataset was used. This provides insight into how broadly your inputs are being selected for training. For example, you want to ensure that you’re not using the same dataset over and over, which might overfit your model to that dataset.
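Tracing usage in the reverse direction amounts to building an index from each input to the runs that consumed it. The sketch below assumes run records are available as plain dictionaries; the record layout, identifiers, and the `usage_index` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical run records; in practice the platform tracks these.
runs = [
    {"run_id": "run-1", "code": "train.py@3f9c2ab", "dataset": "mnist:v1"},
    {"run_id": "run-2", "code": "train.py@3f9c2ab", "dataset": "mnist:v1"},
    {"run_id": "run-3", "code": "train.py@8d1e0c4", "dataset": "mnist:v2"},
]


def usage_index(runs, key):
    """Map each code or dataset reference to the runs that used it."""
    index = defaultdict(list)
    for run in runs:
        index[run[key]].append(run["run_id"])
    return dict(index)


dataset_usage = usage_index(runs, "dataset")
code_usage = usage_index(runs, "code")

# Report how much of the training effort each dataset accounts for.
for name, run_ids in dataset_usage.items():
    share = len(run_ids) / len(runs)
    print(f"{name}: {len(run_ids)} runs ({share:.0%})")
```

A dataset that accounts for most of the runs is a signal to check whether the results generalize beyond it.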
Finally, once you have a workflow established, DKube enables flexible and powerful automation through Kubeflow Pipelines or CI/CD.
DKube enables best-in-class components to be brought together for your experiments and training. And it allows data scientists to focus on the science.