The primary focus of AI/ML over the past several years has been to develop and deploy models that achieve key organizational business goals. That is a precondition of any viable development platform, since everything else in the workflow depends on it.
However, every model comes with an expiry date. It is well known within the machine learning community that no matter how well a model is trained, its results will degrade over time for a variety of reasons, such as changing input data or shifting business goals.
Therefore, the ability to automatically monitor model results in real time has become a focus of AI/ML tool development. The important steps in this capability are to determine when and why a model has drifted out of tolerance, and - more importantly - to get back quickly and directly to the original development environment in order to incrementally improve the results.
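The core idea behind detecting this kind of degradation can be sketched with a two-sample Kolmogorov-Smirnov test on a monitored feature. This is a minimal, generic illustration - the function name is hypothetical and this is not DKube's internal method:

```python
import numpy as np
from scipy.stats import ks_2samp

def input_drift_detected(train_sample, live_sample, alpha=0.01):
    """Flag drift when the live feature distribution differs
    significantly from the training-time distribution."""
    statistic, p_value = ks_2samp(train_sample, live_sample)
    return bool(p_value < alpha)

# Synthetic data: the live feature drifts away from its training distribution.
rng = np.random.default_rng(0)
train = rng.normal(loc=0.0, scale=1.0, size=5000)
live_ok = rng.normal(loc=0.0, scale=1.0, size=5000)
live_shifted = rng.normal(loc=0.8, scale=1.0, size=5000)

print(input_drift_detected(train, live_shifted))  # True: distribution shift detected
print(input_drift_detected(train, live_ok))
```

In practice a monitor runs checks like this continuously over windows of serving data, per feature, and raises an alert when a threshold is crossed.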
This is especially challenging for on-prem development, where the data needs to stay near the machine learning system for compliance or security reasons. This is common in the life sciences, insurance, and federal/defense markets.
DKube™ is based on Kubeflow, but one of its significant added values is that it extends the open standard with important new capabilities. Kubeflow does not include model monitoring, and DKube plugs that gap for this key function.
DKube is a standards-based, end-to-end platform, built on Kubeflow and MLflow, that enables rapid, powerful machine learning development and production.
The DKube platform integrates model monitoring into the overall system with DKube Monitor. It includes everything engineers and executives need to identify how well models are achieving their business goals - and facilitates a smooth workflow to improve them when necessary.
DKube Monitor is available as part of the comprehensive DKube Full Suite, or can be licensed as a separate module to support development outside of DKube.
Learn more about the DKube MLOps platform
The DKube Monitor dashboard provides access to all of the models being monitored, and gives a quick indication of how they are performing. This enables:
- Management executives to review the overall serving outcomes, and how they compare to the business goals
- Production engineers to see more details about any deviations in order to figure out what is happening
- Data scientists to identify what needs to be changed in order to improve the inference serving outcomes
Monitor alerts are quick and easy to set up using the UI-based workflow.
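Conceptually, each alert is a trigger point on a monitored metric. A minimal sketch of that idea (illustrative only - not the DKube alert API, which is configured through the UI):

```python
from dataclasses import dataclass

@dataclass
class Alert:
    metric: str
    threshold: float
    direction: str = "below"  # fire when the metric drops below the threshold

    def fires(self, value: float) -> bool:
        if self.direction == "below":
            return value < self.threshold
        return value > self.threshold

# Example: alert when serving accuracy drops under the required tolerance.
accuracy_alert = Alert(metric="accuracy", threshold=0.90)
print(accuracy_alert.fires(0.87))  # True: accuracy is out of tolerance
print(accuracy_alert.fires(0.93))  # False: still within tolerance
```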
Identifying the Problem
Once the degraded model has been identified, DKube Monitor enables the production engineer to determine why the outcomes are not within the required tolerances. This might be due to:
- The input data changing so that it is no longer properly represented by the training data
- A conceptual problem with the code
- A new business goal that makes the current tolerance no longer acceptable
- New features that are shown to better correlate with the required outcomes
DKube Monitor allows the production engineer to drill down using an intuitive hierarchical approach, selecting the appropriate metrics and timeframe. Through iteration, the reason for the degradation will become clear.
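The drill-down amounts to slicing the monitored metric history by metric and timeframe. A toy sketch over a synthetic metric log (record fields are hypothetical, not the DKube schema):

```python
from datetime import datetime, timedelta

# Hypothetical exported metric log: one record per scoring window,
# with serving accuracy declining by 0.01 per day.
records = [
    {"metric": "accuracy",
     "time": datetime(2023, 5, 1) + timedelta(days=d),
     "value": 0.95 - 0.01 * d}
    for d in range(10)
]

def drill_down(records, metric, start, end):
    """Select one metric over a timeframe, oldest first."""
    hits = [r for r in records
            if r["metric"] == metric and start <= r["time"] <= end]
    return sorted(hits, key=lambda r: r["time"])

window = drill_down(records, "accuracy",
                    datetime(2023, 5, 5), datetime(2023, 5, 9))
print(len(window))                                # 5 records in the window
print(window[-1]["value"] < window[0]["value"])   # True: accuracy is degrading
```

Iterating over different metrics and windows like this is how the cause of a degradation is narrowed down.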
Improving the Outcome
The monitoring and root cause identification workflow can be accomplished whether the model was developed and deployed within DKube, or in an external machine learning platform.
After the reasons for the degradation have been analyzed and the necessary changes have been identified, a retraining and redeployment process is initiated.
If the model was developed and deployed outside DKube, the model and environment must be matched up with the set of inputs that produced it. Since the development and deployment were done outside DKube, this process will depend on the AI/ML platform used.
If the model was developed and deployed within DKube using the Full Suite package, the process of identifying the model and environment, leading to retraining and redeployment, is simple and direct. This integrated workflow is described below.
Retraining the Model
Development accomplished using the DKube Full Suite package can take advantage of the powerful tracking and lineage capability that allows a model to be retrained quickly. The full set of inputs - including the feature sets, training code, datasets, and hyperparameters - for every deployed model is kept as part of the metadata. That means that once the reason for the output degradation is understood, the full training environment for the current model can be directly accessed and used as the starting point for incremental improvement.
Learn more about Tracking, Lineage & History
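The value of that lineage metadata is that a retraining run can start from the recorded inputs and override only what needs to change. A minimal sketch of the idea - the record fields and function are illustrative, not DKube's actual schema:

```python
import copy

# Illustrative lineage record: the full set of inputs captured for a
# deployed model version (field names are hypothetical).
run = {
    "model": "churn-predictor",
    "version": 7,
    "code": "https://example.com/ml/churn.git@a1b2c3",  # training code commit
    "datasets": ["customers-2023q1"],
    "featureset": "churn-features-v3",
    "hyperparameters": {"learning_rate": 0.01, "max_depth": 6},
}

def clone_for_retraining(run, **changes):
    """Start a new run from the recorded inputs, overriding only
    what needs to change (e.g. a refreshed dataset)."""
    new_run = copy.deepcopy(run)
    new_run["version"] += 1
    new_run.update(changes)
    return new_run

# Retrain on the current quarter's data; everything else is inherited.
retrain = clone_for_retraining(run, datasets=["customers-2023q2"])
print(retrain["version"])   # 8
print(retrain["datasets"])  # ['customers-2023q2']
```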
The changes can be simple or complex.
- If the reason for the degradation is simply that the training data needs to be updated to better reflect the current inference serving data, retraining is straightforward. The same code can be used, and the hyperparameters can be modified based on the new input dataset. This can even be automated through hyperparameter optimization or a Kubeflow pipeline.
- The problem might be more complex, and require new coding to take into account an updated feature engineering pipeline, or different algorithms. This is handed off to the development team through an issue management coordination system.
In both cases, the starting point is the existing set of inputs, and this is quick and easy within DKube. The monitoring system takes you directly to the right model and exposes its inputs. The current run is cloned, and any changes are made before kicking off a retraining session.
In most cases, the development and training will have been automated with Kubeflow Pipelines, and retraining is accomplished by triggering a pipeline run.
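Re-optimizing hyperparameters against the refreshed dataset can be as simple as an automated random search. A toy sketch with a synthetic objective standing in for "retrain and score on a held-out set" (names and ranges are illustrative):

```python
import random

random.seed(42)

def validation_error(params):
    """Stand-in for retraining the model and scoring a held-out set;
    in practice this step trains on the refreshed dataset."""
    lr, depth = params["learning_rate"], params["max_depth"]
    return (lr - 0.05) ** 2 + 0.001 * abs(depth - 6)

def random_search(n_trials=50):
    """Try random hyperparameter combinations, keep the best."""
    best_params, best_err = None, float("inf")
    for _ in range(n_trials):
        params = {
            "learning_rate": random.uniform(0.001, 0.3),
            "max_depth": random.randint(2, 12),
        }
        err = validation_error(params)
        if err < best_err:
            best_params, best_err = params, err
    return best_params, best_err

best_params, best_err = random_search()
print(best_params, best_err)
```

A pipeline run would wrap a search like this as one stage, followed by evaluation and conditional deployment stages.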
Learn more about Hyperparameter Optimization
Learn more about Automation
Deploying the Updated Model
The outcome of the retraining will be a new version of the model that better matches the business goals. If the deployment uses DKube Full Suite, the updated model can be pushed to the serving cluster, and the new model version will replace the existing one for live inference.
In this integrated scenario, the existing monitor - with its current set of alerts and trigger points - can be automatically transferred to the new version without needing to recreate them. Any changes to what is being monitored, or to the acceptable tolerances, can be made at that time if required.