[FEA] MLOps for Dataset Drift detection #533
Comments
@albert17 could you please review the PR and write down here what stats are already collected? Thanks.
These are the stats being collected right now:
A very interesting and well-written example by Google:
Does NVTabular currently support a data drift detection module? Or does it integrate with an existing tool such as Evidentlyai.com?
When a model is deployed in production, detecting changes and anomalies in new incoming data is critical to making sure that the predictions are valid and can be safely consumed. Therefore, users should be able to analyze drift in their data to understand how it changes over time. Data drift is one of the main reasons for degradation in model accuracy over time. It occurs when the statistical properties of the input variables (model input data) change, e.g., due to seasonality, changes in personal preferences, or shifting trends.
One type of data drift is covariate shift, which refers to a change in the distribution of the input variables between the training data and the new data. Another type of drift in ML is concept drift, which is a shift in the relationship between the independent variables and the target variable. Simply put, what we are trying to predict (the statistical properties of the target variable) changes over time.
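As a concrete illustration of covariate shift, the sketch below computes the Population Stability Index (PSI) between a reference (training) sample and a newly collected sample of a single numeric feature. This is only an illustrative metric, not part of NVTabular; the function name, bin count, and thresholds are assumptions.

```python
import numpy as np

def population_stability_index(expected, actual, bins=10):
    """Population Stability Index between a reference sample and a new sample.

    PSI = sum((p_actual - p_expected) * ln(p_actual / p_expected))
    Common rule of thumb: < 0.1 no shift, 0.1-0.25 moderate, > 0.25 significant.
    """
    # Bin edges come from the reference (training) distribution; new values
    # outside the reference range are simply not counted by np.histogram.
    edges = np.histogram_bin_edges(expected, bins=bins)
    expected_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    actual_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Floor the proportions to avoid division by zero and log(0).
    expected_pct = np.clip(expected_pct, 1e-6, None)
    actual_pct = np.clip(actual_pct, 1e-6, None)
    return float(np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct)))
```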
There are existing tools for data and model monitoring, for example the Azure ML DataDriftDetector module, scikit-multiflow, Databricks + MLflow, and Amazon SageMaker Model Monitor.
We'll want to add some of the commonly measured and monitored components to the dataset evaluator and to the dataset generation tool.
To detect data drift, we may want to collect some stats:
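As a possible baseline (the exact statistics tracked in the PR may differ), the sketch below collects per-column summary statistics with pandas: row count and null fraction for every column, mean/std/min/max for numeric columns, and cardinality plus top values for categorical columns.

```python
import pandas as pd

def column_stats(df: pd.DataFrame) -> dict:
    """Collect illustrative per-column summary statistics for a dataset snapshot."""
    stats = {}
    for col in df.columns:
        s = df[col]
        entry = {
            "dtype": str(s.dtype),
            "num_rows": int(len(s)),
            "null_fraction": float(s.isna().mean()),
        }
        if pd.api.types.is_numeric_dtype(s):
            entry.update(
                mean=float(s.mean()),
                std=float(s.std()),
                min=float(s.min()),
                max=float(s.max()),
            )
        else:
            entry.update(
                cardinality=int(s.nunique()),
                top_values=s.value_counts().head(10).to_dict(),
            )
        stats[col] = entry
    return stats
```

Snapshots of these stats can then be compared between the training dataset and each new batch of serving data.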
To test for distribution differences in the data, there are several statistical tests:
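For instance, here is a sketch of two common two-sample tests, assuming scipy is available: a Kolmogorov-Smirnov test for numeric columns and a chi-square test on category frequencies for categorical columns. The significance level `alpha` and the function name are assumptions.

```python
import pandas as pd
from scipy import stats

def distribution_shift_test(reference: pd.Series, current: pd.Series, alpha: float = 0.05) -> dict:
    """Two-sample test for a distribution difference between a reference
    (training) column and a newly collected column."""
    if pd.api.types.is_numeric_dtype(reference):
        # Kolmogorov-Smirnov test on the raw values of a numeric column.
        stat, p_value = stats.ks_2samp(reference.dropna(), current.dropna())
        test = "ks_2samp"
    else:
        # Chi-square test on the per-category frequency tables.
        categories = sorted(set(reference.dropna()) | set(current.dropna()))
        ref_counts = reference.value_counts().reindex(categories, fill_value=0)
        cur_counts = current.value_counts().reindex(categories, fill_value=0)
        stat, p_value, _, _ = stats.chi2_contingency([ref_counts, cur_counts])
        test = "chi2_contingency"
    return {"test": test, "statistic": float(stat), "p_value": float(p_value),
            "drift_detected": bool(p_value < alpha)}
```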
For model drift detection, we should save stats about model accuracy (we can call them accuracy drift metrics):
Basically, compare the base predictions with the newly collected predictions.
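A minimal sketch of such a check, assuming ground-truth labels eventually become available for the collected predictions; the `tolerance` threshold and function name are assumptions rather than an existing API.

```python
import numpy as np

def accuracy_drift(base_accuracy: float,
                   new_labels: np.ndarray,
                   new_predictions: np.ndarray,
                   tolerance: float = 0.05) -> dict:
    """Compare the accuracy measured at deployment time (base) against the
    accuracy computed on newly collected, labelled predictions."""
    current_accuracy = float(np.mean(new_labels == new_predictions))
    drop = base_accuracy - current_accuracy
    return {
        "base_accuracy": base_accuracy,
        "current_accuracy": current_accuracy,
        "accuracy_drop": drop,
        # Flag drift when accuracy falls more than the allowed tolerance.
        "drift_detected": drop > tolerance,
    }
```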