I am interested in advanced analytics related to supply chains, and decided to take on a modeling project to gain experience and better understand one particular challenge: predicting whether a product will go on backorder.
A backorder may be the result of strong sales performance (e.g., the product is in such high demand that production cannot keep up with sales). However, backorders can upset customers, lead to canceled orders, and erode customer loyalty. Companies want to avoid backorders, but they also want to avoid overstocking every product, which drives up inventory costs.
I used a dataset from Kaggle with approximately 1.9 million observations of parts over an 8-week period. The original source of the data is unreferenced.
Predictors include current inventory, sales history, forecasted sales, recommended stocking amount, and part risk flags (22 predictors in total). The data presented several challenges:
- Imbalanced outcome: Only 0.7% of parts actually go on backorder (see the quick data check after this list).
- Outliers and skewed predictors: Part quantities (stock, sales, etc.) can be on very different scales.
- Missing data: A few variables have missing values, and the missingness does not appear to be random.
- n>>p: There are many observations (1.9 million) relative to the number of predictors (22).
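As a quick check of the imbalance and missingness noted above, the sketch below profiles the raw file with pandas. The file name (`Kaggle_Training_Dataset_v2.csv`) and the outcome column name (`went_on_backorder`) are assumptions based on the public Kaggle dataset and may differ from your copy.

```python
import pandas as pd

# Load the raw training file; file and column names are assumptions
# based on the public "Can You Predict Product Backorders?" dataset.
df = pd.read_csv("Kaggle_Training_Dataset_v2.csv")

# Imbalanced outcome: the positive class should be roughly 0.7%.
print(df["went_on_backorder"].value_counts(normalize=True))

# Missing data: count missing values per predictor.
print(df.isna().sum().sort_values(ascending=False).head(10))
```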
We made several modeling decisions to address these issues:
- Model choice: We chose tree-based models, which typically perform well with imbalanced data.
- Robustness to outliers and skewed predictors: Because tree-based models partition the predictor space rather than estimate coefficients, outliers and skewness are much less of a concern than they are for many other predictive models.
- Dealing with missing data: The few variables with missing values were mean-imputed, and a binary indicator variable was created to flag whether the observation had missing data, in the hope of accounting for the data not being missing at random (a minimal imputation sketch follows this list).
- Cross-validation: We used 10-fold cross-validation to tune model hyperparameters (the maximum number of variables to try at each split and the minimum leaf size) and to compare model performance (see the tuning sketch below).
- Validation metric: The area under the ROC curve (AUC) was used because the outcome is so imbalanced; with only 0.7% positives, raw accuracy would be misleading, since always predicting "no backorder" is 99.3% accurate. By examining ROC curves after fitting the models, we can choose a classification cutoff rather than naively assuming a threshold of 0.5 (see the threshold sketch below).
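A minimal sketch of the imputation strategy described above, using scikit-learn's `SimpleImputer` with `add_indicator=True` to produce both the mean-imputed column and the binary missingness flag. The toy array stands in for a predictor such as lead time:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy column with missing entries, standing in for a real predictor.
X = np.array([[8.0], [np.nan], [2.0], [np.nan], [12.0]])

# Mean-impute and append a 0/1 indicator of missingness.
imputer = SimpleImputer(strategy="mean", add_indicator=True)
X_out = imputer.fit_transform(X)

# Each row is now [imputed_value, was_missing]; missing entries get the
# column mean (here ~7.33) and the indicator is set to 1.
print(X_out)
```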
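For the tuning step, the sketch below shows one way to run 10-fold cross-validation with AUC scoring over the two hyperparameters named above; in scikit-learn's random forest these correspond to `max_features` (variables tried per split) and `min_samples_leaf` (minimum leaf size). The grid values and the synthetic data are assumptions for illustration, not the project's actual settings:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic imbalanced data (~0.7% positive) as a stand-in for the
# backorder dataset, with the same number of predictors (22).
X, y = make_classification(n_samples=5000, n_features=22,
                           weights=[0.993], random_state=0)

# 10-fold CV over the two hyperparameters, scored by ROC AUC.
grid = GridSearchCV(
    RandomForestClassifier(n_estimators=100, random_state=0),
    param_grid={"max_features": [3, 5, 8],
                "min_samples_leaf": [1, 5, 25]},
    scoring="roc_auc",
    cv=10,
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```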
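Finally, a sketch of choosing a classification cutoff from the ROC curve rather than defaulting to 0.5. Youden's J statistic (maximizing TPR − FPR) is one common rule; the writeup does not specify which rule was actually used, so this is illustrative only, on synthetic scores:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

rng = np.random.default_rng(0)
# Synthetic labels (~0.7% positive) and predicted probabilities,
# standing in for the fitted models' output.
y = rng.random(100_000) < 0.007
scores = np.clip(rng.normal(0.2 + 0.4 * y, 0.15), 0, 1)

print("AUC:", roc_auc_score(y, scores))

# Pick the threshold that maximizes Youden's J = TPR - FPR.
fpr, tpr, thresholds = roc_curve(y, scores)
best = np.argmax(tpr - fpr)
print("Chosen threshold:", thresholds[best])
```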