💻2022-2-Machine-Learning-Project💻

2022-2 Machine Learning Project ( Team Crong )

0. Instructions

✅ We wrote the code on Google Colab. Here is our Colab notebook.

https://colab.research.google.com/drive/1zJUm-ulgHRSWLjiItuUfvvxZxjyxWD3K?usp=sharing

✅ You can reproduce our code in this notebook.

✅ Each time the runtime is restarted, the performance metric values may change slightly.
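This run-to-run variation usually comes from unseeded randomness. Below is a minimal sketch of pinning the seeds so that restarts give identical metrics; it uses synthetic stand-in data, since the notebook's exact pipeline is not reproduced here.

```python
import random

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

SEED = 42  # any fixed value works; only consistency matters

# Seed every RNG the pipeline touches so restarts give identical metrics.
random.seed(SEED)
np.random.seed(SEED)

# Synthetic stand-in for the drug dataset (5 features, 5 classes).
rng = np.random.default_rng(SEED)
X = rng.normal(size=(200, 5))
y = rng.integers(0, 5, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=SEED, stratify=y
)
clf = RandomForestClassifier(random_state=SEED).fit(X_tr, y_tr)
acc1 = clf.score(X_te, y_te)

# Re-fitting with the same seed reproduces the exact same score.
clf2 = RandomForestClassifier(random_state=SEED).fit(X_tr, y_tr)
acc2 = clf2.score(X_te, y_te)
```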

1. Problem Statement

1-1. Problem Definition

(figure: causes of medication errors)

We identified mistakes by medical staff as the main cause of medication errors. To prevent such accidents, we decided to build a model that prescribes the appropriate medication for a given patient.

1-2. Project Outline

(figure: project outline)

2. Chosen Model

(figure: four candidate models)

To find the most accurate model, we compared four candidates: decision tree, random forest, SVM, and naïve Bayes.

(figure: training and testing accuracy by model)

This graph shows the training and testing scores. We chose random forest because it showed the highest accuracy.
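The comparison above can be sketched with 5-fold cross-validation; the stand-in data and resulting scores below are illustrative, not the notebook's actual numbers.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the 200-row drug dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = rng.integers(0, 5, size=200)

candidates = {
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(random_state=0),
    "svm": SVC(random_state=0),
    "naive_bayes": GaussianNB(),
}

# Mean 5-fold cross-validation accuracy for each candidate,
# mirroring the comparison graph.
scores = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in candidates.items()
}
best = max(scores, key=scores.get)
```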

2-1. Chosen Hyperparameters

(figure: initial hyperparameter settings)

At first we used representative default values for the parameters, and tuned them later in the hyperparameter-tuning step based on model performance.

3. Experimental Setting

3-1. Dataset Settings

(figure: dataset summary)

There are five input features, and the output is one of five drug types. The dataset has 200 rows, so we used 5-fold cross-validation to make more efficient use of the data.

3-2. Data Categorization & Splitting the Dataset

(figure: feature categorization and train/test split)

For more efficient learning, we categorized the features ourselves: Age and Na_to_K were binned by the ranges shown, and the remaining features were encoded as binary variables. The dataset was split into a 70% train set and a 30% test set.
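A sketch of this binning, encoding, and 70/30 split, assuming column names like those in the drug dataset; the exact bin edges in the notebook may differ.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Small synthetic sample shaped like the drug dataset (values are assumptions).
df = pd.DataFrame({
    "Age": [23, 47, 61, 15, 34, 70, 52, 28],
    "Sex": ["F", "M", "M", "F", "F", "M", "F", "M"],
    "BP": ["HIGH", "LOW", "NORMAL", "HIGH", "LOW", "HIGH", "NORMAL", "LOW"],
    "Cholesterol": ["HIGH", "NORMAL", "HIGH", "NORMAL",
                    "HIGH", "NORMAL", "HIGH", "NORMAL"],
    "Na_to_K": [25.4, 13.1, 8.7, 27.8, 19.6, 11.3, 9.5, 31.2],
    "Drug": ["drugY", "drugX", "drugB", "drugY", "drugX", "drugA", "drugC", "drugY"],
})

# Bin the numeric features into ranges, then one-hot encode everything.
df["Age_binned"] = pd.cut(df["Age"], bins=[0, 20, 30, 40, 50, 60, 100],
                          labels=["<20", "20s", "30s", "40s", "50s", "60+"])
df["Na_to_K_binned"] = pd.cut(df["Na_to_K"], bins=[0, 10, 20, 30, 100],
                              labels=["<10", "10-20", "20-30", ">30"])
X = pd.get_dummies(df.drop(columns=["Age", "Na_to_K", "Drug"]))
y = df["Drug"]

# 70% train / 30% test split, as in the report.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)
```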

3-3. Smote Technique

(figure: class distribution before and after SMOTE)

A problem with our dataset is that drugY cases heavily outnumber the others. To keep the model from simply favoring the majority class, we applied the SMOTE technique, which oversamples the minority classes.

3-4. Resource

(figure: CPU and memory specifications)

The CPU and memory were of ordinary size, and these resources were sufficient for our project.

4. Results

4-1. Hyperparameter tuning

To improve our Random Forest classifier’s performance, we performed hyperparameter tuning. The target parameters and candidate values are shown below.

(figure: target hyperparameters and candidate values)

After tuning, we obtained the best parameters and the best accuracy.

(figure: best parameters and best accuracy)

The accuracy of the model with the best parameters is about 0.91, while the initial accuracy was about 0.879. Hyperparameter tuning therefore gave us a better model.
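The tuning step can be sketched with GridSearchCV; the parameter grid below is a typical random-forest grid, not necessarily the exact one in our notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = rng.integers(0, 5, size=200)

# Candidate values; the actual grid in the notebook may differ.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [None, 5, 10],
    "min_samples_split": [2, 4],
}

# Exhaustively evaluate every combination with 5-fold CV,
# the same cross-validation used in the rest of the project.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)
best_params = search.best_params_
best_score = search.best_score_
```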

4-2. Compare the results before & after hyperparameter tuning

Now, let’s compare the results before and after hyperparameter tuning. The accuracy, precision, recall, and F1-score changed as follows; all values increased slightly. For the exact code, please refer to our .ipynb file.

(figure: metrics before and after tuning)
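A sketch of how such before/after metrics can be computed, assuming the "after" model differs only by its tuned parameters (here just `n_estimators`, for illustration).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = rng.integers(0, 5, size=200)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

def macro_scores(model):
    """Fit and return (precision, recall, f1), macro-averaged on the test set."""
    y_pred = model.fit(X_tr, y_tr).predict(X_te)
    p, r, f1, _ = precision_recall_fscore_support(
        y_te, y_pred, average="macro", zero_division=0
    )
    return p, r, f1

before = macro_scores(RandomForestClassifier(random_state=0))
after = macro_scores(RandomForestClassifier(n_estimators=200, random_state=0))
```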

Here is the confusion matrix. There was no dramatic change, but the correct predictions on the diagonal increased slightly.

(figure: confusion matrices before and after tuning)
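The diagonal count can be read off programmatically; a toy sketch with assumed labels (in the notebook, `y_true` and `y_pred` come from the test set).

```python
from sklearn.metrics import confusion_matrix

# Toy true/predicted labels for illustration.
y_true = ["drugY", "drugY", "drugX", "drugA", "drugB", "drugC", "drugY", "drugX"]
y_pred = ["drugY", "drugX", "drugX", "drugA", "drugB", "drugC", "drugY", "drugY"]

labels = ["drugA", "drugB", "drugC", "drugX", "drugY"]
cm = confusion_matrix(y_true, y_pred, labels=labels)

# Diagonal entries are the correct predictions per class.
correct = cm.trace()
total = cm.sum()
```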

Next are the ROC curves. The closer a curve is to the upper left, the better the performance. Some curves moved toward the upper left after hyperparameter tuning.

(figure: ROC curves before and after tuning)

Here are the PR curves. The closer a curve is to the upper right, the better the performance. The curves moved toward the upper right after hyperparameter tuning.

(figure: PR curves before and after tuning)
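ROC and PR curves for a multiclass model are computed one-vs-rest; a sketch of the summary scores behind them (macro ROC AUC and macro average precision) on stand-in data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = rng.integers(0, 5, size=200)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)  # one probability column per class

# One-vs-rest scores summarizing the ROC and PR curves per class.
y_bin = label_binarize(y_te, classes=clf.classes_)
roc_auc = roc_auc_score(y_bin, proba, average="macro")
avg_prec = average_precision_score(y_bin, proba, average="macro")
```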

This is the feature ranking. After hyperparameter tuning, “Na_to_K_binned_20-30” gained importance, while “Cholesterol_NORMAL” and “Na_to_K_binned_10-20” became less important. Overall, “Blood Pressure” is still the most important feature. The features highlighted in yellow are the ones that changed.

(figure: feature importance ranking before and after tuning)
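The ranking chart corresponds to the forest's impurity-based feature importances; a sketch with assumed one-hot feature names mirroring the categorized dataset.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# Assumed one-hot feature names for illustration.
feature_names = [
    "BP_HIGH", "BP_LOW", "Cholesterol_NORMAL",
    "Na_to_K_binned_10-20", "Na_to_K_binned_20-30", "Sex_M",
]
X = rng.normal(size=(200, len(feature_names)))
y = rng.integers(0, 5, size=200)

clf = RandomForestClassifier(random_state=0).fit(X, y)

# Rank features by mean impurity decrease, as in the ranking chart.
ranking = pd.Series(clf.feature_importances_, index=feature_names)
ranking = ranking.sort_values(ascending=False)
```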

Comparing performance with these evaluation metrics, we can say that performance improved after hyperparameter tuning.

4-3. Predicting New Data

We will check whether our tuned model prescribes properly for a new patient. Let’s set the new patient’s feature list and predict on that data.

(figure: prediction for the new patient)

As expected, drugY is prescribed to the new patient. This shows that our tuned model prescribes well for new patients.
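A sketch of this prediction step on a tiny assumed training frame; the column and drug names mirror the dataset, but the toy model's output is not guaranteed to match the notebook's.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Tiny assumed training frame for illustration.
train = pd.DataFrame({
    "Age": [23, 47, 61, 15, 34, 70, 52, 28],
    "Sex": ["F", "M", "M", "F", "F", "M", "F", "M"],
    "BP": ["HIGH", "LOW", "NORMAL", "HIGH", "LOW", "HIGH", "NORMAL", "LOW"],
    "Cholesterol": ["HIGH", "NORMAL", "HIGH", "NORMAL",
                    "HIGH", "NORMAL", "HIGH", "NORMAL"],
    "Na_to_K": [25.4, 13.1, 8.7, 27.8, 19.6, 11.3, 9.5, 31.2],
    "Drug": ["drugY", "drugX", "drugB", "drugY", "drugX", "drugA", "drugC", "drugY"],
})
X = pd.get_dummies(train.drop(columns=["Drug"]))
clf = RandomForestClassifier(random_state=0).fit(X, train["Drug"])

# New patient with a high Na_to_K ratio (the drugY-associated range).
new_patient = pd.DataFrame({
    "Age": [40], "Sex": ["F"], "BP": ["HIGH"],
    "Cholesterol": ["HIGH"], "Na_to_K": [26.0],
})
# Reindex so the new row has exactly the training columns.
new_X = pd.get_dummies(new_patient).reindex(columns=X.columns, fill_value=0)
prediction = clf.predict(new_X)[0]
```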

5. Interpretation

5-1. Correlation between features and the type of drug predicted by the model

Now let's look at the relationship between our model's predictions and the input features. The correlation graphs between each feature and drug type are as follows.

(figure: Age vs. drug type)

  • drugA and drugC are not taken by children and teenagers (ages 0–19).
  • drugA is taken only by people younger than 50.
  • drugB is taken only by people older than 50.
  • drugY is taken by all ages.

(figure: Sex vs. drug type)

  • Men received drugA, drugB, and drugC more often than women did.
  • Women received drugY and drugX more often than men did.
  • This alone makes ‘Sex’ a more useful feature for classification than it seemed at first.

(figure: Blood Pressure vs. drug type)

  • drugA and drugB were given only to people with HIGH blood pressure.
  • drugC was given to people with LOW blood pressure.
  • drugX was not given to people with HIGH blood pressure.
  • BP is therefore still an important feature for classification.

(figure: Na_to_K vs. drug type)

  • People with a Na_to_K ratio greater than 20 all received drugY.
  • Na_to_K is therefore still an important feature for classifying drugY.

(figure: Cholesterol vs. drug type)

  • drugC was given only to people with HIGH cholesterol, so Cholesterol is an important feature for classifying drugC.
  • Compared to the beginning, drugA and drugB were given almost exclusively to people with NORMAL cholesterol.
  • Therefore, Cholesterol also becomes an important feature for classifying drugA and drugB.
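Breakdowns like the graphs above can be computed with a crosstab; a toy sketch on a small assumed sample of (feature, drug) pairs.

```python
import pandas as pd

# Small assumed sample of the (feature, drug) pairs behind the plots above.
df = pd.DataFrame({
    "BP": ["HIGH", "HIGH", "LOW", "LOW", "NORMAL", "HIGH", "LOW", "NORMAL"],
    "Drug": ["drugA", "drugB", "drugC", "drugX", "drugX", "drugY", "drugY", "drugY"],
})

# Counts of each drug per blood-pressure level, like the correlation graphs.
table = pd.crosstab(df["BP"], df["Drug"])
```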

5-2. Cause of misclassification

As shown below, 9 misclassifications occurred among the 60 test samples.

(figure: misclassified samples)

Then, let’s see the feature distributions of misclassified data.

(figure: feature distributions of the misclassified data)

There was no significant difference among the misclassified samples, and the predicted values showed no obvious tendency. One notable point is that the Na_to_K 10–20 bin took up a substantial proportion; samples in this range are ambiguous because they can plausibly be classified into any drug type. Also, people who should have been prescribed drugY were often misclassified, which we suspect is because drugY spans such a wide range of the data. We therefore believe the misclassifications are caused by the ambiguity of values near the category boundaries.
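A sketch of how the misclassified rows can be pulled out for this kind of inspection; the labels and features below are assumed, while in the notebook they come from the model and test set.

```python
import numpy as np
import pandas as pd

# Assumed test labels and predictions for illustration.
y_test = pd.Series(["drugY", "drugX", "drugY", "drugA", "drugC", "drugY"])
y_pred = np.array(["drugY", "drugY", "drugX", "drugA", "drugC", "drugY"])
X_test = pd.DataFrame({
    "Na_to_K_binned_10-20": [0, 1, 1, 0, 1, 0],
    "BP_HIGH": [1, 0, 1, 1, 0, 1],
})

# Pull out the misclassified rows to inspect their feature distribution.
mask = y_test.values != y_pred
missed = X_test[mask]
n_missed = int(mask.sum())
rate_10_20 = missed["Na_to_K_binned_10-20"].mean()
```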

5-3. Conclusion

(figure: conclusion)
