# Random Forest

Random Forest is an **ensemble** machine learning algorithm. Where the classification decision used to come from a single model, such as a logistic regression or a decision tree, Random Forest uses n decision trees to assign a sample to its class. Each tree is built on a different part of the data, and every tree has a say in the final decision.

Decision trees on their own also have a shortcoming:
* They can fit the data in hand well, but they are not flexible enough to classify new, unseen data accurately

Random Forest combines the simplicity of a decision tree with the flexibility to handle new data, resulting in a large improvement in accuracy.

## Steps to create a Random Forest

* Bootstrapping: we randomly draw a sample from our dataset with replacement, so one record can be selected multiple times
* Now we create a decision tree using the bootstrapped dataset; say we have 10 independent variables
  * Here, at each step, we randomly select n variables and check which one splits the data best
  * Say we have fixed the number of random variables considered at each step to 4
  * For the root node, we randomly select 4 out of the 10 variables
  * For each of these variables, we check which gives the best split, exactly as in a decision tree
  * Say variable 3 gives the best split; it becomes our root node
  * The root node splits into two nodes, one on the right and one on the left; now we take the left one
  * We consider all variables apart from variable 3 (as it has already been used in the root node) and randomly select 4 out of the remaining 9
  * We continue this process until we have a fully grown decision tree
  * We can choose the number of variables considered when creating a node; it can be any number <= the total number of variables
  * NOTE: all variables are up for random selection apart from the one that split the node we are currently working on
* The above steps are repeated hundreds of times (from bootstrapping to growing a tree using the logic above)
* This process produces a wide variety of trees, and this variety is what makes a Random Forest more effective than a single decision tree (a code sketch of this loop follows below)

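The loop below is a hand-rolled sketch of these steps, assuming scikit-learn's `DecisionTreeClassifier` (its `max_features` argument performs the per-split random variable selection; note it does not exclude the parent node's variable, so it is a close but not exact match to the variant described above). Names such as `build_forest` and `n_trees` are illustrative, not part of this repository:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def build_forest(X, y, n_trees=100, n_random_features=4, seed=0):
    """Grow n_trees decision trees, each on its own bootstrap sample,
    with only a random subset of variables considered at every split."""
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(n_trees):
        # Bootstrapping: sample row indices with replacement, same size as the data.
        idx = rng.integers(0, len(X), size=len(X))
        # max_features limits each split to a random subset of the variables.
        tree = DecisionTreeClassifier(max_features=n_random_features,
                                      random_state=int(rng.integers(1_000_000)))
        trees.append(tree.fit(X[idx], y[idx]))
    return trees
```
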
## Decision Making using Random Forest

* Say we have built 100 trees using the above algorithm
* When we get a new sample, we run it through all the trees
* Say 75 trees classify the sample as 1 and 25 classify it as 0
* We classify the new sample as 1, since that is what gets the majority of the votes (see the sketch below)
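
A tiny sketch of that vote count, assuming `trees` is a list of fitted trees (for example from the hypothetical `build_forest` above) and using NumPy to tally the votes:

```python
import numpy as np

def forest_predict(trees, x):
    """Classify one sample by majority vote across all trees."""
    votes = [int(tree.predict(x.reshape(1, -1))[0]) for tree in trees]
    # e.g. 75 trees vote 1 and 25 vote 0 -> the forest predicts 1.
    return np.bincount(votes).argmax()
```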

This complete process, including bootstrapping the data and using aggregate votes to make a decision, is called **Bagging**.

## Measuring How Good a Random Forest is

* Say we had 1000 records in our training dataset
* When we built the 1st bootstrapped dataset, say we got 750 distinct records; since some were chosen multiple times, we still had a total of 1000 records
* This means 250 records were not chosen for bootstrapped dataset 1; these left-out records are commonly called out-of-bag records
* We run these 250 records through the trees and check whether they are correctly classified
* We calculate the misclassification error
* The same is done for the records left out of every other bootstrapped dataset
* Say every bootstrapped dataset left out 250 (random) records; with 100 bootstrapped trees we get 250 * 100 = 25,000 record-level checks of the Random Forest's accuracy (see the sketch below)
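
In scikit-learn this out-of-bag check is built in; a minimal sketch (the synthetic data and parameter values are illustrative only):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Illustrative data: 1000 records with 10 independent variables.
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# oob_score=True tells scikit-learn to score every record using only the trees
# whose bootstrap sample left that record out (its out-of-bag trees).
forest = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=42)
forest.fit(X, y)
print("Out-of-bag accuracy estimate:", forest.oob_score_)
```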