Commit 23f2a9d: Update README.md (parent 64e51f8)

1 file changed: README.md (+43, -2)

# Random Forest

Random forest is an **ensemble** machine learning algorithm. Where earlier the classification decision was based on a single model, either a logistic regression or a decision tree, Random Forest uses n decision trees to classify a sample into its respective class. It uses different parts of the data to build the trees, and each tree has a say in the final decision.
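
As a quick illustration (not code from this repository), here is a minimal sketch of using a random forest via scikit-learn's RandomForestClassifier; the dataset and parameter values are assumptions made for the example:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy dataset: 1000 samples, 10 features (values chosen for illustration)
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# n_estimators is the number of decision trees in the ensemble
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X, y)

# Each tree gets a say; the aggregated prediction comes from all of them
print(forest.predict(X[:5]))
```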

Decision trees on their own also have a notable shortcoming:

* They can fit the data at hand well, but when it comes to classifying new data they do not have enough flexibility to accommodate it

Random Forest combines the simplicity of a decision tree with the flexibility to accommodate new data, resulting in a vast improvement in accuracy.
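
To see this flexibility claim in action, here is a small sketch (my addition, on an assumed synthetic dataset) comparing a single decision tree with a random forest on held-out data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# A lone tree tends to score near-perfectly on training data but drop on
# test data; the forest usually holds up better on unseen data
print("tree   train/test:", tree.score(X_train, y_train), tree.score(X_test, y_test))
print("forest train/test:", forest.score(X_train, y_train), forest.score(X_test, y_test))
```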

## Steps to create a Random Forest

* Bootstrapping: we randomly sample records from our dataset with replacement, so one record can be selected multiple times (see the sketch after this list)
* Now we build a decision tree using the bootstrapped dataset; say we have 10 independent variables
* At each step, we randomly select a subset of variables and check which one splits the data best
* Say we have fixed the number of random variables considered at each step to 4
* For the root node, we randomly select 4 out of the 10 variables
* For each of these variables, we check which gives the best split, just as in a regular decision tree
* Say variable 3 gives the best split; it then becomes our root node
* The root node splits into 2 nodes, one on the right and one on the left; now we take the left one
* We consider all variables apart from variable 3 (as it is already used in the root node) and randomly select 4 out of the 9
* We continue this process until we have a fully grown decision tree
* We can choose the number of variables considered when creating a node; it can be any number <= the total number of variables
* NOTE: all variables are up for selection in the random subset apart from the one that split the node we are working on
* The above steps are repeated hundreds of times (from bootstrapping to growing a tree based on the logic above)
* This complete process results in a wide variety of trees, and this variety makes Random Forest more effective than a single decision tree
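
The sketch below (my own illustration, not code from this repository) shows the two sources of randomness described above: bootstrapping the rows and considering only a random subset of variables at each split. It leans on scikit-learn's DecisionTreeClassifier, whose max_features parameter performs the per-split variable selection; note that, unlike the exclusion rule described above, standard implementations allow a variable to be reused at deeper splits.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

def build_forest(X, y, n_trees=100, max_features=4):
    """Grow n_trees trees, each on its own bootstrap sample.

    max_features=4 mirrors the example above: at every split the tree
    considers only 4 randomly chosen variables out of those available.
    """
    trees = []
    n = len(X)
    for _ in range(n_trees):
        # Bootstrapping: draw n row indices with replacement,
        # so a record can appear multiple times
        idx = rng.integers(0, n, size=n)
        trees.append(DecisionTreeClassifier(max_features=max_features).fit(X[idx], y[idx]))
    return trees
```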

## Decision Making using Random Forest

* Say we have built 100 trees using the above algorithm
* When we get a new sample, we run it through all the trees
* Say 75 trees classify the sample as 1 and 25 classify it as 0
* We classify the new sample as 1, since that class gets the majority of the votes (see the sketch below)
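
A sketch of this voting step (again my illustration, assuming binary 0/1 labels and the build_forest function above):

```python
import numpy as np

def forest_predict(trees, X_new):
    """Classify each sample by majority vote across all trees."""
    # votes has shape (n_trees, n_samples): one predicted class per tree
    votes = np.array([tree.predict(X_new) for tree in trees])
    # Count the votes for class 1 per sample; e.g. 75 of 100 trees
    # voting 1 makes the prediction 1
    return (votes.sum(axis=0) > len(trees) / 2).astype(int)
```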

This complete process, bootstrapping the data and using the aggregate of the votes to make a decision, is called **Bagging** (short for bootstrap aggregating).

## Measuring How Good a Random Forest is

* Say we had 1000 records in our training dataset
* When we built the 1st bootstrapped dataset, we got 750 distinct records, and since some were chosen multiple times, we still had a total of 1000 records
* This means 250 records were not chosen for bootstrapped dataset 1; these are called the out-of-bag records
* We run these 250 records through the tree built from bootstrapped dataset 1 and check whether they are correctly classified
* We calculate the error from these misclassifications
* The same is done for the records not selected in each bootstrapped dataset
* Say every bootstrapped dataset left out 250 records (random ones); with 100 trees we get 250 * 100 = 25,000 out-of-bag predictions with which to check the accuracy of the Random Forest (see the sketch after this list)
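
A sketch of this out-of-bag check (my illustration, under the same assumptions as the sketches above): each tree is scored on the records its bootstrap sample left out, and the misclassification rates are pooled.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

def oob_error(X, y, n_trees=100, max_features=4):
    """Estimate the forest's error using out-of-bag records."""
    n = len(X)
    wrong = total = 0
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)        # bootstrap sample (with replacement)
        oob = np.setdiff1d(np.arange(n), idx)   # records the sample left out
        tree = DecisionTreeClassifier(max_features=max_features).fit(X[idx], y[idx])
        # Score this tree only on records it never saw during training
        wrong += (tree.predict(X[oob]) != y[oob]).sum()
        total += len(oob)
    return wrong / total  # pooled misclassification rate over all out-of-bag checks
```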
