Description
Hey, I'm working on a research paper focused on building a binary classification model in the biomedical domain. The dataset comprises approximately 800 data points. Suppose I want to feed a heterogeneous pool of classifiers to the dynamic selection methods. Following the examples, I've found two different ways of splitting the dataset and fitting the base classifiers of the pool:
- Split into train/test (e.g., 75/25) and then split the training portion into train/DSEL (e.g., 50/50). In the random forest example, the RF is fitted on the full 75% training portion and the DS methods on the 50% DSEL portion.
- In all the other examples, the 50% training portion is used to fit the base classifiers, and the 50% DSEL portion is used to fit the DS methods.
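To make sure I've understood the two schemes correctly, here is a minimal sketch of both splits using scikit-learn only. The dataset, the `RandomForestClassifier` pool, and the split ratios are placeholders for my actual setup; the commented-out `ds_method.fit` line stands in for whichever DESlib DS method is used:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for my ~800-sample biomedical dataset
X, y = make_classification(n_samples=800, random_state=0)

# 75/25 train/test split, then the 75% portion split 50/50 into train/DSEL
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)
X_tr, X_dsel, y_tr, y_dsel = train_test_split(
    X_train, y_train, test_size=0.5, stratify=y_train, random_state=0)

# Scheme 1 (random forest example): pool fitted on the full 75% portion
pool_scheme1 = RandomForestClassifier(n_estimators=10, random_state=0)
pool_scheme1.fit(X_train, y_train)
# ds_method.fit(X_dsel, y_dsel)  # DS method fitted on the DSEL half

# Scheme 2 (other examples): pool fitted only on the 50% train sub-split
pool_scheme2 = RandomForestClassifier(n_estimators=10, random_state=0)
pool_scheme2.fit(X_tr, y_tr)
# ds_method.fit(X_dsel, y_dsel)  # DSEL is the same in both schemes
```

In both schemes the DS method sees the same DSEL; the only difference is how much of the training data the base classifiers are fitted on.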
Furthermore, I wanted to point out this tip from the tutorial:
An important point here is that in case of small datasets or when the base classifier models in the pool are weak estimators such as Decision Stumps or Perceptrons, an overlap between the training data and DSEL may be beneficial for achieving better performance.
That seems to be my case, as my dataset is rather small compared to most datasets in the ML domain. Hence, I was thinking of fitting my base classifiers on the 75% portion and then leveraging some overlap to get better performance (and it really does help: overlapping yields a median AUC of 0.76, whereas non-overlapping gives 0.71).
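Concretely, the overlapping setup I have in mind looks like the sketch below: the pool is fitted on the entire 75% training portion, and DSEL is drawn from that same portion, so the two overlap. The classifier choice and ratios are placeholders, and the commented DS line is only indicative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=800, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Pool of weak estimators fitted on the whole 75% training portion
pool = BaggingClassifier(DecisionTreeClassifier(max_depth=2),
                         n_estimators=10, random_state=0)
pool.fit(X_train, y_train)

# Overlap: DSEL is a 50% subset of the SAME data the pool was trained on
_, X_dsel, _, y_dsel = train_test_split(
    X_train, y_train, test_size=0.5, stratify=y_train, random_state=0)
# ds_method.fit(X_dsel, y_dsel)  # DS method then sees data the pool has already seen
```

My worry is whether this overlap biases the competence estimates on DSEL, even though the held-out 25% test set stays untouched.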
What would be the best way of dealing with this?