
Proper way of fitting classifiers before creating a heterogeneous pool #264

@francescopisu

Description

Hey, I'm working on a research paper focused on building a binary classification model in the biomedical domain. The dataset comprises approximately 800 data points. Let's say I want to feed a heterogeneous pool of classifiers to the dynamic selection methods. Following the examples, I've found two different ways of splitting the dataset and fitting the base classifiers of the pool.

  1. Split into train/test (e.g., 75/25) and then split the training portion into train/DSEL (e.g., 50/50).
    In this random forest example, the RF is fitted on the full 75% training portion and the DS methods on the 50% DSEL portion.
  2. In all the other examples, the 50% training portion is used to fit the base classifiers and the 50% DSEL portion is used to fit the DS methods (both strategies are sketched in the snippet after this list).
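To make the two options concrete, here is a minimal sketch of both splits, assuming a small placeholder pool (logistic regression, naive Bayes, decision tree) and KNORA-Eliminate from DESlib; the estimators, DS method, and split ratios are illustrative, not my actual setup:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from deslib.des import KNORAE

X, y = make_classification(n_samples=800, random_state=42)

# 75/25 train/test split, then a 50/50 train/DSEL split of the training part.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)
X_tr, X_dsel, y_tr, y_dsel = train_test_split(
    X_train, y_train, test_size=0.5, stratify=y_train, random_state=42)

def make_pool():
    # Placeholder heterogeneous pool; swap in whichever estimators you use.
    return [LogisticRegression(max_iter=1000), GaussianNB(),
            DecisionTreeClassifier(random_state=42)]

# Strategy 1: base classifiers see the whole 75% training portion,
# so the pool's training data overlaps with DSEL.
pool_1 = [clf.fit(X_train, y_train) for clf in make_pool()]
ds_1 = KNORAE(pool_classifiers=pool_1).fit(X_dsel, y_dsel)

# Strategy 2: base classifiers see only the 50% sub-training portion,
# keeping the pool's training data and DSEL disjoint.
pool_2 = [clf.fit(X_tr, y_tr) for clf in make_pool()]
ds_2 = KNORAE(pool_classifiers=pool_2).fit(X_dsel, y_dsel)

print("strategy 1 (overlap):   ", ds_1.score(X_test, y_test))
print("strategy 2 (no overlap):", ds_2.score(X_test, y_test))
```

Note that in strategy 1 the pool's training data contains DSEL, which is exactly the overlap discussed in the tip below.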

Furthermore, I wanted to point out this tip taken from the tutorial:

An important point here is that in case of small datasets or when the base classifier models in the pool are weak estimators such as Decision Stumps or Perceptrons, an overlap between the training data and DSEL may be beneficial for achieving better performance.

That seems to be my case, as my dataset is rather small compared to most datasets in the ML domain. Hence, I was thinking of fitting my base classifiers on the 75% portion and then leveraging some overlap to get better performance (and this really is the case: overlapping leads to a median AUC of 0.76, whereas non-overlapping gives 0.71).
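For reference, this is roughly how I'm comparing the two setups over repeated random splits to get a median AUC (again with placeholder estimators and an arbitrary number of repetitions; this is not the exact code that produced the 0.76 vs. 0.71 figures):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from deslib.des import KNORAE

X, y = make_classification(n_samples=800, random_state=0)
aucs = {"overlap": [], "no_overlap": []}

for seed in range(30):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=seed)
    X_tr, X_dsel, y_tr, y_dsel = train_test_split(
        X_train, y_train, test_size=0.5, stratify=y_train, random_state=seed)

    # "overlap": the pool is fitted on the full training portion, which
    # contains DSEL; "no_overlap": pool and DSEL are kept disjoint.
    for name, (X_fit, y_fit) in {"overlap": (X_train, y_train),
                                 "no_overlap": (X_tr, y_tr)}.items():
        pool = [LogisticRegression(max_iter=1000).fit(X_fit, y_fit),
                GaussianNB().fit(X_fit, y_fit),
                DecisionTreeClassifier(random_state=seed).fit(X_fit, y_fit)]
        ds = KNORAE(pool_classifiers=pool, random_state=seed).fit(X_dsel, y_dsel)
        aucs[name].append(roc_auc_score(y_test, ds.predict_proba(X_test)[:, 1]))

for name, scores in aucs.items():
    print(f"{name}: median AUC = {np.median(scores):.3f}")
```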

What would be the best way of dealing with the problem?
