PyGAD gene_space cannot accept numpy linspace dtype=int  #140

Open
@jdmoore7

I am trying to use PyGAD to optimize hyper-parameters in ML models. According to the documentation:

The gene_space parameter customizes the space of values of each gene ... list, tuple, numpy.ndarray, or any range like range, numpy.arange(), or numpy.linspace: It holds the space for each individual gene. But this space is usually discrete. That is there is a set of finite values to select from.

As the gene_space definition below shows, its first element, which corresponds to solution[0] in the fitness function, is an array of integers. According to the documentation, this should be a discrete space, which it is. However, when a value from this integer array (built with np.linspace, which the documentation allows) reaches RandomForestClassifier, it arrives as a numpy.float64 (see the error in the third code block).

I don't understand where this change of data type occurs. Is this a PyGAD problem, and if so, how can I fix it? Or is it a numpy -> sklearn problem?

import numpy as np

gene_space = [
    # n_estimators
    np.linspace(50, 200, 25, dtype='int'),
    # min_samples_split
    np.linspace(2, 10, 5, dtype='int'),
    # min_samples_leaf
    np.linspace(1, 10, 5, dtype='int'),
    # min_impurity_decrease
    np.linspace(0, 1, 10, dtype='float')
]
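
As a quick sanity check on the claim that the first three spaces are integer-typed, the sketch below inspects the dtypes directly. It uses only NumPy; the variable names are illustrative and not part of the original report.

```python
import numpy as np

# Same spaces as in the report, rebuilt here so the check is self-contained.
n_estimators_space = np.linspace(50, 200, 25, dtype=int)
min_samples_split_space = np.linspace(2, 10, 5, dtype=int)
min_samples_leaf_space = np.linspace(1, 10, 5, dtype=int)
min_impurity_decrease_space = np.linspace(0, 1, 10, dtype=float)

int_spaces = [n_estimators_space, min_samples_split_space, min_samples_leaf_space]

# Every integer space really has an integer dtype before PyGAD sees it.
all_integer = all(np.issubdtype(space.dtype, np.integer) for space in int_spaces)
print(all_integer)

# With an integer dtype, the evenly spaced values are truncated to whole numbers.
print(min_samples_split_space.tolist())
print(min_samples_leaf_space.tolist())
```

So the spaces themselves are integers at this point; any float must be introduced later.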

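The definition of the fitness function used by the Genetic Algorithm:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.utils.random import sample_without_replacement

# `data` is a pandas DataFrame defined elsewhere; it must exist before this line runs.
def fitness_function_factory(data=data, y_name='y', sample_size=100):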

    def fitness_function(solution, solution_idx):
        model = RandomForestClassifier(
            n_estimators=solution[0],
            min_samples_split=solution[1],
            min_samples_leaf=solution[2],
            min_impurity_decrease=solution[3]
        )
        
        X = data.drop(columns=[y_name])
        y = data[y_name]
        X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                            test_size=0.5)        

        train_idx = sample_without_replacement(n_population=len(X_train), 
                                              n_samples=sample_size)         
        
        test_idx = sample_without_replacement(n_population=len(X_test), 
                                              n_samples=sample_size) 
         
        model.fit(X_train.iloc[train_idx], y_train.iloc[train_idx])
        fitness = model.score(X_test.iloc[test_idx], y_test.iloc[test_idx])
        
        return fitness 

    return fitness_function

And the instantiation of the Genetic Algorithm:

cross_validate = pygad.GA(gene_space=gene_space,
                          fitness_func=fitness_function_factory(),
                          num_generations=100,
                          num_parents_mating=2,
                          sol_per_pop=8,
                          num_genes=len(gene_space),
                          parent_selection_type='sss',
                          keep_parents=2,
                          crossover_type="single_point",
                          mutation_type="random",
                          mutation_percent_genes=25)

cross_validate.best_solution()
>>>
ValueError: n_estimators must be an integer, got <class 'numpy.float64'>.

Any recommendations on resolving this error?

EDIT: I tried the following, which works:

model = RandomForestClassifier(n_estimators=gene_space[0][0])
model.fit(X,y)

So the issue does not lie in the numpy -> sklearn hand-off but in PyGAD.
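
A possible explanation, offered as an assumption and not verified against the PyGAD source: each solution is stored as a single NumPy array, and an array that mixes integer and float genes is upcast to float64. The NumPy-only sketch below reproduces that upcast and shows one workaround, casting the integer hyper-parameters inside the fitness function. The helper name decode_solution is hypothetical.

```python
import numpy as np

gene_space = [
    np.linspace(50, 200, 25, dtype=int),   # n_estimators
    np.linspace(2, 10, 5, dtype=int),      # min_samples_split
    np.linspace(1, 10, 5, dtype=int),      # min_samples_leaf
    np.linspace(0, 1, 10, dtype=float),    # min_impurity_decrease
]

# Stand-in for a solution vector: one value per gene packed into ONE array.
# Mixing int64 and float64 values makes NumPy promote everything to float64.
solution = np.array([space[0] for space in gene_space])
print(solution.dtype)      # float64
print(type(solution[0]))   # <class 'numpy.float64'>


def decode_solution(solution):
    """Hypothetical helper: cast genes back to the types sklearn expects."""
    return {
        "n_estimators": int(round(solution[0])),
        "min_samples_split": int(round(solution[1])),
        "min_samples_leaf": int(round(solution[2])),
        "min_impurity_decrease": float(solution[3]),
    }


params = decode_solution(solution)
print(params)
```

Inside the fitness function, the dictionary could then be passed as RandomForestClassifier(**decode_solution(solution)). PyGAD also documents a gene_type parameter whose default is float; setting it to an integer type, or to per-gene types where the installed version supports that, may remove the need for the cast, but that route is untested here.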
