@@ -6,10 +6,10 @@ _AlpineGP_ is a Python library for **symbolic regression** via _Genetic Programm
 It provides a high-level interface to the [`DEAP`](https://github.com/alucantonio/DEAP)
 library, including distributed computing functionalities.
 
-Besides solving classical symbolic regression problems involving algebraic equations
+Besides solving classical symbolic regression problems involving _algebraic equations_
 (see, for example, the benchmark problems contained in the
 [SRBench](https://github.com/cavalab/srbench) repository), _AlpineGP_ is specifically
-design to help identifying _symbolic_ models of _physical systems_ governed by **field equations**.
+designed to help identifying _symbolic_ models of _physical systems_ governed by **field equations**.
 To this aim, it allows to exploit the **discrete calculus** framework defined and implemented in the library
 [`dctkit`](https://github.com/alucantonio/dctkit) as a natural and effective language to express physical models
 (i.e., conservation laws).
@@ -24,7 +24,7 @@ elastica). Scripts to reproduce these benchmarks can be found [here](https://git
 - scikit-learn compatible interface;
 - hyperparameter configuration via YAML files;
 - support for custom operators (with/without strong-typing);
-- benchmark suite (Nguyen and interface to SRBench)
+- benchmark suite (Nguyen and SRBench)
 
 ## Installation
 
@@ -67,112 +67,71 @@ $ ./bench.sh
 ```
 Then process the results using the `process_results` notebook.
 
-Results on [PMLB](https://epistasislab.github.io/pmlb/) datasets (average $R^2$ over 10
-test sets, no Friedman):
-
-| dataset | mean | median | std |
-|:------------------------------|-----------:|-----------:|-----------:|
-| 527_analcatdata_election2000 | 0.997727 | 0.999273 | 0.00357541 |
-| 663_rabe_266 | 0.994945 | 0.995115 | 0.00134602 |
-| 560_bodyfat | 0.988467 | 0.992938 | 0.0121634 |
-| 505_tecator | 0.986861 | 0.986026 | 0.0039009 |
-| 561_cpu | 0.957349 | 0.967161 | 0.0330056 |
-| 690_visualizing_galaxy | 0.963404 | 0.964137 | 0.00867664 |
-| 197_cpu_act | 0.94309 | 0.945666 | 0.00966613 |
-| 227_cpu_small | 0.946096 | 0.945094 | 0.00812824 |
-| 523_analcatdata_neavote | 0.936577 | 0.943564 | 0.0278365 |
-| 1096_FacultySalaries | 0.662191 | 0.894004 | 0.525012 |
-| 557_analcatdata_apnea1 | 0.881416 | 0.889496 | 0.0397044 |
-| 230_machine_cpu | 0.778943 | 0.879675 | 0.273846 |
-| 556_analcatdata_apnea2 | 0.863157 | 0.867148 | 0.0347729 |
-| 1027_ESL | 0.858838 | 0.860647 | 0.0127587 |
-| 695_chatfield_4 | 0.827457 | 0.830825 | 0.0677194 |
-| 229_pwLinear | 0.810944 | 0.811717 | 0.0453826 |
-| 210_cloud | 0.761678 | 0.786611 | 0.159399 |
-| 529_pollen | 0.787219 | 0.782358 | 0.0118861 |
-| 1089_USCrime | 0.739218 | 0.756442 | 0.117112 |
-| 503_wind | 0.747271 | 0.745787 | 0.0088297 |
-| 712_chscase_geyser1 | 0.751443 | 0.745605 | 0.0549794 |
-| 519_vinnie | 0.728873 | 0.719948 | 0.0377254 |
-| 228_elusage | 0.621403 | 0.714127 | 0.216677 |
-| 659_sleuth_ex1714 | 0.562146 | 0.702428 | 0.309503 |
-| 666_rmftsa_ladata | 0.679718 | 0.672306 | 0.0620477 |
-| 225_puma8NH | 0.66854 | 0.667771 | 0.0127414 |
-| 706_sleuth_case1202 | 0.418764 | 0.568134 | 0.43742 |
-| 1029_LEV | 0.557169 | 0.560547 | 0.0330229 |
-| 547_no2 | 0.50562 | 0.502983 | 0.0920748 |
-| 485_analcatdata_vehicle | 0.244083 | 0.47083 | 0.702171 |
-| 192_vineyard | 0.381856 | 0.38018 | 0.200867 |
-| 1030_ERA | 0.373955 | 0.373216 | 0.0453621 |
-| 1028_SWD | 0.335559 | 0.343532 | 0.0556771 |
-| 542_pollution | 0.170091 | 0.279329 | 0.254557 |
-| 665_sleuth_case2002 | 0.242165 | 0.25769 | 0.146767 |
-| 522_pm10 | 0.235107 | 0.233109 | 0.0445476 |
-| 678_visualizing_environmental | 0.0604016 | 0.193514 | 0.358373 |
-| 687_sleuth_ex1605 | -0.0707247 | -0.0740387 | 0.372597 |
-
-**Median test $R^2$: 0.7683**.
-
 ## Usage
 
 Setting up a symbolic regression problem in _AlpineGP_ involves several key steps:
 
 1. Define the function that computes the prediction associated to an _individual_
 (model expression tree). Its arguments may be a _function_ obtained by parsing the
-individual tree and possibly other parameters, such as the dataset needed to evaluate
-the model. It returns both an _error metric_ between the prediction and the data and
-the prediction itself.
+individual tree and possibly other parameters, such as the features (`X`) needed to evaluate
+the model. It returns both the error between the predictions and the labels (`y`) and
+the predictions themselves.
 ```python
-def eval_MSE_sol(individual, dataset):
+def eval_MSE_sol(individual, X, y):
 
     # ...
     return MSE, prediction
 ```
 
-1. Define the functions that return the **prediction** and the **fitness**
-associated to an individual. These functions **must** have the same
-arguments. In particular:
-- the first argument is **always** the batch of trees to be evaluated by the
-current worker;
-- the second argument **must** be the `toolbox` object used to compile the
-individual trees into callable functions;
-- the third argument **must** be the dataset needed for the evaluation of the
-individuals.
+2. Define the functions that return the **prediction** and the **fitness**
+associated to an individual. These functions **must** have at least the following
+arguments in the first three positions:
+- the list of trees to be evaluated by the current worker;
+- the `toolbox` object used to compile the individual trees into callable functions;
+- the dataset features needed for the evaluation of the individuals. The name of the argument **must** be `X`.
+Additionally, the fourth argument of the **fitness** function **must** be the dataset
+labels, called `y`. For unsupervised problems, `None` can be passed for the labels to the `fit`
+method of the regressor.
 Both functions **must** be decorated with `ray.remote` to support
-distributed evaluation (multiprocessing).
+distributed evaluation (multiprocessing). Any additional arguments can be set using
+the `common_data` argument of the `GPSymbolicRegressor` object (see below).
 ```python
 @ray.remote
-def predict(trees, toolbox, data):
+def predict(trees, toolbox, X):
 
     callables = compile_individuals(toolbox, trees)
 
     preds = [None]*len(trees)
 
     for i, ind in enumerate(callables):
-        _, preds[i] = eval_MSE_sol(ind, data)
+        _, preds[i] = eval_MSE_sol(ind, X, None)
 
     return preds
 
 @ray.remote
-def fitness(trees, toolbox, true_data):
+def fitness(trees, toolbox, X, y):
     callables = compile_individuals(toolbox, trees)
 
     fitnesses = [None]*len(trees)
 
     for i, ind in enumerate(callables):
-        MSE, _ = eval_MSE_sol(ind, data)
+        MSE, _ = eval_MSE_sol(ind, X, y)
 
         # each fitness MUST be a tuple (required by DEAP)
         fitnesses[i] = (MSE,)
 
     return fitnesses
 ```
 
-3. Set and solve the symbolic regression problem.
+3. Set up and solve the symbolic regression problem. The configuration of the
+`GPSymbolicRegressor` object can be specified via the arguments of its constructor
+(see the API docs), or loaded from a YAML file.
 ```python
-# read parameters from YAML file
-with open("ex1.yaml") as config_file:
-    config_file_data = yaml.safe_load(config_file)
+# read config parameters from YAML file
+yamlfile = "ex1.yaml"
+filename = os.path.join(os.path.dirname(__file__), yamlfile)
+
+regressor_params, config_file_data = util.load_config_data(filename)
 
 # ...
 # ...
@@ -192,18 +151,13 @@ common_params = {'penalty': penalty}
 gpsr = gps.GPSymbolicRegressor(pset=pset, fitness=fitness.remote,
                                predict_func=predict.remote, common_data=common_params,
                                print_log=True,
-                               config_file_data=config_file_data)
-
-# wrap tensors corresponding to train and test data into Dataset objects (to be passed to
-# fit and predict methods)
-train_data = Dataset("D", X_train, y_train)
-test_data = Dataset("D", X_test, y_test)
+                               **regressor_params)
 
 # solve the symbolic regression problem
-gpsr.fit(train_data)
+gpsr.fit(X_train, y_train)
 
 # compute the prediction on the test dataset given by the best model found during the SR
-pred_test = gpsr.predict(test_data)
+pred_test = gpsr.predict(X_test)
 ```
 
 A complete example notebook can be found in the `examples` directory. Also check the
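
Note on the new `eval_MSE_sol(individual, X, y)` signature: the diff leaves the function body elided (`# ...`), while the updated `predict` calls it with `None` in place of the labels. The following is a minimal sketch of one way such a function could handle both cases. It is not the repository's code. It assumes, purely for illustration, that the compiled individual is a plain Python callable applied to one sample at a time, that `X` and `y` are sequences of floats, and that returning `None` as the error is acceptable when no labels are given.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def eval_MSE_sol(
    individual: Callable[[float], float],
    X: Sequence[float],
    y: Optional[Sequence[float]] = None,
) -> Tuple[Optional[float], List[float]]:
    """Return (MSE, prediction) for one compiled individual.

    Assumption: `individual` maps a single feature value to a single
    prediction. When `y` is None (prediction-only call), the error is
    skipped and returned as None.
    """
    # evaluate the model on every sample
    prediction = [float(individual(x)) for x in X]

    if y is None:
        return None, prediction

    # mean squared error between predictions and labels
    MSE = sum((p - t) ** 2 for p, t in zip(prediction, y)) / len(prediction)
    return MSE, prediction


# stand-in for a compiled expression tree (hypothetical model: 2*x)
model = lambda x: 2.0 * x
X = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 7.0]

mse, pred = eval_MSE_sol(model, X, y)        # fitness-style call
_, pred_only = eval_MSE_sol(model, X, None)  # predict-style call
```

Making `y` optional keeps a single evaluation routine for both `fitness` and `predict`, which matches how the two remote functions in the diff share `eval_MSE_sol`.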