Commit 2574a08

Author: Alessandro Lucantonio
Commit message: Updated README.
1 parent 118a52a commit 2574a08

7 files changed (+49, -102 lines)

README.md (+33, -79)

@@ -6,10 +6,10 @@ _AlpineGP_ is a Python library for **symbolic regression** via _Genetic Programm
 It provides a high-level interface to the [`DEAP`](https://github.com/alucantonio/DEAP)
 library, including distributed computing functionalities.

-Besides solving classical symbolic regression problems involving algebraic equations
+Besides solving classical symbolic regression problems involving _algebraic equations_
 (see, for example, the benchmark problems contained in the
 [SRBench](https://github.com/cavalab/srbench) repository), _AlpineGP_ is specifically
-design to help identifying _symbolic_ models of _physical systems_ governed by **field equations**.
+designed to help identifying _symbolic_ models of _physical systems_ governed by **field equations**.
 To this aim, it allows to exploit the **discrete calculus** framework defined and implemented in the library
 [`dctkit`](https://github.com/alucantonio/dctkit) as a natural and effective language to express physical models
 (i.e., conservation laws).
@@ -24,7 +24,7 @@ elastica). Scripts to reproduce these benchmarks can be found [here](https://git
 - scikit-learn compatible interface;
 - hyperparameter configuration via YAML files;
 - support for custom operators (with/without strong-typing);
-- benchmark suite (Nguyen and interface to SRBench)
+- benchmark suite (Nguyen and SRBench)

 ## Installation

@@ -67,112 +67,71 @@ $ ./bench.sh
 ```
 Then process the results using the `process_results` notebook.

-Results on [PMLB](https://epistasislab.github.io/pmlb/) datasets (average $R^2$ over 10
-test sets, no Friedman):
-
-| dataset | mean | median | std |
-|:------------------------------|-----------:|-----------:|-----------:|
-| 527_analcatdata_election2000 | 0.997727 | 0.999273 | 0.00357541 |
-| 663_rabe_266 | 0.994945 | 0.995115 | 0.00134602 |
-| 560_bodyfat | 0.988467 | 0.992938 | 0.0121634 |
-| 505_tecator | 0.986861 | 0.986026 | 0.0039009 |
-| 561_cpu | 0.957349 | 0.967161 | 0.0330056 |
-| 690_visualizing_galaxy | 0.963404 | 0.964137 | 0.00867664 |
-| 197_cpu_act | 0.94309 | 0.945666 | 0.00966613 |
-| 227_cpu_small | 0.946096 | 0.945094 | 0.00812824 |
-| 523_analcatdata_neavote | 0.936577 | 0.943564 | 0.0278365 |
-| 1096_FacultySalaries | 0.662191 | 0.894004 | 0.525012 |
-| 557_analcatdata_apnea1 | 0.881416 | 0.889496 | 0.0397044 |
-| 230_machine_cpu | 0.778943 | 0.879675 | 0.273846 |
-| 556_analcatdata_apnea2 | 0.863157 | 0.867148 | 0.0347729 |
-| 1027_ESL | 0.858838 | 0.860647 | 0.0127587 |
-| 695_chatfield_4 | 0.827457 | 0.830825 | 0.0677194 |
-| 229_pwLinear | 0.810944 | 0.811717 | 0.0453826 |
-| 210_cloud | 0.761678 | 0.786611 | 0.159399 |
-| 529_pollen | 0.787219 | 0.782358 | 0.0118861 |
-| 1089_USCrime | 0.739218 | 0.756442 | 0.117112 |
-| 503_wind | 0.747271 | 0.745787 | 0.0088297 |
-| 712_chscase_geyser1 | 0.751443 | 0.745605 | 0.0549794 |
-| 519_vinnie | 0.728873 | 0.719948 | 0.0377254 |
-| 228_elusage | 0.621403 | 0.714127 | 0.216677 |
-| 659_sleuth_ex1714 | 0.562146 | 0.702428 | 0.309503 |
-| 666_rmftsa_ladata | 0.679718 | 0.672306 | 0.0620477 |
-| 225_puma8NH | 0.66854 | 0.667771 | 0.0127414 |
-| 706_sleuth_case1202 | 0.418764 | 0.568134 | 0.43742 |
-| 1029_LEV | 0.557169 | 0.560547 | 0.0330229 |
-| 547_no2 | 0.50562 | 0.502983 | 0.0920748 |
-| 485_analcatdata_vehicle | 0.244083 | 0.47083 | 0.702171 |
-| 192_vineyard | 0.381856 | 0.38018 | 0.200867 |
-| 1030_ERA | 0.373955 | 0.373216 | 0.0453621 |
-| 1028_SWD | 0.335559 | 0.343532 | 0.0556771 |
-| 542_pollution | 0.170091 | 0.279329 | 0.254557 |
-| 665_sleuth_case2002 | 0.242165 | 0.25769 | 0.146767 |
-| 522_pm10 | 0.235107 | 0.233109 | 0.0445476 |
-| 678_visualizing_environmental | 0.0604016 | 0.193514 | 0.358373 |
-| 687_sleuth_ex1605 | -0.0707247 | -0.0740387 | 0.372597 |
-
-**Median test $R^2$: 0.7683**.
-
 ## Usage

 Setting up a symbolic regression problem in _AlpineGP_ involves several key steps:

 1. Define the function that computes the prediction associated to an _individual_
 (model expression tree). Its arguments may be a _function_ obtained by parsing the
-individual tree and possibly other parameters, such as the dataset needed to evaluate
-the model. It returns both an _error metric_ between the prediction and the data and
-the prediction itself.
+individual tree and possibly other parameters, such as the features (`X`) needed to evaluate
+the model. It returns both the error between the predictions and the labels (`y`) and
+the predictions themselves.
 ```python
-def eval_MSE_sol(individual, dataset):
+def eval_MSE_sol(individual, X, y):

     # ...
     return MSE, prediction
 ```

-1. Define the functions that return the **prediction** and the **fitness**
-associated to an individual. These functions **must** have the same
-arguments. In particular:
-- the first argument is **always** the batch of trees to be evaluated by the
-current worker;
-- the second argument **must** be the `toolbox` object used to compile the
-individual trees into callable functions;
-- the third argument **must** be the dataset needed for the evaluation of the
-individuals.
+2. Define the functions that return the **prediction** and the **fitness**
+associated to an individual. These functions **must** have at least the following
+arguments in the first three positions:
+- the list of trees to be evaluated by the current worker;
+- the `toolbox` object used to compile the individual trees into callable functions;
+- the dataset features needed for the evaluation of the individuals. The name of the argument **must** be `X`.
+Additionally, the fourth argument of the **fitness** function **must** be the dataset
+labels, called `y`. For unsupervised problems, `None` can be passed for the labels to the `fit`
+method of the regressor.
 Both functions **must** be decorated with `ray.remote` to support
-distributed evaluation (multiprocessing).
+distributed evaluation (multiprocessing). Any additional arguments can be set using
+the `common_data` argument of the `GPSymbolicRegressor` object (see below).
 ```python
 @ray.remote
-def predict(trees, toolbox, data):
+def predict(trees, toolbox, X):

     callables = compile_individuals(toolbox, trees)

     preds = [None]*len(trees)

     for i, ind in enumerate(callables):
-        _, preds[i] = eval_MSE_sol(ind, data)
+        _, preds[i] = eval_MSE_sol(ind, X, None)

     return preds

 @ray.remote
-def fitness(trees, toolbox, true_data):
+def fitness(trees, toolbox, X, y):
     callables = compile_individuals(toolbox, trees)

     fitnesses = [None]*len(trees)

     for i, ind in enumerate(callables):
-        MSE, _ = eval_MSE_sol(ind, data)
+        MSE, _ = eval_MSE_sol(ind, X, y)

         # each fitness MUST be a tuple (required by DEAP)
         fitnesses[i] = (MSE,)

     return fitnesses
 ```

-3. Set and solve the symbolic regression problem.
+3. Set up and solve the symbolic regression problem. The configuration of the
+`GPSymbolicRegressor` object can be specified via the arguments of its constructor
+(see the API docs), or loaded from a YAML file.
 ```python
-# read parameters from YAML file
-with open("ex1.yaml") as config_file:
-    config_file_data = yaml.safe_load(config_file)
+# read config parameters from YAML file
+yamlfile = "ex1.yaml"
+filename = os.path.join(os.path.dirname(__file__), yamlfile)
+
+regressor_params, config_file_data = util.load_config_data(filename)

 # ...
 # ...
@@ -192,18 +151,13 @@ common_params = {'penalty': penalty}
 gpsr = gps.GPSymbolicRegressor(pset=pset, fitness=fitness.remote,
                                predict_func=predict.remote, common_data=common_params,
                                print_log=True,
-                               config_file_data=config_file_data)
-
-# wrap tensors corresponding to train and test data into Dataset objects (to be passed to
-# fit and predict methods)
-train_data = Dataset("D", X_train, y_train)
-test_data = Dataset("D", X_test, y_test)
+                               **regressor_params)

 # solve the symbolic regression problem
-gpsr.fit(train_data)
+gpsr.fit(X_train, y_train)

 # compute the prediction on the test dataset given by the best model found during the SR
-pred_test = gpsr.predict(test_data)
+pred_test = gpsr.predict(X_test)
 ```

 A complete example notebook can be found in the `examples` directory. Also check the
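
Note on the README change above: the new text asks `eval_MSE_sol` to take the features `X` and the labels `y` separately and to return both the error and the predictions, with `predict` passing `None` for the labels. A minimal sketch of that contract follows, under stated assumptions: plain Python callables stand in for compiled trees, Ray and the `toolbox` argument are left out, and returning `None` as the error when `y` is `None` is a choice made here for illustration, not something the commit specifies.

```python
import numpy as np


def eval_MSE_sol(individual, X, y):
    """Return (MSE, prediction) for one candidate model."""
    prediction = individual(X)
    if y is None:
        # Assumption: without labels there is no error to report.
        return None, prediction
    mse = float(np.mean((prediction - y) ** 2))
    return mse, prediction


def predict(callables, X):
    # Only the predictions are needed, so no labels are passed.
    return [eval_MSE_sol(ind, X, None)[1] for ind in callables]


def fitness(callables, X, y):
    # Each fitness is a one-element tuple, as DEAP expects.
    return [(eval_MSE_sol(ind, X, y)[0],) for ind in callables]


X = np.linspace(0.0, 1.0, 5)
y = 2.0 * X
candidates = [lambda x: 2.0 * x, lambda x: x + 1.0]

fitnesses = fitness(candidates, X, y)
preds = predict(candidates, X)
```

Here the first candidate reproduces the labels exactly, so its fitness tuple holds a zero error, while the second one gets a positive error.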

bench/results/process_results.ipynb (+1, -1)

@@ -374,7 +374,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.12.8"
+"version": "3.12.5"
 }
 },
 "nbformat": 4,

examples/simple_sr.py (+6, -7)

@@ -1,7 +1,6 @@
 import os
 from deap import gp
 from alpine.gp.regressor import GPSymbolicRegressor
-from alpine.data import Dataset
 import numpy as np
 import ray
 import warnings
@@ -51,33 +50,33 @@ def eval_MSE_sol(individual, X, y):


 @ray.remote
-def predict(individuals_str, toolbox, X_test, penalty):
+def predict(individuals_str, toolbox, X, penalty):

     callables = compile_individuals(toolbox, individuals_str)

     u = [None] * len(individuals_str)

     for i, ind in enumerate(callables):
-        _, u[i] = eval_MSE_sol(ind, X_test, None)
+        _, u[i] = eval_MSE_sol(ind, X, None)

     return u


 @ray.remote
-def score(individuals_str, toolbox, X_test, y_test, penalty):
+def score(individuals_str, toolbox, X, y, penalty):

     callables = compile_individuals(toolbox, individuals_str)

     MSE = [None] * len(individuals_str)

     for i, ind in enumerate(callables):
-        MSE[i], _ = eval_MSE_sol(ind, X_test, y_test)
+        MSE[i], _ = eval_MSE_sol(ind, X, y)

     return MSE


 @ray.remote
-def fitness(individuals_str, toolbox, X_train, y_train, penalty):
+def fitness(individuals_str, toolbox, X, y, penalty):
     callables = compile_individuals(toolbox, individuals_str)

     individ_length, nested_trigs, num_trigs = get_features_batch(individuals_str)
@@ -87,7 +86,7 @@ def fitness(individuals_str, toolbox, X_train, y_train, penalty):
         if individ_length[i] >= 50:
             fitnesses[i] = (1e8,)
         else:
-            MSE, _ = eval_MSE_sol(ind, X_train, y_train)
+            MSE, _ = eval_MSE_sol(ind, X, y)

             fitnesses[i] = (
                 MSE

examples/simple_sr_noyaml.py (+6, -7)

@@ -1,6 +1,5 @@
 from deap import gp
 from alpine.gp.regressor import GPSymbolicRegressor
-from alpine.data import Dataset
 import numpy as np
 import ray
 import warnings
@@ -50,33 +49,33 @@ def eval_MSE_sol(individual, X, y):


 @ray.remote
-def predict(individuals_str, toolbox, X_test, penalty):
+def predict(individuals_str, toolbox, X, penalty):

     callables = compile_individuals(toolbox, individuals_str)

     u = [None] * len(individuals_str)

     for i, ind in enumerate(callables):
-        _, u[i] = eval_MSE_sol(ind, X_test, None)
+        _, u[i] = eval_MSE_sol(ind, X, None)

     return u


 @ray.remote
-def score(individuals_str, toolbox, X_test, y_test, penalty):
+def score(individuals_str, toolbox, X, y, penalty):

     callables = compile_individuals(toolbox, individuals_str)

     MSE = [None] * len(individuals_str)

     for i, ind in enumerate(callables):
-        MSE[i], _ = eval_MSE_sol(ind, X_test, y_test)
+        MSE[i], _ = eval_MSE_sol(ind, X, y)

     return MSE


 @ray.remote
-def fitness(individuals_str, toolbox, X_train, y_train, penalty):
+def fitness(individuals_str, toolbox, X, y, penalty):
     callables = compile_individuals(toolbox, individuals_str)

     individ_length, nested_trigs, num_trigs = get_features_batch(individuals_str)
@@ -86,7 +85,7 @@ def fitness(individuals_str, toolbox, X_train, y_train, penalty):
         if individ_length[i] >= 50:
             fitnesses[i] = (1e8,)
         else:
-            MSE, _ = eval_MSE_sol(ind, X_train, y_train)
+            MSE, _ = eval_MSE_sol(ind, X, y)

             fitnesses[i] = (
                 MSE
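
Note on the two example scripts above: both pass `penalty` as an extra argument after `X` and `y`, and the README change says such extras are supplied through `common_data`. The sketch below illustrates one way this could fit together. Everything beyond the `>= 50` length cutoff and the `(1e8,)` fitness visible in the hunk is an assumption: the `lengths` argument replaces `get_features_batch`, the `penalty * length` term is a hypothetical stand-in for the formula truncated in the diff, and the keyword-unpacking call only imitates how extra arguments might be forwarded.

```python
import numpy as np


def fitness(callables, lengths, X, y, penalty):
    fitnesses = [None] * len(callables)
    for i, ind in enumerate(callables):
        if lengths[i] >= 50:
            # Overly long individuals get a large fixed fitness.
            fitnesses[i] = (1e8,)
        else:
            mse = float(np.mean((ind(X) - y) ** 2))
            # Hypothetical penalty term; the real formula is not shown in the diff.
            fitnesses[i] = (mse + penalty * lengths[i],)
    return fitnesses


X = np.array([0.0, 1.0, 2.0])
y = X**2
common_data = {"penalty": 0.1}  # extra arguments, passed by keyword

result = fitness([lambda x: x**2, lambda x: x], [3, 60], X=X, y=y, **common_data)
```

Using the fixed names `X` and `y` for the data arguments lets a caller pass them by keyword alongside the unpacked extras without clashing.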

src/alpine/gp/regressor.py (+2, -2)

@@ -738,7 +738,7 @@ def save_train_fit_history(self, output_path: str):
         if self.validate:
             np.save(join(output_path, "val_fit_history.npy"), self.val_fit_history)

-    def save_best_test_sols(self, test_data: Dataset, output_path: str):
+    def save_best_test_sols(self, X_test, output_path: str):
         """Compute and save the predictions corresponding to the best individual
         at the end of the evolution, evaluated over the test dataset.

@@ -747,7 +747,7 @@ def save_best_test_sols(self, test_data: Dataset, output_path: str):
             output_path: path where the predictions should be saved (one .npy file for
                 each sample in the test dataset).
         """
-        best_test_sols = self.predict(test_data)
+        best_test_sols = self.predict(X_test)

         for i, sol in enumerate(best_test_sols):
             np.save(join(output_path, "best_sol_test_" + str(i) + ".npy"), sol)
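
Note on the change above: `save_best_test_sols` now takes the test features directly, predicts, and writes one `.npy` file per prediction. A standalone sketch of that save loop follows, assuming the predictions are already available as a list of arrays. The function name `save_predictions` is hypothetical and not part of the library.

```python
import os
import tempfile

import numpy as np


def save_predictions(predictions, output_path):
    """Write each prediction to best_sol_test_<i>.npy under output_path."""
    paths = []
    for i, sol in enumerate(predictions):
        path = os.path.join(output_path, "best_sol_test_" + str(i) + ".npy")
        np.save(path, sol)
        paths.append(path)
    return paths


# Round trip through a temporary directory to check the files are readable.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_predictions([np.zeros(3), np.ones(2)], tmp)
    reloaded = [np.load(p) for p in saved]
```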

tests/test_basic_sr.py (-4)

@@ -2,7 +2,6 @@
 from dctkit import config
 from deap import gp
 from alpine.gp.regressor import GPSymbolicRegressor
-from alpine.data import Dataset
 from alpine.gp import util
 import jax.numpy as jnp
 import ray
@@ -113,13 +112,10 @@ def test_basic_sr(set_test_dir):
         **regressor_params
     )

-    # train_data = Dataset("true_data", x, y)
     gpsr.fit(x, y)

     fit_score = gpsr.score(x, y)

-    y_pred = gpsr.predict(x)
-
     ray.shutdown()

     assert fit_score <= 1e-12

tests/test_poisson1d.py (+1, -2)

@@ -4,7 +4,6 @@
 from dctkit.math.opt import optctrl as oc
 from deap import gp
 from alpine.gp import regressor as gps
-from alpine.data import Dataset
 from dctkit import config
 import dctkit
 import numpy as np
@@ -229,7 +228,7 @@ def test_poisson1d(set_test_dir, yamlfile):

     fit_score = gpsr.score(X_train, y_train)

-    # gpsr.save_best_test_sols(train_data, "./")
+    gpsr.save_best_test_sols(X_train, "./")

     ray.shutdown()
     assert np.allclose(u.coeffs.flatten(), np.ravel(u_best))
