Sklearn compatible interface. #3

Open · wants to merge 19 commits into base: master
3 changes: 2 additions & 1 deletion .github/workflows/tests.yml
@@ -9,7 +9,7 @@ jobs:
    runs-on: ubuntu-latest
    strategy:
      matrix:
-       python-version: ['3.10']
+       python-version: ['3.12']

    steps:
      - uses: actions/checkout@v3
@@ -24,6 +24,7 @@ jobs:
          conda config --set solver libmamba
          python -m pip install tox-conda
          python -m pip install flake8
+         python -m pip install setuptools
      - name: Linting with flake8
        run: |
          flake8
115 changes: 35 additions & 80 deletions README.md
@@ -6,10 +6,10 @@ _AlpineGP_ is a Python library for **symbolic regression** via _Genetic Programm
It provides a high-level interface to the [`DEAP`](https://github.com/alucantonio/DEAP)
library, including distributed computing functionalities.

-Besides solving classical symbolic regression problems involving algebraic equations
+Besides solving classical symbolic regression problems involving _algebraic equations_
(see, for example, the benchmark problems contained in the
[SRBench](https://github.com/cavalab/srbench) repository), _AlpineGP_ is specifically
-design to help identifying _symbolic_ models of _physical systems_ governed by **field equations**.
+designed to help identify _symbolic_ models of _physical systems_ governed by **field equations**.
To this aim, it allows one to exploit the **discrete calculus** framework defined and implemented in the library
[`dctkit`](https://github.com/alucantonio/dctkit) as a natural and effective language to express physical models
(i.e., conservation laws).
@@ -24,7 +24,7 @@ elastica). Scripts to reproduce these benchmarks can be found [here](https://git
- scikit-learn compatible interface;
- hyperparameter configuration via YAML files;
- support for custom operators (with/without strong-typing);
-- benchmark suite (Nguyen and interface to SRBench)
+- benchmark suite (Nguyen and SRBench)

## Installation

@@ -67,112 +67,72 @@ $ ./bench.sh
```
Then process the results using the `process_results` notebook.

Results on [PMLB](https://epistasislab.github.io/pmlb/) datasets (average $R^2$ over 10
test sets; Friedman datasets excluded):

| dataset | mean | median | std |
|:------------------------------|-----------:|-----------:|-----------:|
| 527_analcatdata_election2000 | 0.997727 | 0.999273 | 0.00357541 |
| 663_rabe_266 | 0.994945 | 0.995115 | 0.00134602 |
| 560_bodyfat | 0.988467 | 0.992938 | 0.0121634 |
| 505_tecator | 0.986861 | 0.986026 | 0.0039009 |
| 561_cpu | 0.957349 | 0.967161 | 0.0330056 |
| 690_visualizing_galaxy | 0.963404 | 0.964137 | 0.00867664 |
| 197_cpu_act | 0.94309 | 0.945666 | 0.00966613 |
| 227_cpu_small | 0.946096 | 0.945094 | 0.00812824 |
| 523_analcatdata_neavote | 0.936577 | 0.943564 | 0.0278365 |
| 1096_FacultySalaries | 0.662191 | 0.894004 | 0.525012 |
| 557_analcatdata_apnea1 | 0.881416 | 0.889496 | 0.0397044 |
| 230_machine_cpu | 0.778943 | 0.879675 | 0.273846 |
| 556_analcatdata_apnea2 | 0.863157 | 0.867148 | 0.0347729 |
| 1027_ESL | 0.858838 | 0.860647 | 0.0127587 |
| 695_chatfield_4 | 0.827457 | 0.830825 | 0.0677194 |
| 229_pwLinear | 0.810944 | 0.811717 | 0.0453826 |
| 210_cloud | 0.761678 | 0.786611 | 0.159399 |
| 529_pollen | 0.787219 | 0.782358 | 0.0118861 |
| 1089_USCrime | 0.739218 | 0.756442 | 0.117112 |
| 503_wind | 0.747271 | 0.745787 | 0.0088297 |
| 712_chscase_geyser1 | 0.751443 | 0.745605 | 0.0549794 |
| 519_vinnie | 0.728873 | 0.719948 | 0.0377254 |
| 228_elusage | 0.621403 | 0.714127 | 0.216677 |
| 659_sleuth_ex1714 | 0.562146 | 0.702428 | 0.309503 |
| 666_rmftsa_ladata | 0.679718 | 0.672306 | 0.0620477 |
| 225_puma8NH | 0.66854 | 0.667771 | 0.0127414 |
| 706_sleuth_case1202 | 0.418764 | 0.568134 | 0.43742 |
| 1029_LEV | 0.557169 | 0.560547 | 0.0330229 |
| 547_no2 | 0.50562 | 0.502983 | 0.0920748 |
| 485_analcatdata_vehicle | 0.244083 | 0.47083 | 0.702171 |
| 192_vineyard | 0.381856 | 0.38018 | 0.200867 |
| 1030_ERA | 0.373955 | 0.373216 | 0.0453621 |
| 1028_SWD | 0.335559 | 0.343532 | 0.0556771 |
| 542_pollution | 0.170091 | 0.279329 | 0.254557 |
| 665_sleuth_case2002 | 0.242165 | 0.25769 | 0.146767 |
| 522_pm10 | 0.235107 | 0.233109 | 0.0445476 |
| 678_visualizing_environmental | 0.0604016 | 0.193514 | 0.358373 |
| 687_sleuth_ex1605 | -0.0707247 | -0.0740387 | 0.372597 |

**Median test $R^2$: 0.7683**.

## Usage

Setting up a symbolic regression problem in _AlpineGP_ involves several key steps:

1. Define the function that computes the prediction associated with an _individual_
   (model expression tree). Its arguments may be a _function_ obtained by parsing the
-  individual tree and possibly other parameters, such as the dataset needed to evaluate
-  the model. It returns both an _error metric_ between the prediction and the data and
-  the prediction itself.
+  individual tree and possibly other parameters, such as the features (`X`) needed to evaluate
+  the model. It returns both the error between the predictions and the labels (`y`) and
+  the predictions themselves.
```python
-def eval_MSE_sol(individual, dataset):
+def eval_MSE_sol(individual, X, y):

    # ...
    return MSE, prediction
```
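
For a plain tabular regression problem, a minimal version of this function could look like the sketch below. The body is illustrative only: it assumes that the compiled `individual` is a callable mapping the whole feature matrix `X` to a vector of predictions, which may differ from how a specific problem parses its trees.
```python
import numpy as np

def eval_MSE_sol(individual, X, y):
    # `individual` is assumed to be a callable obtained from the compiled tree,
    # mapping the feature matrix X to a vector of predictions
    prediction = individual(X)

    MSE = None
    if y is not None:
        # mean squared error between predictions and labels
        MSE = float(np.mean((prediction - y) ** 2))

    return MSE, prediction
```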

-1. Define the functions that return the **prediction** and the **fitness**
-   associated to an individual. These functions **must** have the same
-   arguments. In particular:
-   - the first argument is **always** the batch of trees to be evaluated by the
-     current worker;
-   - the second argument **must** be the `toolbox` object used to compile the
-     individual trees into callable functions;
-   - the third argument **must** be the dataset needed for the evaluation of the
-     individuals.
-   Both functions **must** be decorated with `ray.remote` to support
-   distributed evaluation (multiprocessing).
+2. Define the functions that return the **prediction** and the **fitness**
+   associated with an individual. These functions **must** have at least the following
+   arguments in the first three positions:
+   - the list of trees to be evaluated by the current worker;
+   - the `toolbox` object used to compile the individual trees into callable functions;
+   - the dataset features needed for the evaluation of the individuals. The name of the
+     argument **must** be `X`.
+
+   Additionally, the fourth argument of the **fitness** function **must** be the dataset
+   labels, called `y`. For unsupervised problems, `None` can be passed as the labels to
+   the `fit` method of the regressor. Both functions **must** be decorated with `ray.remote`
+   to support distributed evaluation (multiprocessing). Any additional arguments can be set
+   using the `common_data` argument of the `GPSymbolicRegressor` object (see below).
```python
@ray.remote
-def predict(trees, toolbox, data):
+def predict(trees, toolbox, X):

    callables = compile_individuals(toolbox, trees)

    preds = [None]*len(trees)

    for i, ind in enumerate(callables):
-        _, preds[i] = eval_MSE_sol(ind, data)
+        _, preds[i] = eval_MSE_sol(ind, X, None)

    return preds

@ray.remote
-def fitness(trees, toolbox, true_data):
+def fitness(trees, toolbox, X, y):
    callables = compile_individuals(toolbox, trees)

    fitnesses = [None]*len(trees)

    for i, ind in enumerate(callables):
-        MSE, _ = eval_MSE_sol(ind, data)
+        MSE, _ = eval_MSE_sol(ind, X, y)

        # each fitness MUST be a tuple (required by DEAP)
        fitnesses[i] = (MSE,)

    return fitnesses
```
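
As an example of the `common_data` mechanism, the hypothetical variant below adds a `penalty` argument (matching the `common_params = {'penalty': penalty}` dictionary used in the next step) to penalize large trees; whether extra entries are forwarded by name after `X` and `y` is an assumption here, not a guarantee of the library's API.
```python
@ray.remote
def fitness(trees, toolbox, X, y, penalty):
    # `penalty` is an extra argument supplied via
    # common_data={'penalty': penalty} when constructing the regressor
    callables = compile_individuals(toolbox, trees)

    fitnesses = [None]*len(trees)

    for i, ind in enumerate(callables):
        MSE, _ = eval_MSE_sol(ind, X, y)

        # penalize large trees to discourage bloat (each fitness is a tuple)
        fitnesses[i] = (MSE + penalty*len(trees[i]),)

    return fitnesses
```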

-3. Set and solve the symbolic regression problem.
+3. Set up and solve the symbolic regression problem. The configuration of the
+   `GPSymbolicRegressor` object can be specified via the arguments of its constructor
+   (see the API docs), or loaded from a YAML file.
```python
-# read parameters from YAML file
-with open("ex1.yaml") as config_file:
-    config_file_data = yaml.safe_load(config_file)
+# read config parameters from YAML file
+yamlfile = "ex1.yaml"
+filename = os.path.join(os.path.dirname(__file__), yamlfile)
+
+regressor_params, config_file_data = util.load_config_data(filename)

# ...
# ...
@@ -192,18 +152,13 @@ common_params = {'penalty': penalty}
gpsr = gps.GPSymbolicRegressor(pset=pset, fitness=fitness.remote,
                               predict_func=predict.remote, common_data=common_params,
                               print_log=True,
-                              config_file_data=config_file_data)
-
-# wrap tensors corresponding to train and test data into Dataset objects (to be passed to
-# fit and predict methods)
-train_data = Dataset("D", X_train, y_train)
-test_data = Dataset("D", X_test, y_test)
+                              **regressor_params)

# solve the symbolic regression problem
-gpsr.fit(train_data)
+gpsr.fit(X_train, y_train)

# compute the prediction on the test dataset given by the best model found during the SR
-pred_test = gpsr.predict(test_data)
+pred_test = gpsr.predict(X_test)
```
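
Since the interface follows the scikit-learn conventions, the resulting predictions can be evaluated with the usual scikit-learn metrics (assuming `y_test` holds the test labels):
```python
from sklearn.metrics import r2_score

# score the best model found by GP on the held-out test set,
# using the same metric reported in the PMLB table above
print(f"Test R^2: {r2_score(y_test, pred_test):.4f}")
```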

A complete example notebook can be found in the `examples` directory. Also check the