Updated README.

Alessandro Lucantonio · commit 2574a0800c6c · 2025-03-07T15:20:49.000+01:00
diff --git a/README.md b/README.md
@@ -6,10 +6,10 @@ _AlpineGP_ is a Python library for **symbolic regression** via _Genetic Programm
 It provides a high-level interface to the [`DEAP`](https://github.com/alucantonio/DEAP)
 library, including distributed computing functionalities.
 
-Besides solving classical symbolic regression problems involving algebraic equations
+Besides solving classical symbolic regression problems involving _algebraic equations_
 (see, for example, the benchmark problems contained in the
 [SRBench](https://github.com/cavalab/srbench) repository), _AlpineGP_ is specifically
-design to help identifying _symbolic_ models of _physical systems_ governed by **field equations**.
+designed to help identifying _symbolic_ models of _physical systems_ governed by **field equations**.
 To this aim, it allows to exploit the **discrete calculus** framework defined and implemented in the library
 [`dctkit`](https://github.com/alucantonio/dctkit) as a natural and effective language to express physical models
 (i.e., conservation laws).
@@ -24,7 +24,7 @@ elastica). Scripts to reproduce these benchmarks can be found [here](https://git
 - scikit-learn compatible interface;
 - hyperparameter configuration via YAML files;
 - support for custom operators (with/without strong-typing);
-- benchmark suite (Nguyen and interface to SRBench) 
+- benchmark suite (Nguyen and SRBench) 
 
 ## Installation
 
@@ -67,112 +67,71 @@ $ ./bench.sh
 ```
 Then process the results using the `process_results` notebook.
 
-Results on [PMLB](https://epistasislab.github.io/pmlb/) datasets (average $R^2$ over 10
-test sets, no Friedman):
-
-| dataset                       |       mean |     median |        std |
-|:------------------------------|-----------:|-----------:|-----------:|
-| 527_analcatdata_election2000  |  0.997727  |  0.999273  | 0.00357541 |
-| 663_rabe_266                  |  0.994945  |  0.995115  | 0.00134602 |
-| 560_bodyfat                   |  0.988467  |  0.992938  | 0.0121634  |
-| 505_tecator                   |  0.986861  |  0.986026  | 0.0039009  |
-| 561_cpu                       |  0.957349  |  0.967161  | 0.0330056  |
-| 690_visualizing_galaxy        |  0.963404  |  0.964137  | 0.00867664 |
-| 197_cpu_act                   |  0.94309   |  0.945666  | 0.00966613 |
-| 227_cpu_small                 |  0.946096  |  0.945094  | 0.00812824 |
-| 523_analcatdata_neavote       |  0.936577  |  0.943564  | 0.0278365  |
-| 1096_FacultySalaries          |  0.662191  |  0.894004  | 0.525012   |
-| 557_analcatdata_apnea1        |  0.881416  |  0.889496  | 0.0397044  |
-| 230_machine_cpu               |  0.778943  |  0.879675  | 0.273846   |
-| 556_analcatdata_apnea2        |  0.863157  |  0.867148  | 0.0347729  |
-| 1027_ESL                      |  0.858838  |  0.860647  | 0.0127587  |
-| 695_chatfield_4               |  0.827457  |  0.830825  | 0.0677194  |
-| 229_pwLinear                  |  0.810944  |  0.811717  | 0.0453826  |
-| 210_cloud                     |  0.761678  |  0.786611  | 0.159399   |
-| 529_pollen                    |  0.787219  |  0.782358  | 0.0118861  |
-| 1089_USCrime                  |  0.739218  |  0.756442  | 0.117112   |
-| 503_wind                      |  0.747271  |  0.745787  | 0.0088297  |
-| 712_chscase_geyser1           |  0.751443  |  0.745605  | 0.0549794  |
-| 519_vinnie                    |  0.728873  |  0.719948  | 0.0377254  |
-| 228_elusage                   |  0.621403  |  0.714127  | 0.216677   |
-| 659_sleuth_ex1714             |  0.562146  |  0.702428  | 0.309503   |
-| 666_rmftsa_ladata             |  0.679718  |  0.672306  | 0.0620477  |
-| 225_puma8NH                   |  0.66854   |  0.667771  | 0.0127414  |
-| 706_sleuth_case1202           |  0.418764  |  0.568134  | 0.43742    |
-| 1029_LEV                      |  0.557169  |  0.560547  | 0.0330229  |
-| 547_no2                       |  0.50562   |  0.502983  | 0.0920748  |
-| 485_analcatdata_vehicle       |  0.244083  |  0.47083   | 0.702171   |
-| 192_vineyard                  |  0.381856  |  0.38018   | 0.200867   |
-| 1030_ERA                      |  0.373955  |  0.373216  | 0.0453621  |
-| 1028_SWD                      |  0.335559  |  0.343532  | 0.0556771  |
-| 542_pollution                 |  0.170091  |  0.279329  | 0.254557   |
-| 665_sleuth_case2002           |  0.242165  |  0.25769   | 0.146767   |
-| 522_pm10                      |  0.235107  |  0.233109  | 0.0445476  |
-| 678_visualizing_environmental |  0.0604016 |  0.193514  | 0.358373   |
-| 687_sleuth_ex1605             | -0.0707247 | -0.0740387 | 0.372597   |
-
-**Median test $R^2$: 0.7683**.
-
 ## Usage
 
 Setting up a symbolic regression problem in _AlpineGP_ involves several key steps:
 
 1. Define the function that computes the prediction associated to an _individual_
 (model expression tree). Its arguments may be a _function_ obtained by parsing the
-individual tree and possibly other parameters, such as the dataset needed to evaluate
-the model. It returns both an _error metric_ between the prediction and the data and
-the prediction itself. 
+individual tree and possibly other parameters, such as the features (`X`) needed to evaluate
+the model. It returns both the error between the predictions and the labels (`y`) and
+the predictions themselves. 
 ```python
-def eval_MSE_sol(individual, dataset):
+def eval_MSE_sol(individual, X, y):
 
     # ...
     return MSE, prediction
 ```
 
-1. Define the functions that return the **prediction** and the **fitness** 
-   associated to an individual. These functions **must** have the same
-   arguments. In particular:
-   - the first argument is **always** the batch of trees to be evaluated by the
-     current worker;
-   - the second argument **must** be the `toolbox` object used to compile the 
-     individual trees into callable functions;
-   - the third argument **must** be the dataset needed for the evaluation of the
-     individuals.
+2. Define the functions that return the **prediction** and the **fitness** 
+   associated to an individual. These functions **must** have at least the following
+   arguments in the first three positions:
+   - the list of trees to be evaluated by the current worker;
+   - the `toolbox` object used to compile the individual trees into callable functions;
+   - the dataset features needed for the evaluation of the individuals. The name of the argument **must** be `X`.
+   Additionally, the fourth argument of the **fitness** function **must** be the dataset
+   labels, called `y`. For unsupervised problems, `None` can be passed for the labels to the `fit`
+   method of the regressor.
    Both functions **must** be decorated with `ray.remote` to support
-   distributed evaluation (multiprocessing).
+   distributed evaluation (multiprocessing). Any additional arguments can be set using
+   the `common_data` argument of the `GPSymbolicRegressor` object (see below). 
 ```python
 @ray.remote
-def predict(trees, toolbox, data):
+def predict(trees, toolbox, X):
 
     callables = compile_individuals(toolbox, trees)
 
     preds = [None]*len(trees)
 
     for i, ind in enumerate(callables):
-        _, preds[i] = eval_MSE_sol(ind, data)
+        _, preds[i] = eval_MSE_sol(ind, X, None)
 
     return preds
 
 @ray.remote
-def fitness(trees, toolbox, true_data):
+def fitness(trees, toolbox, X, y):
     callables = compile_individuals(toolbox, trees)
 
     fitnesses = [None]*len(trees)
 
     for i, ind in enumerate(callables):
-        MSE, _ = eval_MSE_sol(ind, data)
+        MSE, _ = eval_MSE_sol(ind, X, y)
         
         # each fitness MUST be a tuple (required by DEAP)
         fitnesses[i] = (MSE,)
 
     return fitnesses
 ```
 
-3. Set and solve the symbolic regression problem.
+3. Set up and solve the symbolic regression problem. The configuration of the
+   `GPSymbolicRegressor` object can be specified via the arguments of its constructor
+   (see the API docs), or loaded from a YAML file.
 ```python
-# read parameters from YAML file
-with open("ex1.yaml") as config_file:
-    config_file_data = yaml.safe_load(config_file)
+# read config parameters from YAML file
+yamlfile = "ex1.yaml"
+filename = os.path.join(os.path.dirname(__file__), yamlfile)
+
+regressor_params, config_file_data = util.load_config_data(filename)
 
 # ...
 # ...
@@ -192,18 +151,13 @@ common_params = {'penalty': penalty}
 gpsr = gps.GPSymbolicRegressor(pset=pset, fitness=fitness.remote,
                                predict_func=predict.remote, common_data=common_params,
                                print_log=True, 
-                               config_file_data=config_file_data)
-
-# wrap tensors corresponding to train and test data into Dataset objects (to be passed to
-# fit and predict methods)
-train_data = Dataset("D", X_train, y_train)
-test_data = Dataset("D", X_test, y_test)
+                               **regressor_params)
 
 # solve the symbolic regression problem
-gpsr.fit(train_data)
+gpsr.fit(X_train, y_train)
 
 # compute the prediction on the test dataset given by the best model found during the SR
-pred_test = gpsr.predict(test_data)
+pred_test = gpsr.predict(X_test)
 ```
 
 A complete example notebook can be found in the `examples` directory. Also check the
diff --git a/bench/results/process_results.ipynb b/bench/results/process_results.ipynb
@@ -374,7 +374,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.12.8"
+   "version": "3.12.5"
   }
  },
  "nbformat": 4,
diff --git a/examples/simple_sr.py b/examples/simple_sr.py
@@ -1,7 +1,6 @@
 import os
 from deap import gp
 from alpine.gp.regressor import GPSymbolicRegressor
-from alpine.data import Dataset
 import numpy as np
 import ray
 import warnings
@@ -51,33 +50,33 @@ def eval_MSE_sol(individual, X, y):
 
 
 @ray.remote
-def predict(individuals_str, toolbox, X_test, penalty):
+def predict(individuals_str, toolbox, X, penalty):
 
     callables = compile_individuals(toolbox, individuals_str)
 
     u = [None] * len(individuals_str)
 
     for i, ind in enumerate(callables):
-        _, u[i] = eval_MSE_sol(ind, X_test, None)
+        _, u[i] = eval_MSE_sol(ind, X, None)
 
     return u
 
 
 @ray.remote
-def score(individuals_str, toolbox, X_test, y_test, penalty):
+def score(individuals_str, toolbox, X, y, penalty):
 
     callables = compile_individuals(toolbox, individuals_str)
 
     MSE = [None] * len(individuals_str)
 
     for i, ind in enumerate(callables):
-        MSE[i], _ = eval_MSE_sol(ind, X_test, y_test)
+        MSE[i], _ = eval_MSE_sol(ind, X, y)
 
     return MSE
 
 
 @ray.remote
-def fitness(individuals_str, toolbox, X_train, y_train, penalty):
+def fitness(individuals_str, toolbox, X, y, penalty):
     callables = compile_individuals(toolbox, individuals_str)
 
     individ_length, nested_trigs, num_trigs = get_features_batch(individuals_str)
@@ -87,7 +86,7 @@ def fitness(individuals_str, toolbox, X_train, y_train, penalty):
         if individ_length[i] >= 50:
             fitnesses[i] = (1e8,)
         else:
-            MSE, _ = eval_MSE_sol(ind, X_train, y_train)
+            MSE, _ = eval_MSE_sol(ind, X, y)
 
             fitnesses[i] = (
                 MSE
diff --git a/examples/simple_sr_noyaml.py b/examples/simple_sr_noyaml.py
@@ -1,6 +1,5 @@
 from deap import gp
 from alpine.gp.regressor import GPSymbolicRegressor
-from alpine.data import Dataset
 import numpy as np
 import ray
 import warnings
@@ -50,33 +49,33 @@ def eval_MSE_sol(individual, X, y):
 
 
 @ray.remote
-def predict(individuals_str, toolbox, X_test, penalty):
+def predict(individuals_str, toolbox, X, penalty):
 
     callables = compile_individuals(toolbox, individuals_str)
 
     u = [None] * len(individuals_str)
 
     for i, ind in enumerate(callables):
-        _, u[i] = eval_MSE_sol(ind, X_test, None)
+        _, u[i] = eval_MSE_sol(ind, X, None)
 
     return u
 
 
 @ray.remote
-def score(individuals_str, toolbox, X_test, y_test, penalty):
+def score(individuals_str, toolbox, X, y, penalty):
 
     callables = compile_individuals(toolbox, individuals_str)
 
     MSE = [None] * len(individuals_str)
 
     for i, ind in enumerate(callables):
-        MSE[i], _ = eval_MSE_sol(ind, X_test, y_test)
+        MSE[i], _ = eval_MSE_sol(ind, X, y)
 
     return MSE
 
 
 @ray.remote
-def fitness(individuals_str, toolbox, X_train, y_train, penalty):
+def fitness(individuals_str, toolbox, X, y, penalty):
     callables = compile_individuals(toolbox, individuals_str)
 
     individ_length, nested_trigs, num_trigs = get_features_batch(individuals_str)
@@ -86,7 +85,7 @@ def fitness(individuals_str, toolbox, X_train, y_train, penalty):
         if individ_length[i] >= 50:
             fitnesses[i] = (1e8,)
         else:
-            MSE, _ = eval_MSE_sol(ind, X_train, y_train)
+            MSE, _ = eval_MSE_sol(ind, X, y)
 
             fitnesses[i] = (
                 MSE
diff --git a/src/alpine/gp/regressor.py b/src/alpine/gp/regressor.py
@@ -738,7 +738,7 @@ def save_train_fit_history(self, output_path: str):
         if self.validate:
             np.save(join(output_path, "val_fit_history.npy"), self.val_fit_history)
 
-    def save_best_test_sols(self, test_data: Dataset, output_path: str):
+    def save_best_test_sols(self, X_test, output_path: str):
         """Compute and save the predictions corresponding to the best individual
         at the end of the evolution, evaluated over the test dataset.
 
@@ -747,7 +747,7 @@ def save_best_test_sols(self, test_data: Dataset, output_path: str):
             output_path: path where the predictions should be saved (one .npy file for
                 each sample in the test dataset).
         """
-        best_test_sols = self.predict(test_data)
+        best_test_sols = self.predict(X_test)
 
         for i, sol in enumerate(best_test_sols):
             np.save(join(output_path, "best_sol_test_" + str(i) + ".npy"), sol)
diff --git a/tests/test_basic_sr.py b/tests/test_basic_sr.py
@@ -2,7 +2,6 @@
 from dctkit import config
 from deap import gp
 from alpine.gp.regressor import GPSymbolicRegressor
-from alpine.data import Dataset
 from alpine.gp import util
 import jax.numpy as jnp
 import ray
@@ -113,13 +112,10 @@ def test_basic_sr(set_test_dir):
         **regressor_params
     )
 
-    # train_data = Dataset("true_data", x, y)
     gpsr.fit(x, y)
 
     fit_score = gpsr.score(x, y)
 
-    y_pred = gpsr.predict(x)
-
     ray.shutdown()
 
     assert fit_score <= 1e-12
diff --git a/tests/test_poisson1d.py b/tests/test_poisson1d.py
@@ -4,7 +4,6 @@
 from dctkit.math.opt import optctrl as oc
 from deap import gp
 from alpine.gp import regressor as gps
-from alpine.data import Dataset
 from dctkit import config
 import dctkit
 import numpy as np
@@ -229,7 +228,7 @@ def test_poisson1d(set_test_dir, yamlfile):
 
     fit_score = gpsr.score(X_train, y_train)
 
-    # gpsr.save_best_test_sols(train_data, "./")
+    gpsr.save_best_test_sols(X_train, "./")
 
     ray.shutdown()
     assert np.allclose(u.coeffs.flatten(), np.ravel(u_best))
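
Note on the calling convention changed by this commit: the `Dataset` wrapper is replaced throughout by separate features `X` and labels `y`, with `None` passed for the labels when only predictions are needed. The sketch below illustrates that convention in plain Python, without Ray, DEAP, or AlpineGP. It is an assumption-laden stand-in, not the library's implementation: the bodies of `eval_MSE_sol`, `predict`, and `fitness` are hypothetical, the individuals are ordinary Python callables rather than compiled trees, and the `toolbox` argument is omitted.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def eval_MSE_sol(
    individual: Callable[[float], float],
    X: Sequence[float],
    y: Optional[Sequence[float]],
) -> Tuple[Optional[float], List[float]]:
    """Evaluate one candidate model on the features X.

    Returns (MSE, predictions). The MSE is None when no labels are given,
    which is an assumption made for this sketch.
    """
    predictions = [individual(x) for x in X]
    if y is None:
        return None, predictions
    mse = sum((p - t) ** 2 for p, t in zip(predictions, y)) / len(predictions)
    return mse, predictions


def predict(callables, X):
    # Labels are not needed for prediction, so None is passed for y.
    return [eval_MSE_sol(ind, X, None)[1] for ind in callables]


def fitness(callables, X, y):
    # Each fitness is a one-element tuple, as DEAP requires.
    return [(eval_MSE_sol(ind, X, y)[0],) for ind in callables]


if __name__ == "__main__":
    # Hypothetical candidates standing in for compiled individuals.
    candidates = [lambda x: 2.0 * x, lambda x: x + 1.0]
    X_train = [0.0, 1.0, 2.0]
    y_train = [0.0, 2.0, 4.0]
    print(predict(candidates, X_train))
    print(fitness(candidates, X_train, y_train))
```

In the actual examples, `predict` and `fitness` are additionally decorated with `ray.remote`, receive `toolbox` as their second argument, and take extra parameters such as `penalty` through `common_data`.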
