Updated README.

alucantonio · alucantonio · commit 2549faf45a31 · 2024-05-08T09:32:54.000+02:00
diff --git a/README.md b/README.md
@@ -2,12 +2,20 @@
 
 # AlpineGP
 
-_AlpineGP_ is a Python library that helps to build algorithms that can identify _symbolic_ models
-of _physical systems_ starting from data. It performs **symbolic regression** using a
-_strongly-typed genetic programming_ approach implemented in the [`DEAP`](https://github.com/alucantonio/DEAP)
-library. As a natural language for expressing physical models, it leverages the
-**discrete calculus** framework
-defined and implemented in the library [`dctkit`](https://github.com/alucantonio/dctkit).
+_AlpineGP_ is a Python library that solves **symbolic regression** problems using
+_Genetic Programming_. It provides a high-level interface to the
+[`DEAP`](https://github.com/alucantonio/DEAP) library and leverages the high-performance
+distributed computing functionalities provided by the [`ray`](https://www.ray.io) library. 
+
+Besides solving classical symbolic regression problems involving algebraic equations
+(see, for example, the benchmark problems contained in the
+[SRBench](https://github.com/cavalab/srbench)
+repository), _AlpineGP_ is specifically designed to help identify _interpretable_,
+_symbolic_ models of _physical systems_ starting from data. To this end, it uses a
+**discrete calculus** framework as a natural and effective language for expressing
+physical models (i.e., conservation laws). The framework includes tools from discrete
+differential geometry and discrete exterior calculus, and is defined and implemented
+in the library [`dctkit`](https://github.com/alucantonio/dctkit).
 
 _AlpineGP_ has been introduced in the paper [_Discovering interpretable physical models
 with symbolic regression and discrete exterior calculus_](https://iopscience.iop.org/article/10.1088/2632-2153/ad1af2),
@@ -51,40 +59,55 @@ $ tox -e docs
 
 Setting up a symbolic regression problem in _AlpineGP_ involves several key steps:
 
-1. Define the function that computes the prediction associated to an _individual_ (model expression tree).
-Its arguments are a _function_ obtained by parsing the individual tree and possibly other
-parameters (datasets to compare the individual with). It returns both an _error metric_ between
-the prediction and the data and the prediction itself. 
+1. Define the function that computes the prediction associated with an _individual_
+(model expression tree). Its arguments are a _function_ obtained by parsing the
+individual tree and, optionally, other parameters, such as the dataset needed to
+evaluate the model. It returns both an _error metric_ between the prediction and the
+data and the prediction itself.
 ```python
-def eval_MSE_sol(individual: Callable, D: Dataset):
+def eval_MSE_sol(individual, dataset):
 
     # ...
     return MSE, prediction
 ```
 
-2. Define the functions that return the **prediction** and the **fitness** 
+1. Define the functions that return the **prediction** and the **fitness** 
    associated to an individual. These functions **must** have the same
-   arguments. The first argument is **always** the `Callable` that represents the
-   individual tree. The functions **must** be decorated with `ray.remote` to support
+   arguments. In particular:
+   - the first argument is **always** the batch of trees to be evaluated by the
+     current worker;
+   - the second argument **must** be the `toolbox` object used to compile the 
+     individual trees into callable functions;
+   - the third argument **must** be the dataset needed for the evaluation of the
+     individuals.
+   Both functions **must** be decorated with `ray.remote` to support
    distributed evaluation (multiprocessing).
 ```python
 @ray.remote
-def predict(individual: Callable, indlen: int, D: Dataset, penalty: float) -> float:
+def predict(trees, toolbox, data):
 
-    _, pred = eval_MSE_sol(individual, D)
+    callables = compile_individuals(toolbox, trees)
 
-    return pred
+    preds = [None]*len(trees)
+
+    for i, ind in enumerate(callables):
+        _, preds[i] = eval_MSE_sol(ind, data)
+
+    return preds
 
 @ray.remote
-def fitness(individual: Callable, length: int, D: Dataset, penalty: float) -> Tuple[float, ]:
+def fitness(trees, toolbox, true_data):
+    callables = compile_individuals(toolbox, trees)
 
-    MSE, _ = eval_MSE_sol(individual, D)
+    fitnesses = [None]*len(trees)
 
-    # add penalty on length of the tree to promote simpler solutions
-    fitness = MSE + penalty*length
+    for i, ind in enumerate(callables):
+        MSE, _ = eval_MSE_sol(ind, true_data)
+        
+        # each fitness MUST be a tuple (required by DEAP)
+        fitnesses[i] = (MSE,)
 
-    # return value MUST be a tuple
-    return fitness,
+    return fitnesses
 ```
 
 3. Set and solve the symbolic regression problem.
@@ -110,22 +133,19 @@ common_params = {'penalty': penalty}
 # create the Symbolic Regression Problem object
 gpsr = gps.GPSymbolicRegressor(pset=pset, fitness=fitness.remote,
                                predict_func=predict.remote, common_data=common_params,
-                               feature_extractors=[len],
                                print_log=True, 
                                config_file_data=config_file_data)
 
-# define training Dataset object (to be used for model fitting)
+# wrap tensors corresponding to train and test data into Dataset objects (to be passed to
+# fit and predict methods)
 train_data = Dataset("D", X_train, y_train)
+test_data = Dataset("D", X_test, y_test)
 
 # solve the symbolic regression problem
 gpsr.fit(train_data)
 
-# recover the solution associated to the best individual among all the populations
-u_best = gpsr.predict(train_data)
-
-# plot the solution
-# ...
-# ...
+# compute the prediction on the test dataset using the best model found by the symbolic regression
+pred_test = gpsr.predict(test_data)
 ```
 
 A complete example notebook can be found in the `examples` directory.
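
Note on the batched signatures introduced in this commit: the contract described in step 2 (a batch of trees, a `toolbox`, and a dataset go in; a list of one-element tuples comes out) can be checked locally without Ray or DEAP. The sketch below is only an illustration under stated assumptions: the `@ray.remote` decorator is omitted, `compile_individuals` is a hypothetical stand-in for the helper of the same name used in the README, the "trees" are plain Python expression strings, and `toolbox` and `data` are simple placeholder objects rather than the real DEAP toolbox and `Dataset` classes.

```python
from types import SimpleNamespace


def compile_individuals(toolbox, trees):
    # Hypothetical stand-in: compile every tree of the batch into a callable.
    return [toolbox.compile(expr=tree) for tree in trees]


def eval_MSE_sol(individual, dataset):
    # Evaluate the model on each sample, then compute the mean squared error.
    prediction = [individual(x) for x in dataset.X]
    MSE = sum((p - y) ** 2 for p, y in zip(prediction, dataset.y)) / len(dataset.y)
    return MSE, prediction


def fitness(trees, toolbox, true_data):
    # Same shape as the README function, minus the @ray.remote decorator.
    callables = compile_individuals(toolbox, trees)
    fitnesses = [None] * len(trees)
    for i, ind in enumerate(callables):
        MSE, _ = eval_MSE_sol(ind, true_data)
        # each fitness is a one-element tuple, as DEAP requires
        fitnesses[i] = (MSE,)
    return fitnesses


# Placeholder toolbox (assumption): a "tree" is a Python expression in x.
toolbox = SimpleNamespace(compile=lambda expr: eval("lambda x: " + expr))
# Placeholder dataset (assumption): samples of y = 2*x.
data = SimpleNamespace(X=[0.0, 1.0, 2.0], y=[0.0, 2.0, 4.0])

result = fitness(["2*x", "x"], toolbox, data)
```

The first candidate reproduces the data exactly, so its fitness tuple holds a zero error, while the second one gets a strictly positive error. Returning one tuple per tree, in the same order as the input batch, is what lets the caller map the results back onto the individuals.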