diff --git a/docs/zh/api/arch.md b/docs/zh/api/arch.md
index e69bf9e75..8a5d0729f 100644
--- a/docs/zh/api/arch.md
+++ b/docs/zh/api/arch.md
@@ -21,7 +21,6 @@
           - FNO1d
           - Generator
           - HEDeepONets
-          - KAN
           - LorenzEmbedding
           - MLP
           - ModelList
@@ -38,8 +37,5 @@
           - USCNN
           - LNO
           - TGCN
-          - RegDGCNN
-          - RegPointNet
-          - IFMMLP
           show_root_heading: true
           heading_level: 3
diff --git a/docs/zh/examples/perovskite_solar_cells_nn.md b/docs/zh/examples/perovskite_solar_cells_nn.md
new file mode 100644
index 000000000..24d6e0d80
--- /dev/null
+++ b/docs/zh/examples/perovskite_solar_cells_nn.md
@@ -0,0 +1,173 @@
+# psc_NN (Machine Learning for Perovskite Solar Cells: An Open-Source Pipeline)
+
+!!! note "Notes"
+
+    1. Before starting training, make sure the dataset is placed under the `data/cleaned/` directory.
+    2. Training and evaluation require extra dependencies; install them with `pip install -r requirements.txt`.
+    3. For best performance, training on a GPU is recommended.
+
+=== "Model training command"
+
+    ``` sh
+    python psc_nn.py mode=train
+    ```
+
+=== "Model evaluation command"
+
+    ``` sh
+    # use a local pretrained model
+    python psc_nn.py mode=eval eval.pretrained_model_path="Your pdparams path"
+    ```
+
+    ``` sh
+    # or use the remote pretrained model
+    python psc_nn.py mode=eval eval.pretrained_model_path="https://paddle-org.bj.bcebos.com/paddlescience/models/PerovskiteSolarCells/solar_cell_pretrained.pdparams"
+    ```
+
+| Pretrained model | Metric |
+|:--| :--|
+| [solar_cell_pretrained.pdparams](https://paddle-org.bj.bcebos.com/paddlescience/models/PerovskiteSolarCells/solar_cell_pretrained.pdparams) | RMSE: 3.91798 |
+
+## 1. Background
+
+Solar cells are key energy devices that convert light directly into electricity through the photovoltaic effect, and performance prediction is an essential step in their design and optimization. Traditional prediction methods, however, rely on complex physical simulation and extensive experimental testing, which are costly and time-consuming and thus limit R&D efficiency.
+
+In recent years, the rapid progress of deep learning and machine learning has opened new routes to solar cell performance prediction. Machine learning can substantially shorten development cycles while reaching accuracy comparable to experiments. In perovskite solar cell research in particular, the chemical and structural diversity of the materials poses new challenges for model training. To address this, researchers typically encode material properties as fixed-length feature vectors that machine learning models can consume. Even so, the feature representations for different performance metrics still require continual refinement, and the interpretability demands on model predictions are correspondingly stricter.
+
+In this work, using the comprehensive Perovskite Database Project (PDP), which contains characterization data of perovskite solar cells, we build and evaluate several machine learning models, including XGBoost and psc_nn, focusing on predicting the short-circuit current density (Jsc). The results show that combining deep learning with hyperparameter optimization tools such as Optuna markedly improves the efficiency of solar cell design, providing a more accurate and efficient path for developing new solar cells.
+
+## 2. Model principles
+
+This section only gives a brief introduction to the solar cell performance prediction model; for the detailed theory, please read [Machine Learning for Perovskite Solar Cells: An Open-Source Pipeline](https://onlinelibrary.wiley.com/doi/10.1002/apxr.202400060).
+
+The core idea of the method is to use an artificial neural network to learn the nonlinear mapping between spectral response data and the short-circuit current density (Jsc). The overall structure of the network is shown below:
+
+![psc_nn_overview](psc_nn_overview.png)
+
+This example uses a multilayer perceptron (MLP) as the base architecture, consisting of the following parts:
+
+1. Input layer: takes the 2808-dimensional spectral response data
+2. Hidden layers: 4-6 fully connected layers, with the width of each layer tuned by Optuna
+3. Activation function: ReLU, to introduce nonlinearity
+4. Output layer: outputs the predicted Jsc value
+
+In this way, the model configuration best suited to the task at hand is found automatically, improving predictive performance.
+
+## 3. Model implementation
+
+This section explains how to implement the perovskite solar cell performance prediction model with PaddleScience. The example combines the Optuna framework for hyperparameter optimization with PaddleScience's built-in modules. To keep the walkthrough focused, only the key steps (model construction, constraint construction, and validator construction) are described; for the remaining details, please refer to the [API documentation](../api/arch.md).
+
+### 3.1 Dataset
+
+This example uses data from the Perovskite Database Project (PDP), split into the following parts:
+
+1. Training set:
+    - features: `data/cleaned/training.csv`
+    - labels: `data/cleaned/training_labels.csv`
+2. Validation set:
+    - features: `data/cleaned/validation.csv`
+    - labels: `data/cleaned/validation_labels.csv`
+
+For convenient data handling, a helper function `create_tensor_dict` builds the input and label tensor dictionaries:
+
+``` py linenums="36" title="examples/perovskite_solar_cells/psc_nn.py"
+--8<--
+examples/perovskite_solar_cells/psc_nn.py:36:42
+--8<--
+```
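+
+As a quick illustration of what this helper returns, here is a minimal sketch with toy values (illustrative only; the real inputs are the 2808-dimensional spectral response features):
+
+``` py
+import pandas as pd
+
+# two fake samples with two features each
+X = pd.DataFrame({"f0": [0.1, 0.2], "f1": [0.3, 0.4]})
+y = pd.DataFrame({"target": [21.5, 23.1]})
+
+d = create_tensor_dict(X, y)
+# d["input"] is a float32 paddle.Tensor of shape [2, 2]
+# d["label"]["target"] is a float32 paddle.Tensor of shape [2, 1]
+```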
+
+The dataset loading and preprocessing code is as follows:
+
+``` py linenums="123" title="examples/perovskite_solar_cells/psc_nn.py"
+--8<--
+examples/perovskite_solar_cells/psc_nn.py:123:143
+--8<--
+```
+
+For hyperparameter optimization, the training set is further split into a training subset and a verification subset:
+
+``` py linenums="135" title="examples/perovskite_solar_cells/psc_nn.py"
+--8<--
+examples/perovskite_solar_cells/psc_nn.py:135:140
+--8<--
+```
+
+### 3.2 Model construction
+
+This example uses PaddleScience's built-in `ppsci.arch.MLP` to build the multilayer perceptron. The model hyperparameters are tuned with the Optuna framework, mainly:
+
+1. Number of layers: 4-6
+2. Neurons per layer: 10 to input_dim/2
+3. Activation function: ReLU
+4. Input dimension: 2808 (the spectral response dimension)
+5. Output dimension: 1 (the predicted Jsc value)
+
+The model definition code is as follows:
+
+``` py linenums="104" title="examples/perovskite_solar_cells/psc_nn.py"
+--8<--
+examples/perovskite_solar_cells/psc_nn.py:104:121
+--8<--
+```
+
+### 3.3 Loss function design
+
+Since different samples in the dataset may matter differently, a weighted mean squared error loss is used: each sample is weighted by its share of the total target value, $w_i = y_i / (\sum_j y_j + \epsilon)$, so larger Jsc values receive higher weights, improving prediction accuracy on high-performance solar cells:
+
+``` py linenums="24" title="examples/perovskite_solar_cells/psc_nn.py"
+--8<--
+examples/perovskite_solar_cells/psc_nn.py:24:34
+--8<--
+```
+
+### 3.4 Constraint construction
+
+This example solves the problem in a data-driven fashion, so the supervised constraint is built with PaddleScience's built-in `SupervisedConstraint`. To reduce code duplication, a `create_constraint` function creates the constraint:
+
+``` py linenums="44" title="examples/perovskite_solar_cells/psc_nn.py"
+--8<--
+examples/perovskite_solar_cells/psc_nn.py:44:64
+--8<--
+```
+
+### 3.5 Validator construction
+
+To monitor the training process in real time, a `create_validator` function creates the evaluator:
+
+``` py linenums="66" title="examples/perovskite_solar_cells/psc_nn.py"
+--8<--
+examples/perovskite_solar_cells/psc_nn.py:66:82
+--8<--
+```
+
+### 3.6 Optimizer construction
+
+To manage the creation of the optimizer and learning rate scheduler in one place, a `create_optimizer` function is provided:
+
+``` py linenums="84" title="examples/perovskite_solar_cells/psc_nn.py"
+--8<--
+examples/perovskite_solar_cells/psc_nn.py:84:102
+--8<--
+```
+
+### 3.7 Model training and evaluation
+
+During training, the helper functions above are used to create the data dictionaries, constraint, validator, and optimizer:
+
+``` py linenums="202" title="examples/perovskite_solar_cells/psc_nn.py"
+--8<--
+examples/perovskite_solar_cells/psc_nn.py:202:215
+--8<--
+```
+
+## 4. Complete code
+
+``` py linenums="1" title="examples/perovskite_solar_cells/psc_nn.py"
+--8<--
+examples/perovskite_solar_cells/psc_nn.py
+--8<--
+```
+
+## 5. References
+
+- [Machine Learning for Perovskite Solar Cells: An Open-Source Pipeline](https://onlinelibrary.wiley.com/doi/10.1002/apxr.202400060)
diff --git a/docs/zh/examples/psc_nn_overview.png b/docs/zh/examples/psc_nn_overview.png
new file mode 100644
index 000000000..f8afd9641
Binary files /dev/null and b/docs/zh/examples/psc_nn_overview.png differ
diff --git a/examples/perovskite_solar_cells/conf/psc_nn.yaml b/examples/perovskite_solar_cells/conf/psc_nn.yaml
new file mode 100644
index 000000000..2b8ba88f0
--- /dev/null
+++ b/examples/perovskite_solar_cells/conf/psc_nn.yaml
@@ -0,0 +1,60 @@
+defaults:
+  - ppsci_default
+  - TRAIN: train_default
+  - TRAIN/ema: ema_default
+  - TRAIN/swa: swa_default
+  - EVAL: eval_default
+  - INFER: infer_default
+  - hydra/job/config/override_dirname/exclude_keys: exclude_keys_default
+  - _self_
+
+hydra:
+  run:
+    dir: outputs_psc_nn/${now:%Y-%m-%d}/${now:%H-%M-%S}/${hydra.job.override_dirname}
+  job:
+    name: ${mode}
+    chdir: false
+  callbacks:
+    init_callback:
+      _target_: ppsci.utils.callbacks.InitCallback
+  sweep:
+    dir: ${hydra.run.dir}
+    subdir: ./
+
+mode: "train"
+seed: 42
+output_dir: ${hydra:run.dir}
+
+data:
+  train_features_path: "./data/cleaned/training.csv"
+  train_labels_path: "./data/cleaned/training_labels.csv"
+  val_features_path: "./data/cleaned/validation.csv"
+  val_labels_path: "./data/cleaned/validation_labels.csv"
+
+model:
+  num_layers: 4
+  hidden_size: [128, 96, 64, 32]
+  activation: "relu"
+  input_dim: 2808
+  output_dim: 1
+
+TRAIN:
+  epochs: 10
+  search_epochs: 3
+  batch_size: 64
+  learning_rate: 0.001
+  eval_during_train: true
+  eval_freq: 5
+  save_freq: 10
+  log_freq: 50
+  lr_scheduler:
+    gamma: 0.95
+    decay_steps: 5
+    warmup_epoch: 2
+    warmup_start_lr: 1.0e-6
+
+eval:
+  batch_size: 64
+  eval_with_no_grad: true
+  pretrained_model_path: null
+  log_freq: 50
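
One note on the config above: since the example is launched through Hydra, any key in `psc_nn.yaml` can be overridden from the command line instead of editing the file. For instance (standard Hydra override syntax; the values here are illustrative):

``` sh
python psc_nn.py mode=train TRAIN.epochs=20 TRAIN.batch_size=128
```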
diff --git a/examples/perovskite_solar_cells/psc_nn.py b/examples/perovskite_solar_cells/psc_nn.py
new file mode 100644
index 000000000..92cac686d
--- /dev/null
+++ b/examples/perovskite_solar_cells/psc_nn.py
@@ -0,0 +1,364 @@
+import os
+from os import path as osp
+
+import hydra
+import numpy as np
+import optuna
+import paddle
+import pandas as pd
+from matplotlib import pyplot as plt
+from omegaconf import DictConfig
+from sklearn.metrics import mean_absolute_percentage_error
+from sklearn.metrics import mean_squared_error
+from sklearn.metrics import r2_score
+from sklearn.model_selection import train_test_split
+
+import ppsci
+from ppsci.constraint import SupervisedConstraint
+from ppsci.optimizer import lr_scheduler
+from ppsci.optimizer import optimizer
+from ppsci.solver import Solver
+from ppsci.validate import SupervisedValidator
+
+
+def weighted_loss(output_dict, target_dict, weight_dict=None):
+    """Weighted MSE: each sample is weighted by its share of the total target
+    value, so larger Jsc values contribute more to the loss.
+
+    Example (hand-checked): true = [2, 6], pred = [1, 5] gives
+    weights = [0.25, 0.75], squared errors = [1, 1], and
+    loss = (0.25 + 0.75) / 2 = 0.5.
+    """
+    pred = output_dict["target"]
+    true = target_dict["target"]
+    epsilon = 1e-06
+    n = len(true)
+    weights = true / (paddle.sum(x=true) + epsilon)
+    squared = (true - pred) ** 2
+    weighted = squared * weights
+    loss = paddle.sum(x=weighted) / n
+    return {"weighted_mse": loss}
+
+
+def create_tensor_dict(X, y):
+    """Create the input/label tensor dictionary from DataFrames."""
+    return {
+        "input": paddle.to_tensor(X.values, dtype="float32"),
+        "label": {"target": paddle.to_tensor(y.values, dtype="float32")},
+    }
+
+
+def create_constraint(input_dict, batch_size, shuffle=True):
+    """Create the supervised training constraint."""
+    return SupervisedConstraint(
+        dataloader_cfg={
+            "dataset": {
+                "name": "NamedArrayDataset",
+                "input": {"input": input_dict["input"]},
+                "label": input_dict["label"],
+            },
+            "batch_size": batch_size,
+            "sampler": {
+                "name": "BatchSampler",
+                "drop_last": False,
+                "shuffle": shuffle,
+            },
+        },
+        loss=weighted_loss,
+        output_expr={"target": lambda out: out["target"]},
+        name="train_constraint",
+    )
+
+
+def create_validator(input_dict, batch_size, name="validator"):
+    """Create a supervised validator with RMSE and MAE metrics."""
+    return SupervisedValidator(
+        dataloader_cfg={
+            "dataset": {
+                "name": "NamedArrayDataset",
+                "input": {"input": input_dict["input"]},
+                "label": input_dict["label"],
+            },
+            "batch_size": batch_size,
+        },
+        loss=weighted_loss,
+        output_expr={"target": lambda out: out["target"]},
+        metric={"RMSE": ppsci.metric.RMSE(), "MAE": ppsci.metric.MAE()},
+        name=name,
+    )
+
+
+def create_optimizer(model, optimizer_name, lr, train_cfg, data_size):
+    """Create the optimizer and learning rate scheduler."""
+    schedule = lr_scheduler.ExponentialDecay(
+        epochs=train_cfg.epochs,
+        iters_per_epoch=data_size // train_cfg.batch_size,
+        learning_rate=lr,
+        gamma=train_cfg.lr_scheduler.gamma,
+        decay_steps=train_cfg.lr_scheduler.decay_steps,
+        warmup_epoch=train_cfg.lr_scheduler.warmup_epoch,
+        warmup_start_lr=train_cfg.lr_scheduler.warmup_start_lr,
+    )()
+
+    if optimizer_name == "Adam":
+        return optimizer.Adam(learning_rate=schedule)(model)
+    elif optimizer_name == "RMSProp":
+        return optimizer.RMSProp(learning_rate=schedule)(model)
+    else:
+        return optimizer.SGD(learning_rate=schedule)(model)
+
+
+def define_model(trial, input_dim, output_dim):
+    """Build an MLP whose depth and per-layer widths are sampled from the trial."""
+    n_layers = trial.suggest_int("n_layers", 4, 6)
+    hidden_sizes = []
+    for i in range(n_layers):
+        out_features = trial.suggest_int(f"n_units_l{i}", 10, input_dim // 2)
+        hidden_sizes.append(out_features)
+
+    model = ppsci.arch.MLP(
+        input_keys=("input",),
+        output_keys=("target",),
+        num_layers=None,
+        hidden_size=hidden_sizes,
+        activation="relu",
+        input_dim=input_dim,
+        output_dim=output_dim,
+    )
+    return model
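+
+
+# Note: `optuna.create_study()` used in `train` below defaults to minimizing
+# its objective, which matches the RMSE value returned by `objective`. Each
+# trial samples a depth in [4, 6], per-layer widths in [10, input_dim // 2]
+# ([10, 1404] for the 2808-dimensional inputs), an optimizer type, and a
+# learning rate; `define_model` then materializes the sampled architecture.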
+
+
+def train(cfg: DictConfig):
+    # Read and preprocess data
+    X_train = pd.read_csv(cfg.data.train_features_path)
+    y_train = pd.read_csv(cfg.data.train_labels_path)
+    X_val = pd.read_csv(cfg.data.val_features_path)
+    y_val = pd.read_csv(cfg.data.val_labels_path)
+
+    # Normalize column names: replace "[" / "]" with "(" / ")"
+    for col in X_train.columns:
+        if "[" in col or "]" in col:
+            old_name = col
+            new_name = col.replace("[", "(").replace("]", ")")
+            X_train = X_train.rename(columns={old_name: new_name})
+            X_val = X_val.rename(columns={old_name: new_name})
+
+    # Hold out 10% of the training data for hyperparameter search
+    X_train, X_verif, y_train, y_verif = train_test_split(
+        X_train, y_train, test_size=0.1, random_state=42
+    )
+
+    for df in [X_train, y_train, X_verif, y_verif, X_val, y_val]:
+        df.reset_index(drop=True, inplace=True)
+
+    def objective(trial):
+        model = define_model(trial, cfg.model.input_dim, cfg.model.output_dim)
+
+        optimizer_name = trial.suggest_categorical(
+            "optimizer", ["Adam", "RMSProp", "SGD"]
+        )
+        lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
+
+        train_dict = create_tensor_dict(X_train, y_train)
+        verif_dict = create_tensor_dict(X_verif, y_verif)
+
+        opt = create_optimizer(model, optimizer_name, lr, cfg.TRAIN, len(X_train))
+
+        train_constraint = create_constraint(train_dict, cfg.TRAIN.batch_size)
+        verif_validator = create_validator(
+            verif_dict, cfg.eval.batch_size, "verif_validator"
+        )
+
+        solver = Solver(
+            model=model,
+            constraint={"train": train_constraint},
+            optimizer=opt,
+            validator={"verif": verif_validator},
+            output_dir=cfg.output_dir,
+            epochs=cfg.TRAIN.search_epochs,
+            iters_per_epoch=len(X_train) // cfg.TRAIN.batch_size,
+            eval_during_train=cfg.TRAIN.eval_during_train,
+            eval_freq=cfg.TRAIN.eval_freq,
+            save_freq=cfg.TRAIN.save_freq,
+            eval_with_no_grad=cfg.eval.eval_with_no_grad,
+            log_freq=cfg.TRAIN.log_freq,
+        )
+
+        solver.train()
+
+        verif_preds = solver.predict(
+            {"input": verif_dict["input"]}, return_numpy=True
+        )["target"]
+
+        verif_rmse = np.sqrt(mean_squared_error(y_verif.values, verif_preds))
+
+        return verif_rmse
+
+    study = optuna.create_study()
+    study.optimize(objective, n_trials=50)
+
+    best_params = study.best_trial.params
+    print("\nBest hyperparameters: " + str(best_params))
+
+    # Recover the optimal hidden-layer widths
+    hidden_sizes = []
+    for i in range(best_params["n_layers"]):
+        hidden_sizes.append(best_params[f"n_units_l{i}"])
+
+    # Create and train the final model
+    final_model = define_model(
+        study.best_trial, cfg.model.input_dim, cfg.model.output_dim
+    )
+    opt = create_optimizer(
+        final_model,
+        best_params["optimizer"],
+        best_params["lr"],
+        cfg.TRAIN,
+        len(X_train),
+    )
+
+    train_dict = create_tensor_dict(X_train, y_train)
+    val_dict = create_tensor_dict(X_val, y_val)
+
+    train_constraint = create_constraint(train_dict, cfg.TRAIN.batch_size)
+    val_validator = create_validator(val_dict, cfg.eval.batch_size, "val_validator")
+
+    solver = Solver(
+        model=final_model,
+        constraint={"train": train_constraint},
+        optimizer=opt,
+        validator={"valid": val_validator},
+        output_dir=cfg.output_dir,
+        epochs=cfg.TRAIN.epochs,
+        iters_per_epoch=len(X_train) // cfg.TRAIN.batch_size,
+        eval_during_train=cfg.TRAIN.eval_during_train,
+        eval_freq=cfg.TRAIN.eval_freq,
+        save_freq=cfg.TRAIN.save_freq,
+        eval_with_no_grad=cfg.eval.eval_with_no_grad,
+        log_freq=cfg.TRAIN.log_freq,
+    )
+
+    solver.train()
+
+    # Save model structure and weights
+    model_dict = {
+        "state_dict": final_model.state_dict(),
+        "hidden_size": hidden_sizes,
+        "n_layers": best_params["n_layers"],
+        "optimizer": best_params["optimizer"],
+        "lr": best_params["lr"],
+    }
+    paddle.save(
+        model_dict, os.path.join(cfg.output_dir, "checkpoints", "best_model.pdparams")
+    )
+    print(
+        "Saved model structure and weights to "
+        + os.path.join(cfg.output_dir, "checkpoints", "best_model.pdparams")
+    )
+
+    solver.plot_loss_history(by_epoch=True, smooth_step=1)
+    solver.eval()
+
+    visualize_results(solver, X_val, y_val, cfg.output_dir)
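+
+
+# The checkpoint written by `train` bundles the weights with the searched
+# architecture metadata ("hidden_size", "n_layers", "optimizer", "lr"), so
+# `evaluate` below can rebuild the exact MLP without re-running the search.
+# Illustrative usage (mirrors the loading code in `evaluate`):
+#     ckpt = paddle.load("path/to/best_model.pdparams")
+#     ckpt["hidden_size"]  # e.g. [512, 256, 64, 32]; actual widths vary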
"best_model.pdparams") + ) + print( + "Saved model structure and weights to " + + os.path.join(cfg.output_dir, "checkpoints", "best_model.pdparams") + ) + + solver.plot_loss_history(by_epoch=True, smooth_step=1) + solver.eval() + + visualize_results(solver, X_val, y_val, cfg.output_dir) + + +def evaluate(cfg: DictConfig): + # Read and preprocess data + X_val = pd.read_csv(cfg.data.val_features_path) + y_val = pd.read_csv(cfg.data.val_labels_path) + + for col in X_val.columns: + if "[" in col or "]" in col: + old_name = col + new_name = col.replace("[", "(").replace("]", ")") + X_val = X_val.rename(columns={old_name: new_name}) + + # Loading model structure and weights + print(f"Loading model from {cfg.eval.pretrained_model_path}") + model_dict = paddle.load(cfg.eval.pretrained_model_path) + hidden_size = model_dict["hidden_size"] + print(f"Loaded model structure with hidden sizes: {hidden_size}") + + model = ppsci.arch.MLP( + input_keys=("input",), + output_keys=("target",), + num_layers=None, + hidden_size=hidden_size, + activation="relu", + input_dim=cfg.model.input_dim, + output_dim=cfg.model.output_dim, + ) + + # Load model weights + model.set_state_dict(model_dict["state_dict"]) + print("Successfully loaded model weights") + + valid_dict = create_tensor_dict(X_val, y_val) + valid_validator = create_validator( + valid_dict, cfg.eval.batch_size, "valid_validator" + ) + + solver = Solver( + model=model, + output_dir=cfg.output_dir, + validator={"valid": valid_validator}, + eval_with_no_grad=cfg.eval.eval_with_no_grad, + ) + + # evaluation model + print("Evaluating model...") + solver.eval() + + # Generate prediction results + predictions = solver.predict({"input": valid_dict["input"]}, return_numpy=True)[ + "target" + ] + + # Calculate multiple evaluation indicators + rmse = np.sqrt(mean_squared_error(y_val.values, predictions)) + r2 = r2_score(y_val.values, predictions) + mape = mean_absolute_percentage_error(y_val.values, predictions) + + print("Evaluation metrics:") + print(f"RMSE: {rmse:.5f}") + print(f"R2 Score: {r2:.5f}") + print(f"MAPE: {mape:.5f}") + + # Visualization results + print("Generating visualization...") + visualize_results(solver, X_val, y_val, cfg.output_dir) + print("Evaluation completed.") + + +def visualize_results(solver, X_val, y_val, output_dir): + pred_dict = solver.predict( + {"input": paddle.to_tensor(X_val.values, dtype="float32")}, return_numpy=True + ) + val_preds = pred_dict["target"] + val_true = y_val.values + + plt.figure(figsize=(10, 6)) + plt.grid(True, linestyle="--", alpha=0.7) + plt.hist(val_true, bins=30, alpha=0.6, label="True Jsc", color="tab:blue") + plt.hist(val_preds, bins=30, alpha=0.6, label="Predicted Jsc", color="orange") + + pred_mean = np.mean(val_preds) + pred_std = np.std(val_preds) + plt.axvline(pred_mean, color="black", linestyle="--") + plt.axvline(pred_mean + pred_std, color="red", linestyle="--") + plt.axvline(pred_mean - pred_std, color="red", linestyle="--") + + val_rmse = np.sqrt(mean_squared_error(val_true, val_preds)) + plt.title(f"Distribution of True Jsc vs Pred Jsc: RMSE {val_rmse:.5f}", pad=20) + plt.xlabel("Jsc (mA/cm²)") + plt.ylabel("Counts") + plt.legend(fontsize=10) + plt.tight_layout() + plt.savefig( + osp.join(output_dir, "jsc_distribution.png"), dpi=300, bbox_inches="tight" + ) + plt.close() + + +@hydra.main(version_base=None, config_path="./conf", config_name="psc_nn.yaml") +def main(cfg: DictConfig): + if cfg.mode == "train": + train(cfg) + elif cfg.mode == "eval": + evaluate(cfg) + else: + raise 
ValueError(f"cfg.mode should in ['train', 'eval'], but got '{cfg.mode}'") + + +if __name__ == "__main__": + main() diff --git a/examples/perovskite_solar_cells/psc_nn_overview.png b/examples/perovskite_solar_cells/psc_nn_overview.png new file mode 100644 index 000000000..f8afd9641 Binary files /dev/null and b/examples/perovskite_solar_cells/psc_nn_overview.png differ diff --git a/examples/perovskite_solar_cells/requirements.txt b/examples/perovskite_solar_cells/requirements.txt new file mode 100644 index 000000000..a45c8c6d7 --- /dev/null +++ b/examples/perovskite_solar_cells/requirements.txt @@ -0,0 +1,10 @@ +paddlepaddle-gpu>=3.0.0 +paddlesci>=0.0.1 +numpy>=1.26.0 +pandas>=2.2.0 +matplotlib>=3.9.0 +scikit-learn>=1.4.0 +hydra-core>=1.3.0 +omegaconf>=2.3.0 +optuna>=4.0.0 +h5py>=3.12.0 \ No newline at end of file diff --git a/examples/perovskite_solar_cells/solar_cell_pretrained.pdparams b/examples/perovskite_solar_cells/solar_cell_pretrained.pdparams new file mode 100644 index 000000000..b22937a19 Binary files /dev/null and b/examples/perovskite_solar_cells/solar_cell_pretrained.pdparams differ diff --git a/mkdocs.yml b/mkdocs.yml index 7c5fbf09f..22a17b9ad 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -95,6 +95,7 @@ nav: - 材料科学(AI for Material): - hPINNs: zh/examples/hpinns.md - CGCNN: zh/examples/cgcnn.md + - psc_NN: zh/examples/perovskite_solar_cells_nn.md - 地球科学(AI for Earth Science): - Extformer-MoE: zh/examples/extformer_moe.md - FourCastNet: zh/examples/fourcastnet.md