git clone <url>/hspn_surrogate_models
cd hspn_surrogate_models
pip install -e .
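After installation, the hspn-* console scripts should be on your PATH; a quick way to verify the install is to print any of their help messages:
hspn-prepare --help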
Used to preprocess data and create an H5 dataset for use by the models.
hspn-prepare data_dir=./data branch_files=[f_total.npy] trunk_files=[xyz.npy] output_files=[y_total.npy] output_path=./data/don_dataset.h5
Note: There are more options; use --cfg=job to see them, and read the CLI documentation below to learn how to use this CLI.
This corresponds to the directory structure:
data/
| f_total.npy
| xyz.npy
| y_total.npy
| don_dataset.h5 # created
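To sanity-check the generated file, you can list its contents with h5py (assuming h5py is available in your environment; the actual group/dataset names depend on how the prepare step writes them):
# hypothetical quick check of the prepared dataset; prints every group/dataset name
python -c "import h5py; f = h5py.File('data/don_dataset.h5', 'r'); f.visit(print); f.close()"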
hspn-train
Note: There are more options; use --cfg=job to see them, and read the CLI documentation below to learn how to use this CLI.
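For example, a single training run with options overridden at the CLI (n_epochs is just an illustrative override taken from the HPO example below; use --cfg=job to see the options your config actually exposes):
hspn-train n_epochs=100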
First, build the apptainer image with make hspn.sif
Next, parameterize a sweep by editing a configuration file (see train_hpo*.yaml for examples).
Finally, launch...
ACCT=XXXXXXXX cluster/hpo-pbs.sh
See the PBS launch script for documentation on configuration options.
sbatch --account=XXXXXXXX cluster/hpo.slurm [<args>]
See the SLURM batch script for documentation on configuration options. Args can be passed to the train task as usual, e.g.:
sbatch --account=XXXXXXXX cluster/hpo.slurm comm_backend=gloo n_epochs=100
The following applies to all CLI applications in hspn.
To see all available options:
# hspn-<train/prepare/etc> below stands in for any hspn CLI invocation
hspn-<train/prepare/etc> --help
hspn-<train/prepare/etc> --cfg=job # or --cfg=all
It is recommended to check the final config the job will run with before launching it:
hspn-<train/prepare/etc> --cfg=job # or --cfg=all for verbose information
hspn-<train/prepare/etc> --cfg=job --resolve # resolves variable references in the config (resolution always happens at runtime, so this shows the final config the job will use)
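Overrides can be combined with these flags to preview exactly what a modified job would run (n_epochs here is an illustrative override):
hspn-train n_epochs=100 --cfg=job --resolve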
Each hspn CLI application can be invoked three ways. Using the prepare
application as an example:
- Directly:
python src/hspn/prepare.py
Cons: needs the exact filepath, so it depends on your current working directory. Pros: supports shell completion, so it is good for interactive experimentation (see below).
- Module:
python -m hspn.prepare
Cons: no shell completion. Pros: can be run anywhere as long as hspn is installed.
- Shortcut:
hspn-prepare
This is an alias for option (2) and is installed by pip in $HOME/.local/bin/. Cons: no shell completion, and it may not work in containers where $HOME/.local/bin is not in $PATH (see the PATH note after this list). Pros: can be run anywhere as long as hspn is installed, and hspn commands are easy to discover via hspn-<TAB><TAB>.
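If $HOME/.local/bin is not on your PATH (e.g., inside a container), you can add it for the current shell:
export PATH="$HOME/.local/bin:$PATH"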
For interactive experimentation it is recommended to use option (1) above and take advantage of shell completion, which can be installed with:
hspn-<train/prepare/etc> --shell-completion install=<bash/zsh/fish>
# for a useful shorthand version:
hspn-<train/prepare/etc> -sc install=$(basename $SHELL)
To install completion for train and prepare (these lines could be placed in ~/.zshrc, ~/.bashrc, etc.):
hspn-train -sc install=$(basename $SHELL)
hspn-prepare -sc install=$(basename $SHELL)
Now, you can get autocomplete while setting configuration options! Remember that you must invoke the script by its file path (option 1) for autocomplete to work. Try:
python src/hspn/train.py model.<TAB><TAB>
Note: depending on your machine completion may lag a bit.
If you encounter an error such as:
ValueError: CategoricalDistribution does not support dynamic value space
This is likely because there is an Optuna DB persisted to disk (e.g., Redis) that already has a study with the same name you are using. You have changed the search space (rather than just resuming the study) and now there is a mismatch.
There are a few ways to address it:
- Use a different study name, either in the config file, at the CLI, or with the environment variable OPTUNA_STUDY_NAME, which gets passed through to the config via something like study_name: ${oc.env:STUDY_NAME}
- Delete the old study (see the example after this list)
- Delete the entire database if you just want to start over (it may be at .redis/ depending on the configuration; check the launch script if not)
- Don't use the disk-persisted Optuna DB. This feature is optional and not vital for running Optuna sweeps. The example train_hpo_optuna.yaml sets hydra.sweeper.storage=${oc.env:OPTUNA_STORAGE_URL,null}, where OPTUNA_STORAGE_URL is set at launch. If this value is null then an in-memory store is used and nothing is persisted to disk.
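For example, an old study can be deleted with the Optuna CLI (the study name and storage URL below are placeholders; use whatever your launch script configured):
optuna delete-study --study-name <study_name> --storage "$OPTUNA_STORAGE_URL"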
If you encounter a build error such as:
No space left on device
Build in a sandbox with --sandbox, then convert the sandbox to an image with apptainer build image.sif image.sif/.
For example, instead of
# Standard build:
apptainer build --fakeroot --bind $(pwd):/workspace hspn.sif cluster/hspn.def
use a sandbox:
# Sandbox build:
apptainer build --fakeroot --bind "$(pwd):/workspace" --sandbox hspn.sif/ cluster/hspn.def
apptainer build --fakeroot hspn.sif hspn.sif/
Apptainer/Singularity does not implement layer caching like Docker, so keeping a persistent sandbox may be of interest to reduce build time during development. For a persistent sandbox, simply give it a different name:
# Persistent sandbox build
apptainer build --fakeroot --bind "$(pwd):/workspace" --sandbox hspn.sandbox/ cluster/hspn.def
apptainer build --fakeroot hspn.sif hspn.sandbox/
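To iterate on a persistent sandbox without rebuilding it from scratch, Apptainer's build --update flag reruns the definition over the existing sandbox (sandbox targets only; verify availability with apptainer build --help on your system):
# Rebuild the persistent sandbox in place
apptainer build --update --fakeroot --bind "$(pwd):/workspace" --sandbox hspn.sandbox/ cluster/hspn.def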