Recovery from output stager failures
On a Gaea login node, run squeue -u $USER -o "%.18i %.2t %.9P %.50j". The output will look something like:
JOBID ST PARTITION NAME
207218841 R batch OM4_0125_COBALTv3_jra55_const
67717255 R dtn_f5_f6 OM4_0125_COBALTv3_jra55_const.o207218635.output.st
67719815 R dtn_f5_f6 OM4_0125_COBALTv3_jra55_const.o207218744.output.st
67718118 R dtn_f5_f6 OM4_0125_COBALTv3_jra55_const.o207218685.output.st
67719886 R dtn_f5_f6 OM4_0125_COBALTv3_jra55_const.o207218744.output.st
67718911 R dtn_f5_f6 OM4_0125_COBALTv3_jra55_const.o207218720.output.st
67718958 R dtn_f5_f6 OM4_0125_COBALTv3_jra55_const.o207218720.output.st
67720673 R dtn_f5_f6 OM4_0125_COBALTv3_jra55_const.o207218778.output.st
67720747 R dtn_f5_f6 OM4_0125_COBALTv3_jra55_const.o207218778.output.st
67721568 R dtn_f5_f6 OM4_0125_COBALTv3_jra55_const.o207218797.output.st
67721669 R dtn_f5_f6 OM4_0125_COBALTv3_jra55_const.o207218797.output.st
All output stager jobs run on the dtn_f5_f6 partition. In this case I have ten output stager jobs running. None of them are for the experiment that I am trying to restart, which is named NWA12_COBALT_2024_09.
The FRE experiment directory is located within your scratch space on f5 or f6, followed by the FRE stem, followed by the experiment name and platform. In the example case I'm using, the full path to the experiment directory is /gpfs/f6/ira-cefi/scratch/Andrew.C.Ross/fre/NWA/2024_09/NWA12_COBALT_2024_09/ncrc6.intel23-prod.
Within the experiment directory, run ls state/run/*.lock. If you see lock files left over from the output stager job that failed, you should remove them:
> ls state/run/*.lock
NWA12_COBALT_2024_09.o207177135.output.stager.19950101.A.args.lock NWA12_COBALT_2024_09.o207177135.output.stager.19960101.R.args.lock
NWA12_COBALT_2024_09.o207177135.output.stager.19950101.H.args.lock
> rm -f state/run/*.lock
Use rm with more care if you have actively running jobs in addition to the job that failed, because the wildcard also matches their lock files.
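If other jobs are still running, one option is to remove only the lock files that belong to the failed job. The sketch below lists them by model job ID. locks_for_job is a hypothetical helper name, and the file-name pattern (EXPERIMENT.oJOBID.output.stager.DATE.TYPE.args.lock) is inferred from the listing above rather than taken from FRE documentation.

```shell
# Print lock files left by a single model job, leaving other jobs' locks alone.
# locks_for_job is a hypothetical helper, not part of FRE.
locks_for_job() {
    run_dir=$1
    job_id=$2
    for f in "$run_dir"/*.o"$job_id".output.stager.*.lock; do
        # Skip the unexpanded pattern when nothing matches.
        [ -e "$f" ] && printf '%s\n' "$f"
    done
    return 0
}

# Review the list first, then remove only those files:
#   locks_for_job state/run 207177135
#   locks_for_job state/run 207177135 | xargs rm -f
```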
Stay in the same experiment directory. Load the appropriate FRE module (e.g., module load fre/test) if you haven't already. Then run output.retry state/run. This will automatically attempt to submit all output stager jobs that haven't completed.
Sometimes the output stager jobs have already been automatically retried the maximum number of times (6). In this case, output.retry will report an error. FRE keeps track of how many times it has retried a job by appending @ xferRetry++ to the file containing the arguments for the job. You can delete all past retry records and re-run output.retry by running find state/run/ -name '*.args' | xargs sed -i '/xferRetry++/d' && output.retry state/run/.
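Before deleting the retry records, it may be useful to see which argument files have actually accumulated retries. The following is a rough sketch. count_retries is a hypothetical helper name, and it assumes that each retry adds one line containing the literal text xferRetry++ to the .args file.

```shell
# Report how many retry records each args file holds, as "COUNT FILE".
# count_retries is a hypothetical helper, not part of FRE.
count_retries() {
    run_dir=$1
    find "$run_dir" -name '*.args' | while IFS= read -r f; do
        # grep -c prints the count but exits 1 when it is zero, hence "|| true".
        n=$(grep -cF 'xferRetry++' "$f" || true)
        printf '%s %s\n' "$n" "$f"
    done
}

# Typical use from the experiment directory:
#   count_retries state/run
```

Under that assumption, files reporting a count of 6 are the ones that have hit the retry limit.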
Normally, as long as the local output stager job completed, the raw files will be stored in the same FRE experiment directory. For example, my history files for this example experiment are stored within /gpfs/f6/ira-cefi/scratch/Andrew.C.Ross/fre/NWA/2024_09/NWA12_COBALT_2024_09/ncrc6.intel23-prod/archive/history. To transfer a tar file to GFDL, I would run:
cd /gpfs/f6/ira-cefi/scratch/Andrew.C.Ross/fre/NWA/2024_09/NWA12_COBALT_2024_09/ncrc6.intel23-prod/archive/history
module load gcp
gcp --batch 19940101.nc.tar gfdl:/archive/Andrew.C.Ross/fre/NWA/2024_09/NWA12_COBALT_2024_09/gfdl.ncrc6-intel23-prod/history/
Note that including --batch will submit the transfer as a job to a data transfer node. If there is a complete failure of the data transfer nodes, this option probably won't work either.
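When several history tar files need to be transferred, it can be convenient to generate the gcp commands and review them before running anything. The sketch below only prints commands. print_gcp_commands is a hypothetical helper name, and the only gcp usage it assumes is the form shown above (gcp --batch SOURCE gfdl:DEST/).

```shell
# Print one gcp command per history tar file without running anything.
# print_gcp_commands is a hypothetical helper, not part of gcp or FRE.
print_gcp_commands() {
    src_dir=$1
    dest_dir=$2
    for f in "$src_dir"/*.nc.tar; do
        # Skip the unexpanded pattern when no tar files are present.
        [ -e "$f" ] || continue
        # Strip any trailing slash from the destination, then add exactly one.
        printf 'gcp --batch %s gfdl:%s/\n' "$f" "${dest_dir%/}"
    done
}

# Typical use from the history directory, after module load gcp:
#   print_gcp_commands . /archive/Andrew.C.Ross/fre/NWA/2024_09/NWA12_COBALT_2024_09/gfdl.ncrc6-intel23-prod/history
```

Printing the commands first makes it easy to check the destination path before any transfer job is submitted.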
After manually transferring the history files from Gaea to PPAN, you may also need to re-run the FREPP step to generate the post-processed files.
First, locate your backup XML file, which should have been automatically copied from Gaea to PPAN and is typically located in the ncrc folder in your home directory:
cd ~/ncrc/THE_PATH_NAME_OF_YOUR_FRE_XML_ON_GAEA/
You should find your FRE XML file there. Then, use the following command to run FREPP manually:
module load fre/bronx-22 # (or fre/bronx-21)
frepp -t YEAR -s -d /archive/$USER/FRE_STEM/EXP_NAME/YOUR_PLATFORM/history -x YOUR_FRE_XML.xml -p PLATFORM -T YOUR_TARGET YOUR_EXP_NAME
The above command should automatically submit the PP jobs.