Recovery from output stager failures
On a Gaea login node, run `squeue -u $USER -o "%.18i %.2t %.9P %.50j"`. The output will look something like:
JOBID ST PARTITION NAME
207177146 R batch NWA25_cobalt_2024_08
67197158 R rdtn_c5 NWA12_COBALT_2024_09_nudgets-90d.o207177138.output
67197331 R rdtn_c5 NWA12_COBALT_2024_08_repeating_1992_1993.o20717713
All output stager jobs run on the `ldtn_c5` or `rdtn_c5` partition. In this case, I have two output stager jobs running. Neither is for the experiment that I am trying to restart, which is named `NWA12_COBALT_2024_09`.
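The check above can also be scripted. The following is a minimal sketch, not part of FRE: `stager_jobs_for` is a hypothetical helper name, and it assumes that output stager job names start with the experiment name followed by `.o` and a job ID. That naming pattern is inferred from the example output, so verify it against your own queue before relying on it.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of FRE): read squeue output on stdin and print
# only the output stager jobs that belong to one experiment.
# Assumption: stager job names look like <experiment>.o<jobid>.output...
stager_jobs_for() {
    local expt="$1"
    awk -v expt="$expt" '
        # Column 3 is the partition, column 4 is the job name.
        ($3 == "ldtn_c5" || $3 == "rdtn_c5") && index($4, expt ".o") == 1
    '
}

# Only query the scheduler where squeue is available (e.g., a Gaea login node).
if command -v squeue >/dev/null 2>&1; then
    squeue -u "$USER" -h -o "%.18i %.2t %.9P %.50j" |
        stager_jobs_for NWA12_COBALT_2024_09
fi
```

No output means that no output stager jobs for that experiment are queued or running. With the example output, nothing is printed: the first stager job name continues with `_nudgets-90d` after `NWA12_COBALT_2024_09`, and the second belongs to a `2024_08` experiment.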
The FRE experiment directory is in your scratch space on f5 or f6, under `fre`, followed by the FRE stem, the experiment name, and the platform. In my example, the full path to the experiment directory is `/gpfs/f6/ira-cefi/scratch/Andrew.C.Ross/fre/NWA/2024_09/NWA12_COBALT_2024_09/ncrc6.intel23-prod`.
Within the experiment directory, run `ls state/run/*.lock`. If you see lock files left over from the output stager job that failed, you should remove them:
> ls state/run/*.lock
NWA12_COBALT_2024_09.o207177135.output.stager.19950101.A.args.lock NWA12_COBALT_2024_09.o207177135.output.stager.19960101.R.args.lock
NWA12_COBALT_2024_09.o207177135.output.stager.19950101.H.args.lock
> rm -f state/run/*.lock
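Use `rm` with more care if you have actively running jobs in addition to the job that failed, because the wildcard removes every lock file in `state/run`, including any that belong to those jobs.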
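One way to be more selective is to remove only the lock files for the job that failed. This is a minimal sketch under the assumption that lock file names contain `.o<jobid>.`, as in the listing above; `remove_locks_for_job` is a hypothetical helper name, not an FRE command.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of FRE): remove only the lock files whose names
# contain ".o<jobid>." for the given job ID, leaving other jobs' locks alone.
remove_locks_for_job() {
    local run_dir="$1" job_id="$2" lock
    for lock in "$run_dir"/*.o"$job_id".*.lock; do
        [ -e "$lock" ] || continue    # the glob matched nothing
        echo "removing $lock"
        rm -f -- "$lock"
    done
}

# Example, run from the experiment directory:
# remove_locks_for_job state/run 207177135
```

The job ID in the example comment is the one that appears in the lock file names listed above.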
Stay in the same experiment directory. Load the appropriate FRE module (e.g., `module load fre/test`) if you haven't already. Then run `output.retry state/run`. This will automatically attempt to submit all output stager jobs that haven't completed.
Sometimes the output stager jobs have already been automatically retried the maximum number of times (6). In this case, `output.retry` will report an error. FRE keeps track of how many times it has retried a job by appending `@ xferRetry++` to the file containing the arguments for the job. You can delete all records of past retries and re-run `output.retry` by running `find state/run/ -name '*.args' | xargs sed -i '/xferRetry++/d' && output.retry state/run/`.
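Before deleting the retry records, it can help to see how many retries each job has accumulated. The sketch below is an illustration, not an FRE tool: `count_retries` is a hypothetical helper, and it assumes each retry adds one line containing `xferRetry++` to the `.args` file, which is inferred from the `sed` command above deleting whole lines. It only looks at `.args` files directly inside the given directory.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of FRE): print "<count> <file>" for each
# *.args file directly inside the given directory, where <count> is the number
# of lines containing the literal string "xferRetry++".
count_retries() {
    local run_dir="$1" args_file
    for args_file in "$run_dir"/*.args; do
        [ -e "$args_file" ] || continue    # the glob matched nothing
        # -F treats the pattern as a fixed string; -c prints the match count.
        printf '%s %s\n' "$(grep -Fc 'xferRetry++' "$args_file")" "$args_file"
    done
}

# Example, run from the experiment directory:
# count_retries state/run
```

A count of 6 for a file would correspond to the maximum number of automatic retries mentioned above.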
Normally, as long as the local output stager job completed, the raw files will be stored in the same FRE experiment directory. For example, my history files for this example experiment are stored within `/gpfs/f6/ira-cefi/scratch/Andrew.C.Ross/fre/NWA/2024_09/NWA12_COBALT_2024_09/ncrc6.intel23-prod/archive/history`. To transfer a tar file to GFDL, I would run:
cd /gpfs/f6/ira-cefi/scratch/Andrew.C.Ross/fre/NWA/2024_09/NWA12_COBALT_2024_09/ncrc6.intel23-prod/archive/history
module load gcp
gcp --batch 19940101.nc.tar gfdl:/archive/Andrew.C.Ross/fre/NWA/2024_09/NWA12_COBALT_2024_09/gfdl.ncrc6-intel23-prod/history/
Note that including `--batch` will submit the transfer as a job to a data transfer node. If there is a complete failure of the data transfer nodes, this option probably won't work either.
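If several tar files need to be transferred, the same `gcp --batch` command can be generated for each one. This is a rough sketch with a hypothetical helper name, `print_gcp_commands`; it only prints the commands so they can be reviewed first, and it assumes file names without spaces or other special characters.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print (but do not run) one gcp command per *.tar file
# in a local directory, so the commands can be reviewed before transferring.
print_gcp_commands() {
    local src_dir="$1" dest="$2" tarfile
    for tarfile in "$src_dir"/*.tar; do
        [ -e "$tarfile" ] || continue    # the glob matched nothing
        printf 'gcp --batch %s %s\n' "$tarfile" "$dest"
    done
}

# Example, run from the archive/history directory:
# print_gcp_commands . gfdl:/archive/Andrew.C.Ross/fre/NWA/2024_09/NWA12_COBALT_2024_09/gfdl.ncrc6-intel23-prod/history/
```

After checking the printed commands, run them in a shell where `module load gcp` has already been done.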