
Recovery from output stager failures

Andrew Ross edited this page Sep 5, 2024 · 4 revisions

1. Double-check that the output stager is not currently running.

On a Gaea login node, run squeue -u $USER -o "%.18i %.2t %.9P %.50j". The output will look something like

             JOBID ST PARTITION                                                NAME
         207177146  R     batch                                NWA25_cobalt_2024_08
          67197158  R   rdtn_c5  NWA12_COBALT_2024_09_nudgets-90d.o207177138.output
          67197331  R   rdtn_c5  NWA12_COBALT_2024_08_repeating_1992_1993.o20717713

All output stager jobs run on the ldtn_c5 or rdtn_c5 partition. In this case I have two output stager jobs running. Neither is for the experiment that I am trying to restart, which is named NWA12_COBALT_2024_09.
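When many jobs are queued, the check can be narrowed to one experiment. The sketch below filters squeue output read from standard input. The function name stager_jobs_for is hypothetical, and it assumes that output stager job names begin with the experiment name followed by .o and a job ID, as the job and lock file names in this example suggest.

```shell
# Hypothetical helper: keep only output stager jobs for one experiment.
# Assumes the squeue format used above: field 3 = partition, field 4 = job name.
# Assumes stager job names start with "<experiment>.o<jobid>".
stager_jobs_for() {
    expt="$1"
    awk -v expt="$expt" '
        ($3 == "ldtn_c5" || $3 == "rdtn_c5") && index($4, expt ".o") == 1
    '
}

# Usage on a Gaea login node (prints nothing if no stager job is running):
# squeue -u "$USER" -o "%.18i %.2t %.9P %.50j" | stager_jobs_for NWA12_COBALT_2024_09
```

Matching on the exact prefix plus .o avoids false hits from experiments with longer names, such as NWA12_COBALT_2024_09_nudgets-90d in the listing above.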

2. cd to the fre experiment directory.

The fre experiment directory will be within your scratch space on f5 or f6, followed by the FRE stem, followed by the experiment name and platform. In the example case I'm using, the full path to the experiment directory is /gpfs/f6/ira-cefi/scratch/Andrew.C.Ross/fre/NWA/2024_09/NWA12_COBALT_2024_09/ncrc6.intel23-prod
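As a sketch, the example path can be assembled from its parts. The variable names are hypothetical, and the split of the path into scratch space, stem, experiment, and platform is my reading of the description above; substitute your own values.

```shell
# Hypothetical variable names; values are taken from the example path above.
scratch=/gpfs/f6/ira-cefi/scratch/Andrew.C.Ross   # scratch space on f6
stem=fre/NWA/2024_09                              # FRE stem (assumed split)
expt=NWA12_COBALT_2024_09                         # experiment name
platform=ncrc6.intel23-prod                       # platform and target

expdir="$scratch/$stem/$expt/$platform"
echo "$expdir"

# Then change into it:
# cd "$expdir"
```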

3. Check if there are any lock files and remove if necessary.

Within the experiment directory, run ls state/run/*.lock. If you see lock files left over from the output stager job that failed, remove them:

> ls state/run/*.lock
NWA12_COBALT_2024_09.o207177135.output.stager.19950101.A.args.lock  NWA12_COBALT_2024_09.o207177135.output.stager.19960101.R.args.lock
NWA12_COBALT_2024_09.o207177135.output.stager.19950101.H.args.lock

> rm -f state/run/*.lock
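A slightly more cautious variant prints each lock file as it removes it, so there is a record of what was deleted. This is only a sketch; remove_stager_locks is a hypothetical name, and it should be run only after confirming that no output stager job for the experiment is still running.

```shell
# Hypothetical helper: remove lock files in a run directory, logging each one.
remove_stager_locks() {
    rundir="${1:-state/run}"
    for f in "$rundir"/*.lock; do
        [ -e "$f" ] || continue   # the glob matched nothing
        echo "removing $f"
        rm -f -- "$f"
    done
}

# Usage from the experiment directory:
# remove_stager_locks state/run
```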

4. Run output.retry.

Stay in the same experiment directory. Load the appropriate fre module (e.g., module load fre/test) if you haven't already. Then run output.retry state/run. This will automatically attempt to submit all output stager jobs that haven't completed.

5. If needed, remove records of previous attempts to retry the output stager.

Sometimes the output stager jobs have already been automatically retried the maximum number of times (6). In this case, output.retry will report an error. FRE keeps track of how many times it has retried a job by appending @ xferRetry++ to the file containing the arguments for the job. To delete all records of past retries and then re-run output.retry, run find state/run/ -name '*.args' | xargs sed -i '/xferRetry++/d' && output.retry state/run/.
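Before deleting the retry records, it can help to see how many retries each job has used. The sketch below counts lines containing xferRetry++ in each args file. The function name count_retries is hypothetical, and it assumes each retry adds one such line, which is my reading of the description above.

```shell
# Hypothetical helper: print "<retry count> <file>" for every .args file.
# Assumes each recorded retry is one line containing the text xferRetry++.
count_retries() {
    rundir="${1:-state/run}"
    find "$rundir" -name '*.args' | while IFS= read -r f; do
        # grep -c exits non-zero when the count is 0, so ignore its status.
        n=$(grep -cF 'xferRetry++' "$f" || true)
        printf '%s %s\n' "$n" "$f"
    done
}

# Usage from the experiment directory:
# count_retries state/run
```

Files reporting a count of 6 are presumably the ones that have hit the retry limit.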

6. If all else fails, transfer the files manually with gcp.

Normally, as long as the local output stager job completed, the raw files will be stored in the same FRE experiment directory. For example, my history files for this example experiment are stored within /gpfs/f6/ira-cefi/scratch/Andrew.C.Ross/fre/NWA/2024_09/NWA12_COBALT_2024_09/ncrc6.intel23-prod/archive/history. To transfer a tar file to GFDL, I would run:

cd /gpfs/f6/ira-cefi/scratch/Andrew.C.Ross/fre/NWA/2024_09/NWA12_COBALT_2024_09/ncrc6.intel23-prod/archive/history
module load gcp
gcp --batch 19940101.nc.tar gfdl:/archive/Andrew.C.Ross/fre/NWA/2024_09/NWA12_COBALT_2024_09/gfdl.ncrc6-intel23-prod/history/

Note that including --batch submits the transfer as a job to a data transfer node. If there is a complete failure with the data transfer nodes, this option probably won't work either.
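If several tar files need transferring, one option is to generate the gcp commands first and review them before running anything. The sketch below only prints commands and does not call gcp. The function name print_gcp_commands is hypothetical, and the destination path must be supplied by you.

```shell
# Hypothetical helper: print one "gcp --batch" command per .nc.tar file.
# File names are printed without the directory, so run the printed
# commands from inside the source directory, as in the example above.
print_gcp_commands() {
    src_dir="$1"
    dest="$2"
    for f in "$src_dir"/*.nc.tar; do
        [ -e "$f" ] || continue   # the glob matched nothing
        printf 'gcp --batch %s %s\n' "$(basename "$f")" "$dest"
    done
}

# Usage from the archive/history directory, after module load gcp:
# print_gcp_commands . gfdl:/archive/Andrew.C.Ross/fre/NWA/2024_09/NWA12_COBALT_2024_09/gfdl.ncrc6-intel23-prod/history/
```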
