Recovery from output stager failures

1. Double check that the output stager is not currently running.

On a Gaea login node, run squeue -u $USER -o "%.18i %.2t %.9P %.50j". The output will look something like

             JOBID ST PARTITION                                                NAME
         207177146  R     batch                                NWA25_cobalt_2024_08
          67197158  R   rdtn_c5  NWA12_COBALT_2024_09_nudgets-90d.o207177138.output
          67197331  R   rdtn_c5  NWA12_COBALT_2024_08_repeating_1992_1993.o20717713

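All output stager jobs run on the ldtn_c5 or rdtn_c5 partition. In this case I have two output stager jobs running. Neither is for the experiment that I am trying to restart, which is named NWA12_COBALT_2024_09 (the first one belongs to NWA12_COBALT_2024_09_nudgets-90d, a different experiment whose name starts with the same prefix).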
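The check above can be sketched as a small filter over the squeue output. This is only an illustration: the function name stager_jobs_for is hypothetical, and it assumes that output stager job names start with the experiment name followed by ".o", as in the listing above.

```shell
# Hypothetical helper: read squeue output on stdin and print the job ID and
# name of every output stager job that belongs to the given experiment.
# Assumes the column order produced by -o "%.18i %.2t %.9P %.50j".
stager_jobs_for() {
    awk -v name="$1" '
        # Keep only jobs on the data transfer partitions whose name begins
        # with "<experiment>.o"; this excludes experiments that merely share
        # a prefix, such as NWA12_COBALT_2024_09_nudgets-90d.
        ($3 == "ldtn_c5" || $3 == "rdtn_c5") && index($4, name ".o") == 1 {
            print $1, $4
        }
    '
}

# Usage on a Gaea login node (squeue is only available there):
#   squeue -u "$USER" -o "%.18i %.2t %.9P %.50j" | stager_jobs_for NWA12_COBALT_2024_09
# No output means no stager job for that experiment is currently queued or running.
```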

2. cd to the FRE experiment directory.

The FRE experiment directory is located within your scratch space on f5 or f6. Its path consists of the scratch directory, followed by the FRE stem, the experiment name, and the platform. In the example case I'm using, the full path to the experiment directory is /gpfs/f6/ira-cefi/scratch/Andrew.C.Ross/fre/NWA/2024_09/NWA12_COBALT_2024_09/ncrc6.intel23-prod
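As a rough sketch of how the path is put together, the pieces can be joined like this. The function name fre_exp_dir is hypothetical, and the literal "fre" component is an assumption taken from the example path above; your own setup may differ.

```shell
# Hypothetical helper: build the experiment directory path from its parts.
#   $1 = scratch directory, $2 = FRE stem, $3 = experiment name, $4 = platform
# The fixed "fre" component is inferred from the example path, not from FRE itself.
fre_exp_dir() {
    printf '%s/fre/%s/%s/%s\n' "$1" "$2" "$3" "$4"
}

# Example with the values used on this page:
#   cd "$(fre_exp_dir /gpfs/f6/ira-cefi/scratch/Andrew.C.Ross NWA/2024_09 NWA12_COBALT_2024_09 ncrc6.intel23-prod)"
```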

3. Check if there are any lock files and remove if necessary.

Within the experiment directory, run ls state/run/*.lock. If you see lock files left over from the output stager job that failed, remove them:

> ls state/run/*.lock
NWA12_COBALT_2024_09.o207177135.output.stager.19950101.A.args.lock  NWA12_COBALT_2024_09.o207177135.output.stager.19960101.R.args.lock
NWA12_COBALT_2024_09.o207177135.output.stager.19950101.H.args.lock

> rm -f state/run/*.lock

If you have other jobs actively running in addition to the job that failed, do not use the wildcard: remove only the lock files that belong to the failed job.
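A more selective removal could look like the sketch below. The function name remove_stager_locks is hypothetical, and it assumes the lock file naming shown in the listing above (experiment name, ".o", job ID, then ".output.stager.").

```shell
# Hypothetical helper: remove only the lock files of one failed job.
#   $1 = directory holding the lock files (e.g. state/run)
#   $2 = experiment name and job ID (e.g. NWA12_COBALT_2024_09.o207177135)
remove_stager_locks() {
    run_dir=$1
    job_tag=$2
    for lock in "$run_dir"/"$job_tag".output.stager.*.lock; do
        # Skip the unexpanded pattern when nothing matches.
        [ -e "$lock" ] || continue
        rm -f -- "$lock"
        echo "removed $lock"
    done
}

# Usage from the experiment directory:
#   remove_stager_locks state/run NWA12_COBALT_2024_09.o207177135
```

Lock files from other job IDs in the same directory are left untouched.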

4. Run output.retry.

Stay in the same experiment directory. Load the appropriate fre module (e.g., module load fre/test) if you haven't already. Then run output.retry state/run. This will automatically attempt to submit all output stager jobs that haven't completed.

5. If needed, remove records of previous attempts to retry the output stager.

Sometimes the output stager jobs have already been automatically retried the maximum number of times (6). In this case, output.retry will report an error. FRE keeps track of how many times it has retried a job by appending @ xferRetry++ to the file containing arguments for the job. You can delete all past records of retrying and re-run output.retry by running find state/run/ -name '*.args' | xargs sed -i '/xferRetry++/d' && output.retry state/run/.
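Before deleting the retry records, it can help to see how many retries each job has accumulated. The sketch below only reads the files. The function name retry_counts is hypothetical, and it assumes each retry adds one line containing xferRetry++ to the .args file, as described above.

```shell
# Hypothetical helper: print "<retry count> <file>" for every .args file
# under the given directory, counting lines that contain xferRetry++.
retry_counts() {
    find "$1" -name '*.args' | while IFS= read -r f; do
        # grep -c prints 0 but exits non-zero when nothing matches, hence "|| true".
        n=$(grep -cF 'xferRetry++' "$f" || true)
        printf '%s %s\n' "$n" "$f"
    done
}

# Usage from the experiment directory:
#   retry_counts state/run
```

Files that report the maximum number of retries are the ones output.retry will refuse to resubmit until the records are removed.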

6. If all else fails, transfer the files manually with gcp.

Normally, as long as the local output stager job completed, the raw files will be stored in the same FRE experiment directory. For example, my history files for this example experiment are stored within /gpfs/f6/ira-cefi/scratch/Andrew.C.Ross/fre/NWA/2024_09/NWA12_COBALT_2024_09/ncrc6.intel23-prod/archive/history. To transfer a tar file to GFDL, I would run:

cd /gpfs/f6/ira-cefi/scratch/Andrew.C.Ross/fre/NWA/2024_09/NWA12_COBALT_2024_09/ncrc6.intel23-prod/archive/history
module load gcp
gcp --batch 19940101.nc.tar gfdl:/archive/Andrew.C.Ross/fre/NWA/2024_09/NWA12_COBALT_2024_09/gfdl.ncrc6-intel23-prod/history/

Note that including --batch will submit it as a job to a data transfer node. If there is a complete failure with the data transfer nodes, this option probably won't work either.
