
MG Setup (or refresh) fails with "too many heavy quark residual restarts" #1323

@kostrzewa

Description

This is a very strange issue that has come up quite recently and which might require another bisection, unless there's a smoking gun that I'm missing or it's simply an instance of me doing something stupid.

Somewhat randomly, QUDA, as used in tmLQCD's HMC, is having trouble either generating the MG setup or refreshing it. I say randomly because it sometimes works fine, which suggests that uninitialised memory is being read or that memory holding a param struct is being overwritten. It's also strange that the exact same version of both tmLQCD and QUDA sometimes works / has worked for prolonged periods of time. Perhaps there is some bad interaction with the driver.

I have been able to reproduce the issue on two machines: Juwels Booster and a small GPU Cluster at Bonn University sporting A40s.

  • Juwels Booster seems to run: Driver Version: 515.65.01 CUDA Version: 11.7 (from nvidia-smi)

  • QUDA and tmLQCD are compiled with:

    • GCC/11.2.0
    • OpenMPI/4.1.2
    • CUDA/11.5
  • The A40 cluster runs: Driver Version: 515.48.07 CUDA Version: 11.7 (from nvidia-smi)

  • QUDA and tmLQCD are compiled with:

    • GCC/11.2.0
    • OpenMPI/4.1.2-GCC-11.2.0
    • UCX-CUDA/1.11.2-GCCcore-11.2.0-CUDA-11.4.1
    • CUDA/11.4.1

Juwels Booster

I've observed the following behaviour with 227ff8c (commit from the end of March) on Juwels Booster. I have also reproduced the same behaviour with 64998f1 (mid-September), so if it's a QUDA problem, it is one that has been present for quite a while.

  1. The MG setup at the beginning of the trajectory is generated correctly:
# TM_QUDA: Performing MG Preconditioner Setup for gauge_id: 0.000000
MG level 0 (GPU): Using randStateMRG32k3a
MG level 0 (GPU): Creating a CG solver
MG level 0 (GPU): Running vectors setup on level 0 iter 1 of 1
MG level 0 (GPU): Tuned block=(800,1,1), grid=(172,1,1), shared_bytes=0, aux=(-1,-1,-1,-1) giving 514.12 Gflop/s, 1028.23 GB/s for N4quda4blas5Norm2IdfEE with GPU-offline,nParity=2,vol=442368,precision=4,order=4,Ns=4,Nc=3,TwistFlavour=1
MG level 0 (GPU): Initial guess = 8.49318e+07
MG level 0 (GPU): Initial rhs = 0
[...]
MG level 0 (GPU): CG:   353 iterations, <r,r> = 1.611368e-05, |r|/|b| = 4.972414e-07
MG level 0 (GPU): CG: Reliable updates = 7
MG level 0 (GPU): CG: Convergence at 353 iterations, L2 relative residual: iterated = 4.972414e-07, true = 4.972414e-07 (requested = 5.000000e-07)
[...]
  2. The first time the refresh is triggered, I get:
# TM_QUDA: Refreshing MG Preconditioner Setup for gauge_id: 0.027778
MG level 0 (GPU): Resetting level 0
MG level 0 (GPU): Creating a CG solver
MG level 0 (GPU): Running vectors setup on level 0 iter 1 of 1
MG level 0 (GPU): Initial guess = 1
MG level 0 (GPU): Initial rhs = 0
MG level 0 (GPU): CG:     0 iterations, <r,r> = 2.219844e-03, |r|/|b| = 1.000000e+00
MG level 0 (GPU): CG:     1 iterations, <r,r> = 2.384536e-04, |r|/|b| = 3.277484e-01
MG level 0 (GPU): CG:     2 iterations, <r,r> = 5.302983e-05, |r|/|b| = 1.545606e-01
MG level 0 (GPU): WARNING: CG: solver exiting due to too many heavy quark residual restarts (0/-1325988027)
MG level 0 (GPU): CG: Reliable updates = 0

and the job just hangs there. I take the -1325988027 to mean that something has not been properly initialised. I'm also quite surprised that the heavy quark residual plays a role here at all, because I certainly don't knowingly set anything that would trigger its use, but I might be missing some QUDA-internal condition.
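
For reference, the kind of defensive initialisation I have in mind on the tmLQCD side is sketched below. The field names residual_type, max_hq_res_increase and max_hq_res_restart_total are taken from QUDA's quda.h, but whether these are actually the fields that end up holding the garbage value is precisely what I'm unsure about, so please treat this as a guess rather than a diagnosis.

/* Hypothetical helper on the tmLQCD side: explicitly initialise the fields
 * that feed the heavy-quark-residual restart logic instead of relying on
 * whatever newQudaInvertParam() leaves behind. */
#include <quda.h>

static QudaInvertParam make_mg_setup_inv_param(void)
{
  QudaInvertParam inv_param = newQudaInvertParam();

  /* request a plain L2 relative residual, so that (as far as I understand
   * the CG code) the heavy-quark restart branch should never be taken */
  inv_param.residual_type = QUDA_L2_RELATIVE_RESIDUAL;

  /* give the heavy-quark restart counters explicit, sane values anyway,
   * in case they are read regardless; whether these are what ends up
   * holding -1325988027 is only an assumption */
  inv_param.max_hq_res_increase = 1;
  inv_param.max_hq_res_restart_total = 10;

  /* ... the rest of the usual tmLQCD -> QUDA parameter translation ... */

  return inv_param;
}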

A40 cluster

On the A40 cluster, instead, not even the setup runs through:

# TM_QUDA: Performing MG Preconditioner Setup for gauge_id: 0.000000
[..]
MG level 0 (GPU): WARNING: CG: solver exiting due to too many heavy quark residual restarts (0/-2147483648)
MG level 0 (GPU): CG: Reliable updates = 0

Comments

Unfortunately, the issue seems to be very hard to reproduce. I thought I was seeing a pattern: when I set the MG verbosity to (verbose, summarize, summarize) on levels 0, 1 and 2, respectively, the jobs seemed to run through fine, but in the situation quoted above the MG verbosity was set exactly like this and the CG in the MG refresh still hung... Similarly, attempts to run the exact same trajectory twice seem to either work or fail, perhaps depending on the set of nodes that are used?

Similarly, I thought that setting QUDA_ENABLE_P2P=0 helped on the A40 cluster, but given the erratic behaviour I observe on Juwels Booster I'm not so sure any more, and I currently can't run further tests there.
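
For completeness, by "setting QUDA_ENABLE_P2P=0" I simply mean exporting the variable in the job script before the run; equivalently one could force it from the host code, as in the sketch below, assuming QUDA picks the variable up when initQuda() sets up its communication layer.

#include <stdlib.h>
#include <quda.h>

/* sketch: disable peer-to-peer transfers for this run; the variable has to
 * be set before initQuda(), which is where (as far as I know) QUDA's comms
 * layer reads it */
static void init_quda_without_p2p(int device_id)
{
  setenv("QUDA_ENABLE_P2P", "0", 1);
  initQuda(device_id);
}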

If anyone has seen similar behaviour recently, I would be very grateful for any information!
