GPU memory errors lead to hanging GPU #14
Description
Hello, when using this plugin I was able to run PyTorch models on a shared GPU and everything works smoothly. But in some cases, when one pod starts using a lot of memory, instead of the classic CUDA out-of-memory error the GPU starts spitting:
CUDA error: an illegal memory access was encountered
Once it hits that point, it is impossible to access the GPU through PyTorch anymore. The problem could not be reproduced with the standard nvidia-plugin, hence this issue.
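For reference, a minimal sketch of the kind of workload that triggers it (the allocation sizes are arbitrary placeholders; the point is simply that one client over-allocates on the shared GPU):

```python
import torch

# Rough repro sketch (assumption: this runs inside one pod that shares the
# GPU through the plugin / MPS; block sizes are arbitrary placeholders).
device = torch.device("cuda")
blocks = []
try:
    while True:
        # Each block is roughly 1 GiB of float32, so a few iterations
        # exceed the pod's share of the T4.
        blocks.append(torch.empty(256, 1024, 1024, device=device))
except RuntimeError as err:
    # On a dedicated GPU this raises the classic "CUDA out of memory" error;
    # under the shared setup it instead surfaces as
    # "CUDA error: an illegal memory access was encountered",
    # after which every client on the device is stuck.
    print(err)
```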
- Using T4 on g4dn instances.
- nvidia-smi runs correctly (shows the GPU with near-zero memory usage, 0% utilization, and the E. Process entry).
- python -c "import torch; torch.cuda.is_available()" hangs instead, and it's impossible to send data to the GPU either: torch.zeros((2, 2)).cuda() hangs too. Even Ctrl+C cannot kill the process at that point (a hang-safe check is sketched after the logs below).
- The plugin DaemonSet logs:
[2021-01-22 16:48:44.478 Other 70] Receive command failed, assuming client exit
[2021-01-22 16:48:44.478 Other 70] Volta MPS: Client disconnected
[2021-01-22 16:48:44.493 Other 70] Receive command failed, assuming client exit
[2021-01-22 16:48:44.493 Other 70] Volta MPS: Client disconnected
[2021-01-22 16:48:47.756 Control 1] Accepting connection...
[2021-01-22 16:48:47.756 Control 1] User did not send valid credentials
[2021-01-22 16:48:47.756 Control 1] Accepting connection...
[2021-01-22 16:48:47.756 Control 1] NEW CLIENT 0 from user 0: Server already exists
[2021-01-22 16:48:47.756 Other 70] Volta MPS Server: Received new client request
[2021-01-22 16:48:47.757 Other 70] MPS Server: worker created
[2021-01-22 16:48:47.757 Other 70] Volta MPS: Creating worker thread
[2021-01-22 16:49:00.586 Control 1] Accepting connection...
[2021-01-22 16:49:00.586 Control 1] NEW CLIENT 0 from user 0: Server already exists
[2021-01-22 16:49:00.586 Other 70] Volta MPS Server: Received new client request
[2021-01-22 16:49:00.586 Other 70] MPS Server: worker created
[2021-01-22 16:49:00.586 Other 70] Volta MPS: Creating worker thread
[2021-01-22 16:49:00.586 Other 70] Volta MPS: Device Tesla T4 (uuid 0x77b4f43a-0x2b8e3746-0x77cf831c-0x9b534399) is associated
[2021-01-22 16:49:07.339 Control 1] Accepting connection...
[2021-01-22 16:49:07.339 Control 1] NEW CLIENT 0 from user 0: Server already exists
[2021-01-22 16:49:07.339 Other 70] Volta MPS Server: Received new client request
[2021-01-22 16:49:07.339 Other 70] MPS Server: worker created
[2021-01-22 16:49:07.339 Other 70] Volta MPS: Creating worker thread
[2021-01-22 16:49:07.339 Other 70] Volta MPS: Device Tesla T4 (uuid 0x77b4f43a-0x2b8e3746-0x77cf831c-0x9b534399) is associated
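Since Ctrl+C cannot interrupt the hung check, the only hang-safe way I found to probe the GPU is to run the check in a child process with a timeout, roughly like this sketch (the 30 s timeout is an arbitrary choice):

```python
import subprocess

# Hang-safe probe sketch: run the CUDA check in a child process with a
# timeout instead of in-process, since the in-process call hangs and
# ignores Ctrl+C.
CHECK = "import torch; print(torch.cuda.is_available())"
try:
    result = subprocess.run(
        ["python", "-c", CHECK],
        capture_output=True,
        text=True,
        timeout=30,
    )
    print("CUDA check:", result.stdout.strip())
except subprocess.TimeoutExpired:
    # This is the wedged state described above; note the child may be left
    # behind even after the kill signal sent on timeout.
    print("CUDA check timed out -- GPU looks wedged")
```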
PyTorch does not yet support memory limitation (https://github.com/pytorch/pytorch/blob/5f07b53ec2074fc9fd6b4fe72d6cee4d484b917a/torch/cuda/memory.py#L75).
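If/once PyTorch exposes such a cap (later releases, 1.8 and up, added torch.cuda.set_per_process_memory_fraction, which did not exist at the time of the linked commit), limiting each pod's share would look roughly like this sketch; the 0.25 fraction is an assumed value:

```python
import torch

# Sketch only: torch.cuda.set_per_process_memory_fraction is available in
# later PyTorch releases (1.8+), not at the time of the linked commit.
# The 0.25 share is an assumed value for illustration.
if hasattr(torch.cuda, "set_per_process_memory_fraction"):
    torch.cuda.set_per_process_memory_fraction(0.25, device=0)
    # Allocations beyond the fraction then raise a regular out-of-memory
    # RuntimeError in the offending pod instead of exhausting the device.
```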
Is there any way, in the meantime, to understand/debug the problem and to clear the GPU/the pod/the DaemonSet so that they start working again? (And what are those illegal accesses?) Do you need any more info?
Nuking both the DaemonSet and the guilty pod seems to work, but it's a bit heavy-handed at the moment.
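For completeness, the heavy-handed recovery I use today, as a sketch (all names and namespaces below are hypothetical placeholders, not the plugin's real ones):

```python
import subprocess

# Workaround sketch: automate the "nuke the DaemonSet and the guilty pod"
# recovery. All names and namespaces below are hypothetical placeholders.
GUILTY_POD = "training-pod-0"                  # hypothetical pod name
POD_NAMESPACE = "default"                      # hypothetical namespace
PLUGIN_DS = "daemonset/gpu-device-plugin"      # hypothetical DaemonSet name
PLUGIN_NAMESPACE = "kube-system"               # hypothetical namespace

# Delete the pod that triggered the illegal memory access.
subprocess.run(
    ["kubectl", "delete", "pod", GUILTY_POD, "-n", POD_NAMESPACE],
    check=True,
)
# Restart the plugin DaemonSet so its pods (and the MPS server) come back clean.
subprocess.run(
    ["kubectl", "rollout", "restart", PLUGIN_DS, "-n", PLUGIN_NAMESPACE],
    check=True,
)
```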