This repository was archived by the owner on Nov 20, 2024. It is now read-only.

GPU memory errors lead to hanging GPU #14

@Narsil

Description


Hello, when using this plugin I was able to run PyTorch models on a shared GPU and everything worked smoothly. In some cases, however, when one pod starts using a lot of memory, instead of the classic CUDA out-of-memory error the GPU starts spitting:
CUDA error: an illegal memory access was encountered.

Once it hits that point, it is impossible to access the GPU through PyTorch anymore.
The problem could not be reproduced with the standard nvidia-plugin, hence this issue.

  • Using T4 GPUs on g4dn instances.
  • nvidia-smi runs correctly (it shows the GPU with near 0 memory usage, 0% utilization, and E. Process compute mode).
  • python -c "import torch; torch.cuda.is_available()" hangs instead, and it is impossible to send data to the GPU either (torch.zeros((2, 2)).cuda() hangs too). Even Ctrl+C cannot kill the process at that point; see the probe sketch after the logs below.
  • the plugin daemonset logs:
[2021-01-22 16:48:44.478 Other    70] Receive command failed, assuming client exit
[2021-01-22 16:48:44.478 Other    70] Volta MPS: Client disconnected
[2021-01-22 16:48:44.493 Other    70] Receive command failed, assuming client exit
[2021-01-22 16:48:44.493 Other    70] Volta MPS: Client disconnected
[2021-01-22 16:48:47.756 Control     1] Accepting connection...
[2021-01-22 16:48:47.756 Control     1] User did not send valid credentials
[2021-01-22 16:48:47.756 Control     1] Accepting connection...
[2021-01-22 16:48:47.756 Control     1] NEW CLIENT 0 from user 0: Server already exists
[2021-01-22 16:48:47.756 Other    70] Volta MPS Server: Received new client request
[2021-01-22 16:48:47.757 Other    70] MPS Server: worker created
[2021-01-22 16:48:47.757 Other    70] Volta MPS: Creating worker thread
[2021-01-22 16:49:00.586 Control     1] Accepting connection...
[2021-01-22 16:49:00.586 Control     1] NEW CLIENT 0 from user 0: Server already exists
[2021-01-22 16:49:00.586 Other    70] Volta MPS Server: Received new client request
[2021-01-22 16:49:00.586 Other    70] MPS Server: worker created
[2021-01-22 16:49:00.586 Other    70] Volta MPS: Creating worker thread
[2021-01-22 16:49:00.586 Other    70] Volta MPS: Device Tesla T4 (uuid 0x77b4f43a-0x2b8e3746-0x77cf831c-0x9b534399) is associated
[2021-01-22 16:49:07.339 Control     1] Accepting connection...
[2021-01-22 16:49:07.339 Control     1] NEW CLIENT 0 from user 0: Server already exists
[2021-01-22 16:49:07.339 Other    70] Volta MPS Server: Received new client request
[2021-01-22 16:49:07.339 Other    70] MPS Server: worker created
[2021-01-22 16:49:07.339 Other    70] Volta MPS: Creating worker thread
[2021-01-22 16:49:07.339 Other    70] Volta MPS: Device Tesla T4 (uuid 0x77b4f43a-0x2b8e3746-0x77cf831c-0x9b534399) is associated
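
In case it helps with debugging, here is a minimal, hypothetical probe (not part of the plugin, names are mine) that runs the CUDA check in a child process with a timeout, so a wedged GPU does not leave the calling shell stuck the way the python -c one-liner does:

```python
# probe_gpu.py -- hypothetical helper to detect the wedged-GPU state.
import multiprocessing as mp

def _cuda_probe(q):
    # Import torch inside the child so the parent never touches CUDA itself.
    import torch
    ok = torch.cuda.is_available()
    # A tiny allocation exercises the same path that hangs for us.
    torch.zeros((2, 2)).cuda()
    q.put(ok)

def gpu_is_healthy(timeout=15):
    ctx = mp.get_context("spawn")  # fresh interpreter / CUDA context per probe
    q = ctx.Queue()
    p = ctx.Process(target=_cuda_probe, args=(q,), daemon=True)
    p.start()
    p.join(timeout)
    if p.is_alive():
        # SIGKILL the stuck child; note this can itself fail if the process
        # is blocked in an uninterruptible wait inside the driver.
        p.kill()
        p.join(5)
        return False
    return (not q.empty()) and q.get()

if __name__ == "__main__":
    print("GPU healthy:", gpu_is_healthy())
```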

PyTorch does not yet support memory limitation (https://github.com/pytorch/pytorch/blob/5f07b53ec2074fc9fd6b4fe72d6cee4d484b917a/torch/cuda/memory.py#L75).
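
(For what it's worth, a per-process cap of that sort does exist in newer PyTorch releases as torch.cuda.set_per_process_memory_fraction, added around 1.8. A minimal sketch of how a pod could cap itself once it can use that API; the 0.25 fraction is illustrative only:)

```python
import torch

# Hedged sketch: assumes torch.cuda.set_per_process_memory_fraction is
# available (PyTorch >= 1.8). The 0.25 cap is illustrative, not a recommendation.
if torch.cuda.is_available():
    device = torch.device("cuda:0")
    # Limit this process's caching allocator to ~25% of the device memory.
    torch.cuda.set_per_process_memory_fraction(0.25, device=device)
    try:
        # Exceeding the cap raises a regular "CUDA out of memory"
        # RuntimeError that the pod can catch and handle.
        too_big = torch.empty((1 << 34,), device=device)
    except RuntimeError as err:
        print("allocation rejected:", err)
```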

Is there any way, in the meantime, to understand/debug the problem and to clear the GPU/the pod/the daemonset so that they start working again? (And what are those illegal accesses?)
Do you need any more info?

Nuking both the daemonset and the guilty pod seems to work, but that is a bit heavy-handed at the moment.
