This repository was archived by the owner on Nov 20, 2024. It is now read-only.

GPU memory errors lead to hanging GPU #14

@Narsil

Description


Hello, when using this plugin I was able to run PyTorch models on a shared GPU and everything worked smoothly. In some cases, however, when one pod starts using a lot of memory, instead of the classic CUDA out-of-memory error the GPU starts spitting:
CUDA error: an illegal memory access was encountered.

Once it hits that point, it is impossible to access the GPU through PyTorch anymore.
The problem could not be reproduced with the standard nvidia-plugin, hence this issue.

  • Using T4 GPUs on g4dn instances.
  • nvidia-smi runs correctly (it shows the GPU with near 0 memory usage, 0% utilization, and E. Process compute mode).
  • python -c "import torch; torch.cuda.is_available()" hangs instead, and it is impossible to send data to the GPU either (torch.zeros((2, 2)).cuda() hangs too). Even Ctrl+C cannot kill the process at that point; see the probe sketch after the logs below.
  • the plugin daemonset logs:
[2021-01-22 16:48:44.478 Other    70] Receive command failed, assuming client exit
[2021-01-22 16:48:44.478 Other    70] Volta MPS: Client disconnected
[2021-01-22 16:48:44.493 Other    70] Receive command failed, assuming client exit
[2021-01-22 16:48:44.493 Other    70] Volta MPS: Client disconnected
[2021-01-22 16:48:47.756 Control     1] Accepting connection...
[2021-01-22 16:48:47.756 Control     1] User did not send valid credentials
[2021-01-22 16:48:47.756 Control     1] Accepting connection...
[2021-01-22 16:48:47.756 Control     1] NEW CLIENT 0 from user 0: Server already exists
[2021-01-22 16:48:47.756 Other    70] Volta MPS Server: Received new client request
[2021-01-22 16:48:47.757 Other    70] MPS Server: worker created
[2021-01-22 16:48:47.757 Other    70] Volta MPS: Creating worker thread
[2021-01-22 16:49:00.586 Control     1] Accepting connection...
[2021-01-22 16:49:00.586 Control     1] NEW CLIENT 0 from user 0: Server already exists
[2021-01-22 16:49:00.586 Other    70] Volta MPS Server: Received new client request
[2021-01-22 16:49:00.586 Other    70] MPS Server: worker created
[2021-01-22 16:49:00.586 Other    70] Volta MPS: Creating worker thread
[2021-01-22 16:49:00.586 Other    70] Volta MPS: Device Tesla T4 (uuid 0x77b4f43a-0x2b8e3746-0x77cf831c-0x9b534399) is associated
[2021-01-22 16:49:07.339 Control     1] Accepting connection...
[2021-01-22 16:49:07.339 Control     1] NEW CLIENT 0 from user 0: Server already exists
[2021-01-22 16:49:07.339 Other    70] Volta MPS Server: Received new client request
[2021-01-22 16:49:07.339 Other    70] MPS Server: worker created
[2021-01-22 16:49:07.339 Other    70] Volta MPS: Creating worker thread
[2021-01-22 16:49:07.339 Other    70] Volta MPS: Device Tesla T4 (uuid 0x77b4f43a-0x2b8e3746-0x77cf831c-0x9b534399) is associated
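
In case it helps with debugging, here is a minimal, hypothetical probe (not part of the plugin, names are mine) that runs the CUDA check in a child process with a timeout, so a wedged GPU does not leave the calling shell stuck the way the python -c one-liner does:

```python
# probe_gpu.py -- hypothetical helper to detect the wedged-GPU state.
import multiprocessing as mp

def _cuda_probe(q):
    # Import torch inside the child so the parent never touches CUDA itself.
    import torch
    ok = torch.cuda.is_available()
    # A tiny allocation exercises the same path that hangs for us.
    torch.zeros((2, 2)).cuda()
    q.put(ok)

def gpu_is_healthy(timeout=15):
    ctx = mp.get_context("spawn")  # fresh interpreter / CUDA context per probe
    q = ctx.Queue()
    p = ctx.Process(target=_cuda_probe, args=(q,), daemon=True)
    p.start()
    p.join(timeout)
    if p.is_alive():
        # SIGKILL the stuck child; note this can itself fail if the process
        # is blocked in an uninterruptible wait inside the driver.
        p.kill()
        p.join(5)
        return False
    return (not q.empty()) and q.get()

if __name__ == "__main__":
    print("GPU healthy:", gpu_is_healthy())
```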

PyTorch does not yet support memory limitation (https://github.com/pytorch/pytorch/blob/5f07b53ec2074fc9fd6b4fe72d6cee4d484b917a/torch/cuda/memory.py#L75).
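
(For what it's worth, a per-process cap of that sort does exist in newer PyTorch releases as torch.cuda.set_per_process_memory_fraction, added around 1.8. A minimal sketch of how a pod could cap itself once it can use that API; the 0.25 fraction is illustrative only:)

```python
import torch

# Hedged sketch: assumes torch.cuda.set_per_process_memory_fraction is
# available (PyTorch >= 1.8). The 0.25 cap is illustrative, not a recommendation.
if torch.cuda.is_available():
    device = torch.device("cuda:0")
    # Limit this process's caching allocator to ~25% of the device memory.
    torch.cuda.set_per_process_memory_fraction(0.25, device=device)
    try:
        # Exceeding the cap raises a regular "CUDA out of memory"
        # RuntimeError that the pod can catch and handle.
        too_big = torch.empty((1 << 34,), device=device)
    except RuntimeError as err:
        print("allocation rejected:", err)
```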

Is there any way, in the meantime, to understand/debug the problem and to clear the GPU/the pod/the daemonset so that they start working again? (And what are those illegal accesses?)
Do you need any more info?

Nuking both the daemonset and the guilty pod seems to work, but that is a bit heavy-handed at the moment.
