Scenario:
- I have H100 instances that have been split into 20GB slices with MIG (config all-1g.20gb; a mig-parted sketch follows this list)
- I have on-demand L4 instances
- My pods have an affinity such that they prefer to run on the H100 instances but can burst onto the on-demand L4 instances
- My pods specify a 0.5 GPU fraction, with the intention that they take either 0.5 of a 20GB MIG slice of an H100 or 0.5 of an L4 (a pod-spec sketch also follows this list)
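
For reference, the MIG layout in the first bullet corresponds to an `nvidia mig-parted` config along these lines. This is a sketch assuming H100 80GB cards, where four `1g.20gb` instances fit per GPU:

```yaml
version: v1
mig-configs:
  all-1g.20gb:
    - devices: all        # apply to every GPU on the node
      mig-enabled: true
      mig-devices:
        "1g.20gb": 4      # four 20GB instances per H100 80GB
```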
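And a minimal sketch of the pod spec the scenario implies. The `gpu-fraction` annotation and `kai-scheduler` scheduler name follow KAI-Scheduler conventions, and the `nvidia.com/gpu.product` label comes from GPU Feature Discovery; these are assumptions, so substitute whatever your cluster actually uses:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: fractional-gpu-pod
  annotations:
    gpu-fraction: "0.5"            # request half a GPU / half a MIG slice
spec:
  schedulerName: kai-scheduler     # assumed scheduler name; adjust to your setup
  affinity:
    nodeAffinity:
      # Prefer H100 nodes, but allow falling back to other GPU nodes (e.g. L4).
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          preference:
            matchExpressions:
              - key: nvidia.com/gpu.product   # GPU Feature Discovery node label
                operator: In
                values:
                  - NVIDIA-H100-80GB-HBM3     # hypothetical label value; check your nodes
  containers:
    - name: app
      image: nvcr.io/nvidia/pytorch:24.08-py3
      command: ["python", "-c", "import torch; print(torch.cuda.is_available())"]
```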
When a pod is scheduled onto an H100 instance, running nvidia-smi in the GPU reservation pod shows that it can see a MIG slice. However, the workload pod itself cannot see any GPU resources, and PyTorch code fails to run with the device set to cuda.
When a pod is scheduled on one of the on-demand L4 instances, everything works as expected.
Questions:
- Am I trying to do something that isn't officially supported, i.e. is GPU sharing compatible with MIG?
- Is there a better way of achieving the intended behaviour?