Scenario:
- I have H100 instances that have been split into 20GB slices with MIG (config all-1g.20gb; a mig-parted sketch follows this list)
- I have on-demand L4 instances
- My pods have an affinity such that they prefer to run on the H100 instances but can burst onto the on-demand L4 instances
- My pods specify a 0.5 GPU fraction, with the intention that they take either 0.5 of a 20GB MIG slice of an H100 or 0.5 of an L4 (a pod-spec sketch also follows this list)
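
For reference, the MIG layout in the first bullet corresponds to an `nvidia mig-parted` config along these lines. This is a sketch assuming H100 80GB cards, where four `1g.20gb` instances fit per GPU:

```yaml
version: v1
mig-configs:
  all-1g.20gb:
    - devices: all        # apply to every GPU on the node
      mig-enabled: true
      mig-devices:
        "1g.20gb": 4      # four 20GB instances per H100 80GB
```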
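And a minimal sketch of the pod spec the scenario implies. The `gpu-fraction` annotation and `kai-scheduler` scheduler name follow KAI-Scheduler conventions, and the `nvidia.com/gpu.product` label comes from GPU Feature Discovery; these are assumptions, so substitute whatever your cluster actually uses:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: fractional-gpu-pod
  annotations:
    gpu-fraction: "0.5"            # request half a GPU / half a MIG slice
spec:
  schedulerName: kai-scheduler     # assumed scheduler name; adjust to your setup
  affinity:
    nodeAffinity:
      # Prefer H100 nodes, but allow falling back to other GPU nodes (e.g. L4).
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          preference:
            matchExpressions:
              - key: nvidia.com/gpu.product   # GPU Feature Discovery node label
                operator: In
                values:
                  - NVIDIA-H100-80GB-HBM3     # hypothetical label value; check your nodes
  containers:
    - name: app
      image: nvcr.io/nvidia/pytorch:24.08-py3
      command: ["python", "-c", "import torch; print(torch.cuda.is_available())"]
```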
When a pod is scheduled onto an H100 instance, running nvidia-smi in the GPU reservation pod shows that it can see a MIG slice. However, the workload pod itself cannot see any GPU resources, and PyTorch code fails to run with the device set to cuda.
When a pod is scheduled on one of the on-demand L4 instances, everything works as expected.
Questions:
- Am I trying to do something that isn't officially supported, i.e. is GPU sharing compatible with MIG?
- Is there a better way of achieving the intended behaviour?