GPU-Sharing compatibility with MIG #506

@gflatters

Description

Scenario:

  • I have H100 instances that have been split into 20GB slices with MIG (config all-1g.20gb)
  • I have on-demand L4 instances
  • My pods have an affinity such that they would prefer to run on the H100 instances, but can burst onto the on-demand L4 instances
  • My pods specify a 0.5 GPU fraction, with the intention that they either take 0.5 of a 20GB slice of an H100 OR they take 0.5 of an L4
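To make the setup concrete, here is a minimal sketch of the kind of pod spec I mean. The `gpu-fraction` annotation, node label key, and image name are illustrative placeholders, not copied from my actual manifests:

```yaml
# Sketch only: annotation and label names are assumptions, not my exact manifest.
apiVersion: v1
kind: Pod
metadata:
  name: burst-worker
  annotations:
    gpu-fraction: "0.5"          # request half of whatever GPU device is allocated
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          preference:
            matchExpressions:
              - key: nvidia.com/gpu.product   # label key assumed
                operator: In
                values: ["H100"]              # prefer the MIG-sliced H100 nodes
  containers:
    - name: worker
      image: my-training-image:latest         # placeholder image
```

The intent is that the same 0.5 fraction resolves to half of a 1g.20gb MIG slice on the H100 nodes, or half of a full L4 elsewhere.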

When a pod is scheduled onto an H100 instance, running nvidia-smi in the GPU reservation pod shows that it can see a MIG slice. However, the workload pod itself cannot see any GPU resources, and PyTorch code fails to run with the device set to CUDA.
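For anyone trying to reproduce this, here is the small stdlib-only check I run inside the workload pod to see which devices the NVIDIA container runtime actually injected. It only inspects the `NVIDIA_VISIBLE_DEVICES` environment variable; MIG-backed devices are identifiable by the `MIG-` prefix on their UUIDs:

```python
import os


def visible_gpu_devices(env=os.environ):
    """Return the GPU device list injected by the NVIDIA container runtime.

    An empty result means the container has no GPU access at all, even if
    the node itself has GPUs ('void'/'none' are sentinel values used by the
    runtime to disable injection).
    """
    raw = env.get("NVIDIA_VISIBLE_DEVICES", "")
    if raw in ("", "void", "none"):
        return []
    return raw.split(",")


def has_mig_device(devices):
    # MIG device UUIDs are prefixed with 'MIG-' by the NVIDIA driver.
    return any(d.startswith("MIG-") for d in devices)


if __name__ == "__main__":
    devices = visible_gpu_devices()
    print("visible devices:", devices)
    print("MIG-backed:", has_mig_device(devices))
```

On the failing H100 pods this reports an empty device list, while on the L4 pods it lists a full-GPU UUID.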

When a pod is scheduled on one of the on-demand L4 instances, everything works as expected.

Questions:

  • Am I trying to do something that isn't officially supported, i.e. is GPU sharing compatible with MIG?
  • Is there a better way of achieving the intended behaviour?
