Skip to content
Draft
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
5 changes: 4 additions & 1 deletion cmd/cloud_orchestrator/main.go
Original file line number Diff line number Diff line change
Expand Up @@ -68,7 +68,10 @@ func LoadInstanceManager(config *config.Config) instances.Manager {
if err != nil {
log.Fatal("Failed to get docker client: ", err)
}
im = instances.NewDockerInstanceManager(config.InstanceManager, cli)
im, err = instances.NewDockerInstanceManager(config.InstanceManager, cli)
if err != nil {
log.Fatal("Failed to create Docker Instance Manager: ", err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function should return (instances.Manager, error) if it can fail like this.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I used log.Fatal to keep consistency, as other place in this file does. Would you want me to refactor here?

}
default:
log.Fatal("Unknown Instance Manager type: ", config.InstanceManager.Type)
}
Expand Down
24 changes: 22 additions & 2 deletions pkg/app/instances/docker.go
Original file line number Diff line number Diff line change
Expand Up @@ -41,9 +41,12 @@ const DockerIMType IMType = "docker"

type DockerIMConfig struct {
DockerImageName string
GpuManufacturer string
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Avoid terms like manufacturer that are not part of the docker documentation https://docs.docker.com/desktop/features/gpu. Try to use similar names used in the documentation that docker users are already familiar with

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My suggestion is equivalant to --env NVIDIA_DRIVER_CAPABILITIES=all --gpus all --runtime nvidia in terms of executing docker run, and I don't think there's a proper name to represent my purpose on docker documentation.

--env and --runtime looks fine to expose into CO configuration, but --gpus in docker run directly specifies GPU allocation. I think --gpus shouldn't be exposed into CO configuration, since DockerIM runs multiple docker instances and we probably want to consider GPU allocation by CO later. I don't want to design how DockerIM allocates GPU right now, as it's pretty complicated to consider details of nvidia GPUs.

So, I need to define a new name to pass information whether DockerIM will utilize GPU or not. Retrieving such information by parsing --env or --runtime is appropriate. Or, considering boolean configuration such as UseNvidiaGpu looks valid to me. If CO configuration for GPU is representative enough to set --env or --runtime, I think we don't need to define new configurations in advance which can make compatibility issue in the far future.

Copy link
Member

@ser-io ser-io Nov 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's solve #494 (comment) first.

Copy link
Member

@ser-io ser-io Nov 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ability to create docker instances using GPU should be part of the CO public API, not hidden as a CO configuration. Please explain your case why do you want to hide this ability from the end users.

For reference, the ability to add accelerators is part of the public API for GCE hosts. See #319. Also the gcloud and docker CLIs follow the same principle. Going the opposite way here should be properly justified.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I cannot say it's under same principle because of the way how docker run --runtime works. Valid values of --runtime flag relies on the dockerd configuration setup. At least, this isn't proper to be exposed into cvdr CLI, but should be in CO configuration.

$ sudo nvidia-ctk runtime configure --runtime=docker # Modifies /etc/docker/daemon.json with adding a new runtime.
$ cat /etc/docker/daemon.json
{
    "runtimes": {
        "nvidia": {
            "args": [],
            "path": "nvidia-container-runtime"
        }
    }
}
$ sudo systemctl restart docker
# Then users can execute `docker run --runtime nvidia [args]`

On the other hand, I think it's a bit complicated to make agreement from here... I'll propose a design around GPU utilization when I have time, perhaps with GPU allocation mechanism too.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good.

HostOrchestratorPort int
}

const gpuManufacturerNvidia = "nvidia"

const (
dockerLabelCreatedBy = "created_by"
dockerLabelKeyManagedBy = "managed_by"
Expand All @@ -66,11 +69,14 @@ const (
DeleteHostOPType OPType = "deletehost"
)

func NewDockerInstanceManager(cfg Config, cli *client.Client) *DockerInstanceManager {
func NewDockerInstanceManager(cfg Config, cli *client.Client) (*DockerInstanceManager, error) {
if cfg.Docker.GpuManufacturer != "" && cfg.Docker.GpuManufacturer != gpuManufacturerNvidia {
return nil, fmt.Errorf("unsupported GPU manufacturer: %q", cfg.Docker.GpuManufacturer)
}
return &DockerInstanceManager{
Config: cfg,
Client: cli,
}
}, nil
}

func (m *DockerInstanceManager) ListZones() (*apiv1.ListZonesResponse, error) {
Expand Down Expand Up @@ -371,6 +377,9 @@ func (m *DockerInstanceManager) createDockerContainer(ctx context.Context, user
Tty: true,
Labels: dockerLabelsDict(user),
}
if m.Config.Docker.GpuManufacturer == gpuManufacturerNvidia {
config.Env = []string{"NVIDIA_DRIVER_CAPABILITIES=all"}
}
hostConfig := &container.HostConfig{
Mounts: []mount.Mount{
{
Expand All @@ -381,6 +390,17 @@ func (m *DockerInstanceManager) createDockerContainer(ctx context.Context, user
},
Privileged: true,
}
if m.Config.Docker.GpuManufacturer == gpuManufacturerNvidia {
hostConfig.Resources = container.Resources{
DeviceRequests: []container.DeviceRequest{
{
Count: -1,
Capabilities: [][]string{{"gpu"}},
},
},
}
hostConfig.Runtime = "nvidia"
}
createRes, err := m.Client.ContainerCreate(ctx, config, hostConfig, nil, nil, "")
if err != nil {
return "", fmt.Errorf("failed to create docker container: %w", err)
Expand Down
Loading