Host USB device changes when sample runs #2

@dmc5179

I'm running your device plugin in OpenShift 3.11, which has Kubernetes under the hood. I realize you might not have done any testing with OCP, but I figured you might be able to help. Here is the setup:

  • Physical host (with Edge TPU USB device attached)
  • OCP cluster virtual machines

I am able to connect the TPU to the physical host and run the Python demo code to show it works. I can even assign the USB device to a compute node VM and run the Python demo code inside the VM to show that the VM sees the device and can talk to it.

Before I do anything with the daemonset, I see this on the physical host:

$ lsusb
Bus 002 Device 005: ID 1a6e:089a Global Unichip Corp. 

I then use your YAML to deploy the daemonset. One of the pods in the daemonset shows:

$ oc logs -f edgetpu-device-plugin-52sjt
I0812 16:17:10.264373       1 plugin.go:98] Started gRPC service on plugin socket
I0812 16:17:10.264399       1 plugin.go:101] Started monitoring devices
I0812 16:17:10.264404       1 plugin.go:49] gRPC server started.
I0812 16:17:10.264607       1 plugin.go:118] Opened connection to kubelet socket
I0812 16:17:10.268002       1 server.go:56] Start watching devices
I0812 16:17:10.268025       1 server.go:66] Update a device list
I0812 16:17:10.268092       1 plugin.go:132] Registered device plugin
I0812 16:17:15.369094       1 server.go:150] Edge TPU became active.
I0812 16:17:15.369137       1 server.go:66] Update a device list

So far that all looks good. I then deploy the sample with your YAML file, and its pod logs show:

$ oc logs -f edgetpu-demo-9cb92

ERROR: Failed to retrieve TPU context.
ERROR: Node number 0 (edgetpu-custom-op) failed to prepare.

Failed in Tensor allocation, status_code: 1

And then if I go back to the physical host:

$ lsusb
Bus 002 Device 006: ID 18d1:9302 Google Inc. 

It changed from 002:005 to 002:006, and the reported ID changed from 1a6e:089a (Global Unichip Corp.) to 18d1:9302 (Google Inc.). It is as if the physical host thinks the USB device was disconnected and reconnected. I had seen this before I started using your code: I'd run a container on the VM and it would fail, I'd see that the device had changed on the host, re-add the device to the VM, and run the container on the VM again, and it would work.
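
To make the change explicit, here is a minimal Python sketch that parses the two lsusb lines above and lists which fields differ. The regex and the helper names (parse_lsusb_line, changed_fields) are my own for illustration and are not part of the device plugin; the sketch assumes the usual "Bus NNN Device NNN: ID vvvv:pppp name" line layout shown above.

```python
import re

# Matches one line of `lsusb` output, e.g.
# "Bus 002 Device 005: ID 1a6e:089a Global Unichip Corp."
# (assumed layout; helper names below are hypothetical)
LSUSB_LINE = re.compile(
    r"Bus (?P<bus>\d{3}) Device (?P<dev>\d{3}): ID "
    r"(?P<vid>[0-9a-f]{4}):(?P<pid>[0-9a-f]{4})\s*(?P<name>.*)"
)


def parse_lsusb_line(line):
    """Return the fields of one lsusb line as a dict, or None if it does not match."""
    match = LSUSB_LINE.match(line.strip())
    return match.groupdict() if match else None


def changed_fields(before, after):
    """Return the names of the fields whose values differ between two parsed lines."""
    return [key for key in before if before[key] != after[key]]


# The two lines captured on the physical host, before and after the sample ran.
before = parse_lsusb_line("Bus 002 Device 005: ID 1a6e:089a Global Unichip Corp. ")
after = parse_lsusb_line("Bus 002 Device 006: ID 18d1:9302 Google Inc. ")

print(changed_fields(before, after))
```

On these two lines it reports that the device number, vendor ID, product ID, and name all differ, while the bus number stays the same.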

Would you have any insight into why talking to the device or somehow assigning it to a container causes this name change? Thank you.
