Description
I'm running your device plugin in OpenShift 3.11, which has Kubernetes under the hood. I realize you might not have done any testing with OCP, but I figured you might be able to help. Here is the setup:
- Physical host (with the Edge TPU USB device attached)
- OCP cluster virtual machines
I am able to connect the TPU to the physical host and run the Python demo code to show it works. I can also assign the USB device to a compute node VM and run the Python demo code inside the VM, which shows that the VM sees the device and can talk to it.
Before I do anything with the daemonset I see this on the physical host:
$ lsusb
Bus 002 Device 005: ID 1a6e:089a Global Unichip Corp.
I then use your yaml to deploy the daemonset. One of the pods in the daemonset shows:
oc logs -f edgetpu-device-plugin-52sjt
I0812 16:17:10.264373 1 plugin.go:98] Started gRPC service on plugin socket
I0812 16:17:10.264399 1 plugin.go:101] Started monitoring devices
I0812 16:17:10.264404 1 plugin.go:49] gRPC server started.
I0812 16:17:10.264607 1 plugin.go:118] Opened connection to kubelet socket
I0812 16:17:10.268002 1 server.go:56] Start watching devices
I0812 16:17:10.268025 1 server.go:66] Update a device list
I0812 16:17:10.268092 1 plugin.go:132] Registered device plugin
I0812 16:17:15.369094 1 server.go:150] Edge TPU became active.
I0812 16:17:15.369137 1 server.go:66] Update a device list
So far that all looks good. I then deploy the sample with your yaml file and it comes back with:
oc logs -f edgetpu-demo-9cb92
ERROR: Failed to retrieve TPU context.
ERROR: Node number 0 (edgetpu-custom-op) failed to prepare.
Failed in Tensor allocation, status_code: 1
And then if I go back to the physical host:
lsusb
Bus 002 Device 006: ID 18d1:9302 Google Inc.
It changed from 002:005 to 002:006. It is as if the physical host thinks the USB device was disconnected and reconnected. I have seen this before, prior to using your code: I'd run a container on the VM and it would fail, I'd see that the device had changed on the host, I'd re-add the device to the VM and run the container on the VM again, and then it would work.
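For reference, here is a minimal Python sketch that compares the two `lsusb` lines above field by field. It assumes the usual `Bus BBB Device DDD: ID vvvv:pppp description` layout; `parse_lsusb_line` and `diff_lsusb` are hypothetical helper names for illustration and are not part of the plugin.

```python
import re

# Assumed layout of one lsusb line:
#   "Bus BBB Device DDD: ID vvvv:pppp description"
LSUSB_RE = re.compile(
    r"Bus (?P<bus>\d{3}) Device (?P<dev>\d{3}): "
    r"ID (?P<vid>[0-9a-fA-F]{4}):(?P<pid>[0-9a-fA-F]{4})\s*(?P<name>.*)"
)


def parse_lsusb_line(line):
    """Split one lsusb line into bus, dev, vid, pid and name (hypothetical helper)."""
    match = LSUSB_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not an lsusb line: {line!r}")
    return match.groupdict()


def diff_lsusb(before, after):
    """Return {field: (old, new)} for every field that differs between two lines."""
    old = parse_lsusb_line(before)
    new = parse_lsusb_line(after)
    return {key: (old[key], new[key]) for key in old if old[key] != new[key]}


# The two lines copied from the physical host, before and after the demo pod ran.
before = "Bus 002 Device 005: ID 1a6e:089a Global Unichip Corp."
after = "Bus 002 Device 006: ID 18d1:9302 Google Inc."

changes = diff_lsusb(before, after)
for field, (old_value, new_value) in changes.items():
    print(f"{field}: {old_value} -> {new_value}")
```

On the two lines above it reports that the device number, the vendor and product IDs, and the description string all differ, while the bus number stays the same.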
Would you have any insight into why talking to the device or somehow assigning it to a container causes this name change? Thank you.