Fix issues when running multiple LoRA tests on the v6e-8 machine. #926
Description
Problem: When I run my CI test https://github.com/vllm-project/tpu-inference/blob/main/tests/lora/test_lora.py on the v6e-8 CI machine, the first test succeeds and the second one fails with the error
jaxlib._jax.XlaRuntimeError: UNKNOWN: TPU initialization failed: open(/dev/vfio/1): Device or resource busy: Device or resource busy; Couldn't open iommu group /dev/vfio/1. It appears that the first test does not release the TPU device after it finishes. The failure is persistent: in my local environment I can reproduce it roughly 70% of the time, but I could not reproduce it on the v6e-1 CI machine. This PR intends to fix the above problem by making sure the TPU device is released between tests; a hedged sketch of that idea is shown below.
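The following is a minimal, illustrative sketch of one way to release the TPU between tests in a pytest suite. The fixture name and the specific cleanup calls (jax.clear_caches plus forced garbage collection) are assumptions for illustration, not necessarily what this PR implements.

```python
# Hypothetical sketch: ensure the TPU client is released after each test so the
# next test can re-open /dev/vfio/*. Names and cleanup steps are illustrative
# assumptions, not the actual change in this PR.
import gc

import jax
import pytest


@pytest.fixture(autouse=True)
def release_tpu_between_tests():
    yield
    # Drop JAX's compiled-program caches, then force Python garbage collection
    # so that objects holding the TPU client (and its /dev/vfio handles) are
    # destroyed before the next test initializes the TPU.
    jax.clear_caches()
    gc.collect()
```

An autouse fixture keeps the cleanup out of the individual tests, which matters here because the failure only appears when a second test tries to initialize the TPU in the same process or CI run.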
Tests
CI
Checklist
Before submitting this PR, please make sure: