test: eagerly import torch in conftest to fix test-api shard flake #2376
Merged
test-api shard 2/3 intermittently failed collection of dozens of tests with 'RuntimeError: function _has_torch_function already has a docstring'. Root cause: the first import torch in a worker process happened lazily from inside concurrent/async code (embeddings.initialize() -> sentence_transformers -> transformers -> torch, and cross_encoder's ThreadPoolExecutor). torch/overrides.py's C-level _add_docstr is not re-entrancy-safe, so under concurrency torch/overrides.py could execute twice and raise, failing collection of every test on the shard. Fix: import torch once at conftest import time (single-threaded, before any event loop or thread pool), so the registration happens exactly once per xdist worker. Guarded for slim/no-torch environments.
Symptom
`test-api` shard 2/3 intermittently fails, with dozens of tests erroring at collection time with `RuntimeError: function _has_torch_function already has a docstring`.
The count drifts run-to-run (62 → 58 → 34 collection errors), and re-running sometimes clears it, which is typical of a nondeterministic flake. It is unrelated to any individual PR; the errors land on whichever tests xdist schedules onto the affected worker.
Root cause
The first `import torch` in a worker process happens lazily, from inside concurrent/async code: `embeddings.initialize()` -> `sentence_transformers` -> `transformers` -> `torch`. `cross_encoder.py` similarly imports torch lazily from inside a `ThreadPoolExecutor`.
torch's C-level `_add_docstr(...)` registration in `torch/overrides.py` is not re-entrancy-safe: if `torch/overrides.py` gets executed twice in one process (which can happen when the first import is driven from concurrent/threaded code), the second execution raises because the docstring is already set. That kills collection of every test on the shard.
Fix
Import `torch` once, in the main thread, at conftest import time, before any fixture spins up an event loop or sentence-transformers' thread pools. This makes the `_add_docstr` registration happen exactly once per xdist worker process; every later lazy `import torch` is then just a `sys.modules` hit. Guarded with `try/except ImportError` so slim/no-torch environments still collect.
Test-only change (`tests/conftest.py`); no production code touched.
Verification
- `ruff check` clean (the `# noqa: F401` keeps the intentional import).
- `test_link_utils.py`: 40 passed.
- `test-api` shards in CI on this PR.
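For reference, the guarded eager import described under Fix might look roughly like the sketch below. The actual diff is not reproduced here, so this is an illustration under assumptions rather than the merged code; in particular, the `TORCH_PRELOADED` flag is a hypothetical addition used only to make the outcome observable.

```python
# tests/conftest.py (sketch, not the merged diff)
import sys

# Import torch once, at conftest import time. pytest imports conftest.py in
# the main thread of each xdist worker before any fixture runs, so this is
# the first import of torch in the process.
try:
    import torch  # noqa: F401  (imported only for its side effects)
except ImportError:
    # Slim/no-torch environments: skip the preload so collection still works.
    pass

# Hypothetical flag, not part of the PR: records whether the preload succeeded.
TORCH_PRELOADED = "torch" in sys.modules
```

With torch already in `sys.modules`, a later lazy `import torch` from async code or a thread pool resolves to the cached module instead of executing `torch/overrides.py` again.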