test: eagerly import torch in conftest to fix test-api shard flake #2376

Merged
nicoloboschi merged 1 commit into main from investigate/test-api-torch-docstring-flake
Jun 25, 2026

@nicoloboschi (Collaborator)

Symptom

test-api shard 2/3 intermittently fails, with dozens of tests erroring at collection time:

RuntimeError: function '_has_torch_function' already has a docstring
../.venv/.../torch/overrides.py:1778: RuntimeError

The count drifts run-to-run (62 → 58 → 34 collection errors), and re-running sometimes clears it — a classic nondeterministic flake. It is unrelated to any individual PR; it just happens to land on whichever tests xdist schedules onto the affected worker.

Root cause

The first import torch in a worker process happens lazily, from inside concurrent/async code:

tests/conftest.py:415  embeddings fixture -> emb.initialize()
hindsight_api/engine/embeddings.py:174  from sentence_transformers import SentenceTransformer
   -> transformers -> import torch
   -> torch/_tensor.py -> torch/overrides.py:1778
        has_torch_function = _add_docstr(_has_torch_function, "...")   # <-- raises

cross_encoder.py similarly imports torch lazily from inside a ThreadPoolExecutor. torch's C-level _add_docstr(...) registration in torch/overrides.py is not re-entrancy-safe — if torch/overrides.py gets executed twice in one process (which happens when the first import is driven from concurrent/threaded code), the second execution raises because the docstring is already set. That kills collection of every test on the shard.

Fix

Import torch once, in the main thread, at conftest import time — before any fixture spins up an event loop or sentence-transformers' thread pools. This makes the _add_docstr registration happen exactly once per xdist worker process; every later lazy import torch is then just a sys.modules hit. Guarded with try/except ImportError so slim/no-torch environments still collect.

Test-only change (tests/conftest.py); no production code touched.
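
A minimal sketch of what the guarded import described above might look like at module level in tests/conftest.py. The exact lines in the merged commit may differ; the TORCH_PRELOADED name and the torch = None fallback are hypothetical additions used here only to make the effect observable.

```python
# Module level of a conftest-style file: runs in the main thread at import time,
# before any fixture starts an event loop or a thread pool.
import sys

try:
    import torch  # noqa: F401  (imported only for its one-time side effects)
except ImportError:
    # Slim/no-torch environments: collection must still work.
    torch = None

# Hypothetical flag: True when torch is now cached, so later lazy imports
# resolve from sys.modules instead of executing torch's modules again.
TORCH_PRELOADED = "torch" in sys.modules
```

Because pytest imports conftest.py in each xdist worker before running fixtures, placing the import here means it happens once per worker process.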

Verification

  • ruff check clean (the # noqa: F401 keeps the intentional import).
  • Local suite still collects and runs (test_link_utils.py: 40 passed).
  • The flake is nondeterministic, so the real signal is repeated-green test-api shards in CI on this PR.

test-api shard 2/3 intermittently failed collection of dozens of tests
with 'RuntimeError: function _has_torch_function already has a docstring'.

Root cause: the first import torch in a worker process happened lazily
from inside concurrent/async code (embeddings.initialize() ->
sentence_transformers -> transformers -> torch, and cross_encoder's
ThreadPoolExecutor). torch/overrides.py's C-level _add_docstr is not
re-entrancy-safe, so under concurrency torch/overrides.py could execute
twice and raise, failing collection of every test on the shard.

Fix: import torch once at conftest import time (single-threaded, before any
event loop or thread pool), so the registration happens exactly once per
xdist worker. Guarded for slim/no-torch environments.
@nicoloboschi nicoloboschi merged commit 701de32 into main Jun 25, 2026
194 of 196 checks passed