Skip to content

[autorevert] dispatch untargeted signals without tests-to-include filter#8037

Open
izaitsevfb wants to merge 1 commit into
mainfrom
iz/autorevert-untargeted-signal-fix
Open

[autorevert] dispatch untargeted signals without tests-to-include filter#8037
izaitsevfb wants to merge 1 commit into
mainfrom
iz/autorevert-untargeted-signal-fix

Conversation

@izaitsevfb
Copy link
Copy Markdown
Contributor

Summary

Two related fixes to the autorevert lambda's tests-to-include dispatch path. Both surface the same anti-pattern: a signal that cannot be narrowed to a run_test.py-recognized module should be dispatched as a "re-run the whole job(s)" restart, not as a TEST-narrow restart.

Bug 1 — empty-file rows in tests.all_test_runs produce bogus tests-to-include entries

TestRow.test_id (signal_extraction_types.py:138-141) builds the test key as f"{file}::{name}" if file else name. When the CH row has empty file, the test key falls back to just the bare name (no ::).

signal_extraction.py then derived test_module from the key:

test_module = test_id.split("::")[0].replace(".py", "")

With no ::, test_module becomes the bare method name (e.g. test_partial_eval_graph_conv). The lambda forwards that into workflow_dispatch.tests-to-include, and pytorch/pytorch:test/run_test.py:144 TestChoices.__contains__ rejects it with:

argument -i/--include: invalid choice: 'test_partial_eval_graph_conv'

The dispatched shard fails before running any tests, surfacing as 15-25 spurious test-osdc failures per affected commit on HUD trunk. Tracking: pytorch/pytorch-gha-infra#1137.

Verified primary-source: tests.all_test_runs for name='test_partial_eval_graph_conv' over the last 2 days returns three distinct file values: 'test_jit.py', 'test_jit_legacy.py', ''.

Concrete instance — pytorch/pytorch run 25294380825 (workflow_dispatch by pytorch-auto-revert[bot], 2026-05-03 23:51 UTC):

jobs-to-include: linux-jammy-py3.10-clang18
tests-to-include: test_jit test_freeze_module_detach_gradient test_partial_eval_graph_conv test_freeze_interface_swapping_two_methods test_partial_eval_stitching test_returning_input_symbolic_shapes

test_jit is the actual file name; the other five are method names that leaked through the empty-file fallback. The default / crossref / dynamo_wrapped shards on this run failed at argparse; the openreg / einops shards were green only because their test.sh paths bypass $INCLUDE_CLAUSE entirely.

Bug 2 — JOB+TEST signals on the same (workflow, commit) silently lose JOB intent

signal_actions.py::group_actions keys the restart map on (workflow_name, commit_sha). ALL contributing signals — JOB-track and TEST-track — are merged into a single ActionGroup with the union of jobs_to_include and tests_to_include. A JOB-track signal (test_module=None, intent = "re-run the whole job") coexisting with a TEST-track signal in the same group becomes a TEST-narrow restart: only the test modules contributed by the TEST signals run, and the JOB signal's full-job-restart intent is silently thrown away.

This was a latent bug independent of Bug 1 — surfaced while auditing the tests-to-include path.

Fix

  1. signal_extraction.py: when test_id lacks ::, set test_module=None. The signal stays alive but is marked untargeted.
  2. signal_actions.py::group_actions: if any source in the group has test_module is None, dispatch with tests_to_include=frozenset() for the whole group.

JOB-track signals (always test_module=None) and TEST-track signals with no extractable module both reach the new branch, so they get identical "re-run the whole job(s)" treatment instead of being narrowed by sibling targetable TEST signals. workflow_checker.py:165 already drops the tests-to-include input when the frozenset is empty — no change needed there.

Test plan

  • tests/test_signal_extraction.py (2 new): empty file → Signal with test_module=None; populated file → existing module path.
  • tests/test_signal_actions.py (4 new in TestGroupActionsTestsToInclude): only-targeted-tests keeps filter; TEST-with-no-module drops filter; mixed targeted+untargeted TEST drops filter; JOB+TEST same-(wf,sha) drops filter.
  • make test — all 146 tests pass (1 skipped due to no GITHUB_TOKEN).
  • ruff format --check + ruff check clean.

Audit follow-up (not in this PR)

Worth checking which test uploader writes empty-file rows to tests.all_test_runs — likely a partial-export issue from a specific reporter — to fix the data leak at source. Open question carried in iz2-memory:core/knowledge/autorevert-bot-retry.md.

When a TEST-track signal's CH row in `tests.all_test_runs` has empty
`file`, `TestRow.test_id` falls back to the bare `name` (no `::`), and
`signal_extraction.py` previously emitted that bare method name as
`test_module`. The lambda then forwarded it through `tests-to-include`
into `workflow_dispatch`, where `run_test.py`'s `--include` argparse
choices rejected it with `argument -i/--include: invalid choice` and
the dispatched shard failed before running anything (15-25 spurious
test-osdc failures per affected commit).

Coalescing in `group_actions` had a sibling problem: any group keyed
by `(workflow_name, commit_sha)` merges JOB-track and TEST-track
signals into a single restart with the union of `jobs_to_include`
and `tests_to_include`. A JOB-track signal (test_module always None,
intent = "re-run the whole job") coexisting with a TEST-track signal
silently became a TEST-narrow restart — the JOB signal's intent was
thrown away.

Fix:

1. `signal_extraction.py`: when `test_id` lacks `::`, set
   `test_module=None` (signal stays alive but is untargeted).
2. `signal_actions.py::group_actions`: if any source in the group
   has `test_module=None`, dispatch with `tests_to_include=frozenset()`
   for the whole group. JOB-track signals AND TEST-track signals with
   no extractable module both reach this branch, so they get the
   same job-style "re-run the whole job(s)" treatment instead of
   being narrowed by sibling targetable TEST signals.

Tests:

- `test_signal_extraction`: empty `file` produces a Signal with
  `test_module=None`; populated `file` still produces the module path.
- `test_signal_actions`: 4 cases covering only-targeted-tests,
  TEST-with-no-module-only, mixed targeted+untargeted TEST, and
  JOB+TEST same-(wf,sha).

Concrete instance: pytorch/pytorch run 25294380825 (workflow_dispatch
by pytorch-auto-revert[bot], 2026-05-03 23:51 UTC) dispatched with
`tests-to-include: test_jit test_freeze_module_detach_gradient
test_partial_eval_graph_conv test_freeze_interface_swapping_two_methods
test_partial_eval_stitching test_returning_input_symbolic_shapes`
— `test_jit` was the file name, the other five were method names
that leaked through the empty-`file` fallback. With this change,
that group would have dispatched with `tests-to-include` omitted
entirely.

Tracking: pytorch/pytorch-gha-infra#1137.
@pytorch-bot pytorch-bot Bot added the ci-no-td label May 4, 2026
@vercel
Copy link
Copy Markdown

vercel Bot commented May 4, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

1 Skipped Deployment
Project Deployment Actions Updated (UTC)
torchci Ignored Ignored May 4, 2026 10:15pm

Request Review

@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label May 4, 2026
Copy link
Copy Markdown
Contributor

@huydhn huydhn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ci-no-td CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants