Skip to content

[autorevert] Surface test identity at signal level for advisor JSON#8031

Draft
izaitsevfb wants to merge 1 commit into
mainfrom
iz/advisor-payload-test-failures
Draft

[autorevert] Surface test identity at signal level for advisor JSON#8031
izaitsevfb wants to merge 1 commit into
mainfrom
iz/advisor-payload-test-failures

Conversation

@izaitsevfb
Copy link
Copy Markdown
Contributor

@izaitsevfb izaitsevfb commented May 1, 2026

Summary

Surfaces authoritative TEST signal identity (test_file / test_classname / test_name) once at the top of the advisor signal_pattern payload, so the AI advisor has structured ground truth for which test failed without re-deriving from logs. Companion to the pytorch/pytorch prompt PR that adds a "Partial Visibility Warning" section.

Context

A 30d audit found 40 garbage verdicts on 23 distinct commits later auto-reverted, with 33/37 having a workflow_job for the suspect commit still in_progress at verdict time. 8/40 summaries explicitly cite "still in_progress / hanging" jobs as the rationale. Mechanism: when the advisor's prompt has no model of partial visibility AND it doesn't have authoritative test identity, it sees status: "pending" events and rationalizes pending-heavy partitions as flake → garbage.

A TEST signal already corresponds to a single specific test by construction — one signal per (workflow, job_base, test_id). Every FAILURE event in that signal IS that specific test failing. So per-event test identity would be redundant duplication of signal_key + status. The right shape is signal-level, emitted once.

Payload shape

{
  "signal_key": "test/foo.py::test_bar",
  "signal_source": "test",
  "test_file": "test/foo.py",
  "test_classname": "TestFooBar",
  "test_name": "test_bar",
  "workflow_name": "trunk",
  "job_base_name": "linux-jammy / test",
  "commit_order": "newest_first",
  "suspect_commit": "abc...",
  "commits": [ ... ]
}

test_file / test_classname / test_name are emitted only for TEST signals and only when populated, so the payload remains backward-compatible for consumers that don't read them yet.

classname is the marginal value-add: signal_key is file::name (the test_id collapses), so classname is the field that's truly missing today. test_file and test_name are derivable from signal_key but emitted explicitly for clarity.

Changes

  • signal.py: Signal gains optional test_file / test_classname / test_name attributes (default None).
  • signal_extraction.py: _build_test_signals captures the (file, classname, name) triple from any TestRow contributing to the signal (deterministic — each TEST signal is a single test by construction) and attaches them to the Signal at construction time.
  • signal_actions.py::_build_signal_pattern_json: emits the three fields as top-level payload keys when populated, only for signal_source == "test". No per-event change.
  • tests/test_signal_actions.py:
    • test_test_identity_surfaced_for_test_signals — top-level keys present, populated, no per-event leak.
    • test_test_identity_omitted_for_job_signals — JOB signals must omit the keys entirely.

Net diff: +176/-12 across 4 files.

Why not classification kg fields?

Forwarding torchci_classification_kg.{rule, line} was considered and dropped: when a job has multiple test failures, only the first match is captured in those fields, which is effectively random and would mislead the advisor. The signal extraction already pins the specific failing test via signal_key; surfacing the structured identity (test_file / test_classname / test_name) preserves that authoritative ground truth.

Companion change

pytorch/pytorch#182176 — adds a "Partial Visibility Warning" section to the advisor prompt, tightens the garbage definition to require concrete completed-log infra evidence, and adds an explicit signal_key-is-authoritative assertion that aligns with the top-level test identity exposed here.

Test plan

  • CI runs the existing + new unit tests under pytorch_auto_revert/tests/
  • Manual: after merge, observe test_file / test_classname / test_name keys at the top of TEST advisor signal_pattern JSONs (written to /tmp/advisor-patterns/ for debugging)
  • Followup: A/B re-run the 4 standout cases from the audit with the new payload + companion prompt change to confirm verdicts flip from garbage to unsure

v2 design note (force-pushed 2026-05-01): the original draft put a test_failures: [{...}] list per event, which redundantly duplicated signal_key + status. Reworked to a single signal-level identity per the design feedback that a TEST signal already corresponds to one specific test. The new shape is leaner and the JSON is roughly the same size (one set of fields at the top vs. one per event).

@pytorch-bot pytorch-bot Bot added the ci-no-td label May 1, 2026
@vercel
Copy link
Copy Markdown

vercel Bot commented May 1, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

1 Skipped Deployment
Project Deployment Actions Updated (UTC)
torchci Ignored Ignored Preview May 1, 2026 7:40pm

Request Review

@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label May 1, 2026
@izaitsevfb izaitsevfb force-pushed the iz/advisor-payload-test-failures branch from af181c4 to 4e41106 Compare May 1, 2026 19:28
@izaitsevfb izaitsevfb changed the title [autorevert] Surface per-test outcome rows in advisor signal_pattern JSON [autorevert] Surface test identity at signal level for advisor JSON May 1, 2026
The autorevert AI advisor's signal_pattern JSON does not surface
authoritative test identity (file/classname/name) for TEST signals.
The advisor knows the test from `signal_key` (which is `file::name`),
but `classname` is dropped by the test_id collapse — so when many tests
share the same file or name, the advisor has to re-derive structure
from logs. When the suspect commit's parent workflow_job is still
in_progress at advisor dispatch time, that re-derivation fails and the
advisor rationalizes pending-heavy partitions as flake → garbage. A
30d audit found 40 garbage verdicts on 23 distinct commits later
auto-reverted; 33/37 had a workflow_job for the suspect commit still
in_progress at verdict time.

A TEST signal corresponds to a SINGLE specific test by construction
(one signal per `(workflow, job_base, test_id)`). Every FAILURE event
in that signal IS that specific test failing — no per-event test
identity is needed (`signal_key` + `status` already convey it).

This change surfaces the structured test identity ONCE at the top of
the advisor signal_pattern payload:

    {
      "signal_key": "test/foo.py::test_bar",
      "signal_source": "test",
      "test_file": "test/foo.py",
      "test_classname": "TestFooBar",
      "test_name": "test_bar",
      ...
    }

Implementation:

- `Signal` gains optional `test_file`/`test_classname`/`test_name`
  attributes (default None).
- `_build_test_signals` captures the (file, classname, name) triple
  from any TestRow contributing to the signal (collapses cleanly
  because each TEST signal is a single test by definition) and
  attaches them to the Signal at construction time.
- `_build_signal_pattern_json` emits the three fields as top-level
  payload keys when populated, only for `signal_source == "test"`.
- New regression tests:
  - `test_test_identity_surfaced_for_test_signals` — top-level keys
    present; no per-event leak.
  - `test_test_identity_omitted_for_job_signals` — JOB signals omit
    the keys entirely.

Companion change: pytorch/pytorch prompt PR adding a "Partial
Visibility Warning" section, tightening the `garbage` definition to
require concrete completed-log infra evidence, and adding an
explicit `signal_key`-is-authoritative assertion that aligns with the
top-level test identity exposed here.
@izaitsevfb izaitsevfb force-pushed the iz/advisor-payload-test-failures branch from 4e41106 to 923d2fd Compare May 1, 2026 19:39
izaitsevfb pushed a commit that referenced this pull request May 1, 2026
…nstruction

The signal_extraction layer reconstructs Signal and SignalCommit
objects in several places — each rebuilds a new instance with all
metadata supposed to carry over but a different list of children.
This is brittle: every site duplicates the full kwarg list, and
adding a new field on Signal or SignalCommit silently drops it during
reconstruction unless every site is updated.

This PR introduces:

- `Signal.replace_commits(new_commits)` — used by
  `_dedup_signal_events`, `_inject_pending_workflow_events`,
  `_attach_advisor_verdicts`. Per-site code shrinks from 9-11 lines
  to 1.
- `SignalCommit.replace_events(new_events)` — used by
  `_dedup_signal_events`, `_inject_pending_workflow_events`. Per-site
  code shrinks from 3 lines to 1. **Latent-bug fix**: the prior
  reconstruction dropped `advisor_result`. Currently a no-op because
  `_attach_advisor_verdicts` runs last in `extract()`, but a future
  pipeline-order change would have silently lost data.

Each helper has an introspection-based test
(`test_replace_commits_propagates_all_fields`,
`test_replace_events_propagates_all_fields`) that enumerates every
__init__ parameter via `inspect.signature`, requires the test author
to pin a sentinel value for each, and asserts every field except the
replaced one round-trips. Adding a new field without updating the
helper makes the test fail loud-and-clear.

Audit also verified `SignalEvent` has no reconstruction sites in
production (always fresh-constructed from raw data), so no helper
needed there.

This PR is preparatory for #8031, which adds three new Signal fields
(test_file/test_classname/test_name). With these helpers in place,
that PR's per-site propagation tests become redundant — the
introspection tests here enforce the guarantee structurally.

No behavior change.
izaitsevfb pushed a commit that referenced this pull request May 1, 2026
…struction

dataclasses.replace-style helpers for the two non-leaf classes in the
Signal hierarchy. Routes EVERY reconstruction site through the helper
so adding a new field to either class never silently gets dropped.

Hierarchy audit:

- Signal — 3 reconstruction sites in production (`_dedup_signal_events`,
  `_inject_pending_workflow_events`, `_attach_advisor_verdicts` — all
  replace `commits`).
- SignalCommit — 3 reconstruction sites in production:
  - `_dedup_signal_events` and `_inject_pending_workflow_events` —
    replace `events`. Previously dropped `advisor_result`. Currently a
    no-op latent bug because `_attach_advisor_verdicts` runs last in
    `extract()`.
  - `_attach_advisor_verdicts` — replaces `advisor_result`. Previously
    written as a fresh `SignalCommit(...)` constructor with all four
    fields enumerated.
- SignalEvent — no reconstruction sites in production; always
  fresh-constructed from raw data. No helper needed.
- AIAdvisorResult — frozen dataclass, no reconstruction sites in
  production. `dataclasses.replace` works for free if ever needed.
- AutorevertPattern / RestartCommits / Ineligible — dataclasses,
  constructed once per outcome, no reconstruction sites.

Helper API: `obj.replace(**changes)` returns a new instance with
selected fields replaced and the rest preserved. Unknown kwargs raise
TypeError — catches typos and stale kwargs at runtime.

The introspection-based tests
`test_replace_with_no_changes_preserves_all_fields` enumerate every
__init__ parameter via `inspect.signature`, require the test author to
pin a sentinel value for each (otherwise a self-fail `_SENTINELS` test
points at the new field), and assert that `obj.replace()` (with no
changes) returns a copy where every field equals the original.
`test_replace_unknown_kwarg_raises` catches stale kwargs.

This PR is preparatory for #8031 (test identity at signal level for
advisor JSON), which adds three new Signal fields. With these helpers
in place, that PR's per-site propagation tests become redundant — the
introspection tests here enforce the guarantee structurally for any
future field on any reconstructed class.

No behavior change.
izaitsevfb pushed a commit that referenced this pull request May 1, 2026
…struction

dataclasses.replace-style helpers for the two non-leaf classes in the
Signal hierarchy. Routes EVERY reconstruction site through the helper
so adding a new field to either class never silently gets dropped.

Hierarchy audit:

- Signal — 3 reconstruction sites in production (`_dedup_signal_events`,
  `_inject_pending_workflow_events`, `_attach_advisor_verdicts` — all
  replace `commits`).
- SignalCommit — 3 reconstruction sites in production:
  - `_dedup_signal_events` and `_inject_pending_workflow_events` —
    replace `events`. Previously dropped `advisor_result`. Currently a
    no-op latent bug because `_attach_advisor_verdicts` runs last in
    `extract()`.
  - `_attach_advisor_verdicts` — replaces `advisor_result`. Previously
    written as a fresh `SignalCommit(...)` constructor with all four
    fields enumerated.
- SignalEvent — no reconstruction sites in production; always
  fresh-constructed from raw data. No helper needed.
- AIAdvisorResult — frozen dataclass, no reconstruction sites in
  production. `dataclasses.replace` works for free if ever needed.
- AutorevertPattern / RestartCommits / Ineligible — dataclasses,
  constructed once per outcome, no reconstruction sites.

Helper API: `obj.replace(**changes)` returns a new instance with
selected fields replaced and the rest preserved. Unknown kwargs raise
TypeError — catches typos and stale kwargs at runtime.

The introspection-based tests
`test_replace_with_no_changes_preserves_all_fields` enumerate every
__init__ parameter via `inspect.signature`, require the test author to
pin a sentinel value for each (otherwise a self-fail `_SENTINELS` test
points at the new field), and assert that `obj.replace()` (with no
changes) returns a copy where every field equals the original.
`test_replace_unknown_kwarg_raises` catches stale kwargs.

This PR is preparatory for #8031 (test identity at signal level for
advisor JSON), which adds three new Signal fields. With these helpers
in place, that PR's per-site propagation tests become redundant — the
introspection tests here enforce the guarantee structurally for any
future field on any reconstructed class.

No behavior change.
izaitsevfb added a commit that referenced this pull request May 10, 2026
…struction (#8032)

## Summary

Adds `dataclasses.replace`-style helpers for the two non-leaf classes in
the Signal hierarchy and routes **every reconstruction site** through
them. Eliminates the bug class where adding a new field on `Signal` or
`SignalCommit` silently gets dropped because it wasn't propagated in
some downstream reconstruction.

```python
new_signal = signal.replace(commits=new_commits)
new_commit = commit.replace(events=new_events)
new_commit = commit.replace(advisor_result=advisor_result)
```

The `replace(**changes)` API mirrors `dataclasses.replace` semantically:
it preserves every field by default and replaces only the fields named
in `changes`. Unknown kwargs raise `TypeError` so typos and stale kwargs
are caught at runtime.

## Hierarchy audit (per in-thread question)

| Class | Reconstruction sites in prod | Helper added? |
|---|---|---|
| `Signal` | 3 (`_dedup_signal_events`,
`_inject_pending_workflow_events`, `_attach_advisor_verdicts`) — all
replace `commits` | yes — `Signal.replace(...)` |
| `SignalCommit` | **3** (`_dedup_signal_events` +
`_inject_pending_workflow_events` replace `events`;
`_attach_advisor_verdicts` replaces `advisor_result`) | yes —
`SignalCommit.replace(...)` |
| `SignalEvent` | 0 — always fresh-constructed from
`JobRow`/`TestRow`/synthetic-pending data | n/a |
| `AIAdvisorResult` | 0 in prod — frozen dataclass,
`dataclasses.replace` works for free if ever needed | n/a |
| `AutorevertPattern`, `RestartCommits`, `Ineligible` | dataclasses,
single fresh construction per outcome | n/a |

### Latent-bug fix on SignalCommit

Two of the three SignalCommit reconstructions (`_dedup_signal_events`,
`_inject_pending_workflow_events`) were dropping `advisor_result`.
Currently a no-op — `_attach_advisor_verdicts` runs last in `extract()`,
so `advisor_result` is always `None` at dedup/inject time — but a future
pipeline-order change (or any new path that produces a `SignalCommit`
with `advisor_result` set and pipes it through dedup/inject) would have
silently lost the verdict. The helper structurally fixes this.

### Why earlier-this-PR design (`replace_commits` / `replace_events`)
was wrong

The previous iteration of this PR added two specific helpers per
replaceable field (`replace_commits` for Signal, `replace_events` for
SignalCommit). On review, the `_attach_advisor_verdicts` site was
identified as semantically-equivalent to `dataclasses.replace` too — it
replaces `advisor_result` while preserving everything else. A
field-specific helper would have required `replace_advisor_result`, then
a third helper if SignalCommit ever needed to replace `head_sha` or
`timestamp`, etc. The unified `replace(**changes)` API covers any
field-replacement pattern with one method per class.

## Tests

Each class has the same test shape:

- `test_sentinels_cover_all_init_fields` — `inspect.signature` over the
`__init__` parameters; the test fails if a new field isn't pinned in
`_SENTINELS`. Forces the test author to update the helper at the same
time.
- `test_replace_with_no_changes_preserves_all_fields` — calls
`obj.replace()` and asserts every field equals the original. Catches
dropped fields generically without enumerating each individually.
- `test_replace_swaps_a_field` — smoke test that `replace(field=...)`
actually swaps.
- `test_replace_unknown_kwarg_raises` — `replace(typo=42)` raises
`TypeError`. Catches stale kwargs.

All 46 tests in `test_signal.py` pass locally (under devserver Python —
modules requiring `boto3`/`PyGithub`/etc. are deferred to CI).

## Why option 2 (manual helper + introspection test) over option 1
(auto-replace via introspection inside the helper)

An auto-replace `replace(**changes)` could iterate
`inspect.signature(...).parameters` and copy each via `getattr(self,
name)` — zero maintenance for new fields. But that couples on "every
`__init__` param has a matching `self.<name>` attribute"; a future
refactor that diverges those names would break the helper silently
rather than at review time. Manual + introspection-test keeps the helper
as plain explicit code (six `changes.pop(...)` lines for Signal) and
makes the test fail loudly when a field is added without propagation.

## Diff

`+225/-48` across 3 files. No behavior change.

## Test plan

- [x] `pytorch_auto_revert.tests.test_signal` runs green locally (46
tests)
- [x] CI runs the full suite including the integration tests under
`test_signal_dedup`, `test_signal_actions`, etc.
- [x] Manual: confirm each introspection test self-fails when a
hypothetical new field is added without a sentinel
- [ ] After this lands and FF-merges, rebase #8031 on top — three
reconstruction sites use the helper, per-site propagation tests are
dropped

---------

Co-authored-by: Ivan Zaitsev <izaitsevfb@meta.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ci-no-td CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant