Skip to content

fix(core): native watcher rewrite + daemon hardening for daemon-on e2e#35204

Merged
FrozenPandaz merged 87 commits into
masterfrom
investigate/watch-test-flaky
Apr 29, 2026
Merged

fix(core): native watcher rewrite + daemon hardening for daemon-on e2e#35204
FrozenPandaz merged 87 commits into
masterfrom
investigate/watch-test-flaky

Conversation

@FrozenPandaz

@FrozenPandaz FrozenPandaz commented Apr 8, 2026

Copy link
Copy Markdown
Contributor

Current Behavior

This branch started as a fix for a flaky e2e/nx/src/watch.test.ts. Once we flipped NX_DAEMON='true' in the e2e harness, several long-latent daemon issues surfaced; they're fixed in the same PR. The most user-visible regression — nx release version silently dropping projects from a fixed release group after git checkout — is the last item below and is what e2e/release/src/multiple-release-branches.test.ts exercises.

Native watcher. nx watch dropped and duplicated events on Linux/Windows:

  1. watchexec.config.pathset() registered new directories asynchronously — files written before the real inotify registration were lost.
  2. Every raw notify event fired a callback (no debounce) so one logical write produced multiple nx watch invocations.
  3. The daemon called notifyFileWatcherSockets twice per event (eagerly for updates/deletes, again post-recompute for creates), doubling invocations.
  4. getProjectsAndGlobalChanges looked changed files up in fileMap.projectFileMap, so brand-new files fell through to globalFiles and --projects=foo watchers never matched.
  5. nx:test-native didn't compile on master (walker test used nix::unistd::mkfifo behind a feature that wasn't enabled).
  6. macOS regressed after the notify migration: FsEventWatcher dropped immediately because the closure had no cfg-active references on darwin.

Daemon (surfaced once it ran end-to-end in CI):
7. getPluginsSeparated reused pendingPluginsPromise via ??= after the plugins-config hash changed, serving stale plugins forever after nx add.
8. The auto-recompute path updated the in-memory graph but never persisted it; subprocesses reading the disk cache saw stale data (flaky eslint dependency-checks).
9. startInBackground had an fd race: a concurrent reset() could null this._out/this._err while we awaited open(), crashing the spawn with Cannot read properties of null (reading 'fd').
10. The daemon's env-reflection was additive only — NX_* vars set by one CLI invocation persisted forever, leaking (e.g., NX_PREFER_NODE_STRIP_TYPES=true from a single nx report poisoned every later plugin load). Plugin worker's setWorkerEnv had the same bug.
11. Several e2e cache-hit assertions matched [local cache], but the daemon-on path emits [existing outputs match the cache, left as is] (output kept rather than restored).

Daemon graph cache (more flake hunting after the watcher rewrite landed):
12. resetInternalStateIfNxDepsMissing ran before every getCachedSerializedProjectGraphPromise and reset state whenever project-graph.json was missing AND a cached promise existed — but on cold start, every request before the first persist matches that condition. It was firing constantly, tearing down in-flight first computes and forcing a redundant recompute. Worse, its catch branch unconditionally reset on any fileExists error.
13. When resetInternalState fires from inside processCollectedUpdatedAndDeletedFiles's catch (e.g., retrieveWorkspaceFiles throws on partial disk state), cachedSerializedProjectGraphPromise is set to undefined. A concurrent stale compute that hits chainToLatest after the reset would read back undefined; the if (stale) return stale guard treats undefined as falsy, so the stale compute falls through and commits its (now-stale) data to module state.

Force-flush concurrency (flagged by Graphite review):
14. The processing thread dropped any extra ForceFlush messages it found while draining the channel (the ProcMsg::ForceFlush(_) => { /* drop the extras */ } arm). Their reply senders were silently discarded; those callers waited the full 500 ms recv_timeout and returned an empty Vec. Concurrent CLI requests would each see that latency hit.

Watcher event misclassification (the nx release version failure on multiple-release-branches.test.ts):
15. git checkout does unlink + write on some tracked files. inotify fires IN_DELETE then IN_CREATE for the same path; both landed in the watcher's accumulator within one burst.
16. merge_event's priority rule was Delete > Create > Update — meaning a Create arriving after a Delete for the same path was silently dropped. The accumulator emitted only the Delete.
17. Downstream updateFilesInContext on the JS side then removed the (still-existing) file from the Rust workspace context's index. multiGlobWithWorkspaceContext no longer returned it. Plugins didn't process it. The project was missing from the resulting graph, with no error.
18. release version then evaluated isProjectPublic, which checks for the project's package.json in the file map. With the file gone from the index, the check returned false, the project was excluded as not-public, and the release group ran with one fewer project than expected.
19. Two related classification bugs surfaced during the diagnosis: (a) the early-return for !meta_exists emitted Delete unconditionally, even for Modify/Create events whose stat happened to fail during a transient atomic-rename window; (b) notify-rs delivers a third coalesced rename event (Modify(Name(Both))) that fell through to the generic Modify(_) → Update arm, overriding the per-side From Delete on the source path of an fs::rename.

Maven (separate but bundled):
20. Tests asserted on BUILD SUCCESS in batch mode, but the resident batch runner's BatchExecutionListener.sessionEnded is a no-op — Maven's footer is replaced by Nx Maven Summary. Those assertions could never match.

Solution

Native watcher

Architecture

Before: Watcher → watchexec (async config loop) → notify → inotify/FSEvents
After:  Watcher → notify → inotify/FSEvents (direct, synchronous)

The watcher is one Rust struct (packages/nx/src/native/watch/watcher.rs) that owns:

  • a notify::RecommendedWatcher (the OS-level subscription),
  • a single mpsc::Sender<ProcMsg> shared by the notify event handler and the napi force-flush method,
  • a long-running processing thread that reads from the corresponding Receiver.

How events are streamed

notify (OS) ─► NotifyForwarder ─┐
                                ├─► mpsc<ProcMsg> ─► processing thread ─► napi tsfn ─► JS callback
JS force_flush_pending() ───────┘                                       (or sync_channel reply)
  1. Notify side. notify::recommended_watcher(NotifyForwarder { tx }) hands every fs event to NotifyForwarder::handle_event, which wraps it as ProcMsg::Notify(event) and pushes it onto the channel. No work is done on the notify thread beyond the send.

  2. Processing thread. A dedicated thread runs an infinite loop with three branches:

    • ProcMsg::Notify(event) — filter through gitignore/nxignore (watch_filterer.rs), run the new-directory fast path on Linux/Windows if the event is a Create(Folder), transform notify's kinds into our EventType (Create/Update/Delete), and merge each resulting WatchEventInternal into a per-path HashMap accumulator. The merge rules are:
      • (_, Delete) → Delete (file is gone, final state wins)
      • (_, Create) → Create (file is in its initial state from our perspective; overrides earlier Update or Delete — the regression case 15-18 above)
      • (Delete, Update) → Update (file came back with new content)
      • (Create, Update) / (Update, Update) → keep existing (Create wins on Linux's IN_CREATE+IN_MODIFY pair)
    • ProcMsg::ForceFlush(reply) — drain whatever notify has buffered (while let Ok(msg) = rx.try_recv() so anything in flight at the moment of the request is included), collect every queued ForceFlush reply channel encountered during the drain, snapshot the accumulator, and send the same snapshot to all collected callers (closes refactor(utils):remove projectName from the arguments #14 above). Used by the daemon to absorb pending events before serving a cached project graph (closes the IDLE_WINDOW race where a 99 ms-old event would otherwise miss the next read).
    • Err(RecvTimeoutError::Timeout) — the wake-up deadline elapsed; emit the accumulator to JS via the napi ThreadsafeFunction (callback_tsfn.call(Ok(events), NonBlocking)), clear it, and reset.
  3. Trailing-edge debounce. Each Notify ingest updates a single flush_deadline = min(now + IDLE_WINDOW, burst_start + MAX_WAIT):

    • IDLE_WINDOW = 100 ms — flush when the channel goes quiet for this long. Resets on every arriving event, so any burst with gaps <100 ms coalesces into one flush.
    • MAX_WAIT = 500 ms — starvation cap from the first event of the burst.
    • The loop's rx.recv_timeout(wait) either picks up the next message or returns Timeout exactly at flush_deadline. No separate timers, no cross-flush state, no recent_paths book-keeping.
    • When idle (flush_deadline = None) the loop polls on SHUTDOWN_POLL so stop_flag is observed promptly.
  4. napi tsfn handoff. The processing thread invokes callback_tsfn.call(Ok(Vec<WatchEvent>), NonBlocking) to schedule the JS callback on Node's main thread. The notify thread itself never crosses into JS — only the processing thread does, and only at flush time. Watcher::watch is now a thin wrapper around an internal watch_inner that takes a generic callback, so unit tests can exercise the same loop without a JS runtime.

  5. force_flush_pending (synchronous drain). The daemon calls watcher.force_flush_pending() from JS before reading the cached project graph. JS-side napi method creates a sync_channel::<Vec<WatchEvent>>(1) reply pair and sends ProcMsg::ForceFlush(reply_tx) on the same channel notify events flow through. The processing thread sees the request in order with any buffered notify events, drains, takes the accumulator, and replies. Concurrent callers all receive the same snapshot (closes refactor(utils):remove projectName from the arguments #14).

Event classification (types.rs::transform_event_to_watch_events)

  • A failed stat (!meta_exists(metadata)) only short-circuits to EventType::Delete when the notify event_kind is actually Remove(_). For Modify/Create events whose stat happened to fail during a transient atomic-rename window, we fall through to the platform-specific path which derives the type from event_kind alone (closes #19a).
  • Modify(Name(RenameMode::Both | Any)) — the coalesced rename event notify-rs delivers in addition to per-side From/To events — is now skipped, so it can't override the From Delete on the source path of a rename (closes #19b).

New-directory fast path (Linux/Windows)

When notify reports a Create(Folder) on inotify/ReadDirectoryChangesW (which only watch the dirs they were given), the processing thread:

  1. Calls watcher.watch(new_dir, NonRecursive) synchronously — the OS watch is active the moment this returns.
  2. Re-walks the new directory and merges every existing entry into the accumulator with EventType::Create.
  3. Recursively registers any subdirectories it found (so a mkdir -p a/b/c/d still hooks every level).

Because (1) is synchronous, files written between mkdir and the watch() call are caught by the re-walk in (2). Notify events for those same files arriving later just merge into the same accumulator entry — no duplication. macOS doesn't need this path: FsEventWatcher is recursive.

notify_watcher is held as Option<Arc<Mutex<RecommendedWatcher>>> on the struct so the processing-thread closure can clone the Arc; otherwise the macOS move-closure (which has no cfg-active references) wouldn't capture the watcher and FsEventWatcher would drop the moment watch() returned.

Daemon

  • Plugin reload on config change. getPluginsSeparated clears pendingPluginsPromise and tears down workers when nx.json#plugins's hash changes.
  • Persist project graph after auto-recompute. New persistProjectGraphToDisk helper called from kickOffRecompute and getCachedSerializedProjectGraphPromise.
  • fd race fix. Open log handles into locals first, assign to this._out/this._err after both opens resolve, pass the local fds to spawn.
  • Two-way NX_* env reflection. New applyDaemonEnvFromClient(newEnv) helper writes new keys AND deletes any NX_-prefixed key the daemon has that the client doesn't (skipping daemon-side exclusions and required settings).
  • scheduleTimeoutId → in-flight promise for self-documenting recompute scheduling.
  • Single notifyFileWatcherSockets registration at daemon startup; one notification per batch.
  • Map new files by project root. getProjectsAndGlobalChanges uses createProjectRootMappings + findProjectForPath, cached by reference identity on currentProjectGraph.
  • Cache invalidation gated on first persist (closes Convert-to-workspace Schematic doesn't update protractor.conf #12). New cacheHasBeenPersisted flag set in persistProjectGraphToDisk. resetInternalStateIfNxDepsMissing returns early if the flag is false — before the first successful write, a missing project-graph.json is the expected state. Its catch branch no longer auto-resets on transient stat errors.
  • chainToLatest defensive kickOff (closes Workspace schematic should include code formatting #13). When a stale compute hits chainToLatest and finds cachedSerializedProjectGraphPromise === undefined (because resetInternalState ran), it now kicks off a successor synchronously and returns that promise instead of undefined. Prevents the "stale compute commits stale data" path that the falsy guard let through.

Test suite

  • NX_DAEMON='true' is the default in e2e/utils/command-utils.ts so CI exercises the daemon path.
  • Cache-hit assertions updated to [existing outputs match the cache, left as is] for the daemon-on no-op restore path. One assertion in cache.test.ts:216 reverted to [local cache] because that test adds an extra file to dist/, invalidating the outputs hash and forcing a real restore.
  • Maven batch tests assert on Successfully ran target X instead of BUILD SUCCESS. verbose: true dropped where it only added Maven -X debug noise.
  • Watch e2e harness uses tree-kill and waits for close before reading output. Default wait shortened from 6s/8s → 1s/2s now that debounce is deterministic.
  • Walker test uses std::os::unix::net::UnixListener::bind instead of nix::unistd::mkfifo — same is_hashable_file rejection, no nix feature flag needed.
  • 10 Rust watcher tests (packages/nx/src/native/watch/watcher.rs) drive Watcher::watch_inner end-to-end with real fs ops on a tempdir — inotify/FSEvents → notify-rs → ProcMsg channel → EventIngestor → accumulator → callback. Cover: git-style unlink+write, vim-style atomic rename, plain in-place update, fresh create, rm, create+rm, cross-name rename, multi-file burst coalescing, hardcoded-ignored paths never reach callback, and 8-way concurrent force_flush_pending (regression for refactor(utils):remove projectName from the arguments #14). Tests use canonicalized tempdir paths so the synthetic-gitignore filter scopes correctly on macOS where /tmp symlinks to /private/tmp.

Testing

  • pnpm nx run e2e-nx:e2e-ci--src/watch.test.ts --skip-nx-cache — 8/8 passed.
  • pnpm nx run e2e-js:e2e-ci--src/js-strip-types.test.ts — 3/3 (was failing on test 3 before the env-reflection fix).
  • pnpm nx run e2e-maven:e2e-ci--src/maven-batch.test.ts — 3/3 (was failing on test 1 + 3 before the assertion fix).
  • pnpm nx run e2e-release:e2e-ci--src/multiple-release-branches.test.ts — 2/2 (was failing both before the merge-event fix; reproduced locally and confirmed via the daemon log dump that git checkout was emitting a stray Delete for one of the package.json files).
  • pnpm nx run nx:test-native — 349 tests pass, including the 10 new watcher tests on Linux and macOS.
  • macOS path validated end-to-end with the Arc fix.

Related Issue(s)

N/A — flaky-test investigation that grew into a daemon-stability hardening pass.

@netlify

netlify Bot commented Apr 8, 2026

Copy link
Copy Markdown

Deploy Preview for nx-docs ready!

Name Link
🔨 Latest commit e5f1459
🔍 Latest deploy log https://app.netlify.com/projects/nx-docs/deploys/69f168787c46280008beb146
😎 Deploy Preview https://deploy-preview-35204--nx-docs.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify project configuration.

@netlify

netlify Bot commented Apr 8, 2026

Copy link
Copy Markdown

Deploy Preview for nx-dev ready!

Name Link
🔨 Latest commit e5f1459
🔍 Latest deploy log https://app.netlify.com/projects/nx-dev/deploys/69f1687805eaf500087d9d0a
😎 Deploy Preview https://deploy-preview-35204--nx-dev.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify project configuration.

@nx-cloud

nx-cloud Bot commented Apr 8, 2026

Copy link
Copy Markdown
Contributor

View your CI Pipeline Execution ↗ for commit e5f1459

Command Status Duration Result
nx affected --targets=lint,test,build,e2e,e2e-c... ✅ Succeeded 42m 16s View ↗
nx run-many -t check-imports check-lock-files c... ✅ Succeeded 3s View ↗
nx-cloud record -- pnpm nx-cloud conformance:check ✅ Succeeded 16s View ↗
nx build workspace-plugin ✅ Succeeded <1s View ↗
nx-cloud record -- nx sync:check ✅ Succeeded 23s View ↗
nx-cloud record -- nx format:check ✅ Succeeded 7s View ↗

☁️ Nx Cloud last updated this comment at 2026-04-29 02:55:54 UTC

nx-cloud[bot]

This comment was marked as outdated.

@FrozenPandaz FrozenPandaz force-pushed the investigate/watch-test-flaky branch 3 times, most recently from bc19187 to de9f51e Compare April 9, 2026 03:10
@FrozenPandaz FrozenPandaz force-pushed the investigate/watch-test-flaky branch from dbd39d2 to 2106fcc Compare April 20, 2026 19:12
nx-cloud[bot]

This comment was marked as outdated.

@FrozenPandaz FrozenPandaz force-pushed the investigate/watch-test-flaky branch from 2106fcc to 1364b2b Compare April 20, 2026 21:03
nx-cloud[bot]

This comment was marked as outdated.

@FrozenPandaz FrozenPandaz force-pushed the investigate/watch-test-flaky branch from b6fe6fe to 1346c1a Compare April 22, 2026 21:08
nx-cloud[bot]

This comment was marked as outdated.

nx-cloud[bot]

This comment was marked as outdated.

nx-cloud[bot]

This comment was marked as outdated.

nx-cloud[bot]

This comment was marked as outdated.

@FrozenPandaz FrozenPandaz force-pushed the investigate/watch-test-flaky branch from 4d1c720 to 142e8dd Compare April 24, 2026 18:46
nx-cloud[bot]

This comment was marked as outdated.

@FrozenPandaz FrozenPandaz force-pushed the investigate/watch-test-flaky branch from 142e8dd to fe18068 Compare April 24, 2026 20:59
nx-cloud[bot]

This comment was marked as outdated.

nx-cloud[bot]

This comment was marked as outdated.

nx-cloud[bot]

This comment was marked as outdated.

nx-cloud[bot]

This comment was marked as outdated.

nx-cloud[bot]

This comment was marked as outdated.

nx-cloud[bot]

This comment was marked as outdated.

nx-cloud[bot]

This comment was marked as outdated.

nx-cloud[bot]

This comment was marked as outdated.

nx-cloud[bot]

This comment was marked as outdated.

nx-cloud[bot]

This comment was marked as outdated.

@FrozenPandaz FrozenPandaz force-pushed the investigate/watch-test-flaky branch from fb4170a to 9693e88 Compare April 26, 2026 05:36
When resetInternalState cleared cachedSerializedProjectGraphPromise,
an in-flight stale compute that hit chainToLatest read it back as
undefined. The 'if (stale) return stale' guard treats undefined as
falsy, so the compute fell through and committed stale/partial data
to module state.

Kick off a successor compute inside resetInternalState (and as a
defensive fallback inside chainToLatest) so a real promise is always
available for in-flight stale computes to chain onto. Also adds
[CI-DEBUG] logs around resetInternalStateIfNxDepsMissing so we can
see which path triggered a reset in CI logs.
…persist

resetInternalStateIfNxDepsMissing was triggering on every cold-start
race: while the first compute was still mid-flight,
project-graph.json didn't exist yet AND cachedPromise was set, so the
condition matched. Reset would tear down the in-flight compute and
force a redundant recompute — wasteful, and could leave callers
awaiting a torn-down promise mid-write.

Gate the reset on a cacheHasBeenPersisted flag set only after a
successful persistProjectGraphToDisk. Now 'project-graph.json missing'
only counts as an external wipe (rather than 'we haven't written
yet'). Also stop unconditionally resetting on stat errors — a
transient FS hiccup shouldn't nuke daemon state.
The reset is called either from a failing compute's catch (which is
already returning errorResult) or from a confirmed external wipe.
In both cases, the next CLI request or watcher event will trigger a
fresh kickOffRecompute organically. The chainToLatest defensive path
already handles the narrow case of a concurrent stale compute hitting
an undefined cache mid-reset.
… Remove kind

Two related Rust-side fixes for silent file drops during 'git checkout':

1. merge_event no longer prioritizes Delete. The accumulator now keeps
   only the most recent observation per path. Previously 'Delete always
   wins' meant an unlink+create sequence (what git uses for some
   tracked-file updates) ended up in the accumulator as Delete only —
   the workspace context's index then removed the (still-existing) file,
   and downstream isProjectPublic excluded the project from release
   because its package.json was missing from the file map.

2. transform_event_to_watch_events only short-circuits to Delete on a
   stat failure when the notify event_kind is actually Remove. Other
   kinds (Modify, Create, rename-To) can transiently fail their stat
   during atomic-rename windows; falling through to the platform path
   classifies them correctly from event_kind alone. The macOS fallback
   for Err metadata changed from Delete to Update for the same reason.

Adds 13 tests to packages/nx/src/native/watch/watcher.rs:
- 10 unit tests for merge_event covering every transition.
- 3 real-fs integration tests (tempdir + notify) verifying that
  unlink+create, plain update, and rm produce the expected accumulator
  state end-to-end.
Refactors Watcher::watch into a thin napi wrapper around watch_inner,
which takes a generic Box<dyn Fn> callback instead of a napi
ThreadsafeFunction. Tests can now exercise the full pipeline (inotify
-> notify-rs -> ProcMsg channel -> EventIngestor -> accumulator ->
idle-window flush -> callback) without a JS runtime.

Replaces the earlier mix of pure-rust merge_event unit tests and a
parallel notify-rs setup with five end-to-end tests that perform real
fs operations through tempfile and assert on what the watcher's own
callback emits:

- unlink_then_create_does_not_emit_delete (the regression)
- plain_update_does_not_emit_delete (sanity)
- rm_yields_delete (sanity)
- create_then_rm_yields_delete (sanity)
- fresh_create_does_not_emit_delete (sanity)

Each test waits one IDLE_WINDOW + a small margin after its fs ops
before reading the captured events, so timing is bounded by the real
inotify -> channel hop.
Previously the processing thread dropped any extra ForceFlush messages
it found while draining the channel, leaving their reply senders to be
discarded. Concurrent callers waited the full 500ms recv_timeout in
force_flush_pending() and returned an empty Vec — under load (multiple
daemon clients) some flushes silently saw stale results.

Collect every queued reply channel, snapshot the accumulator once, and
send the same snapshot to all of them. Reset the accumulator only when
at least one reply was delivered (matching the prior 'if every caller
gave up, retain events for the next flush' contract).

Also strips the [CI-DEBUG] instrumentation we added while diagnosing
the watcher regression — the daemon log dump in the e2e test, the
per-event log in dispatchWorkspaceChanges, and ~20 [CI-DEBUG] lines in
project-graph-incremental-recomputation.ts. The actual fixes from that
investigation (cacheHasBeenPersisted gate, chainToLatest defensive
kickOff, last-wins merge_event, EventKind::Remove gate in
transform_event_to_watch_events) stay.

Adds a regression test that fires 8 concurrent force_flush_pending
calls and asserts the whole batch finishes well under the 500ms
timeout — proving every caller got a reply.
The 'last event wins' rule made fresh writes classify as Update on
Linux because fs::write fires both IN_CREATE and IN_MODIFY in close
succession; with last-wins, the Update overrode the Create and we lost
the more informative classification.

The Create-over-Update precedence existed in the original logic; the
bug was missing rules for transitions involving Delete. The new table
keeps the original Create-over-Update rule and adds:
  (Delete, Create) → Create  (file came back, e.g. git unlink+create)
  (Delete, Update) → Update  (file came back with new content)

Adds tests for the two atomic-update patterns we care about:
- git_style_unlink_then_write_yields_create (the original regression)
- vim_style_atomic_rename_yields_create (write to .swp + rename over)
- fresh_create_yields_create (tightened from 'not Delete' to 'Create')

All 7 watcher tests now drive the public Watcher API end-to-end with
real fs ops on a tempdir.
…assification

The notify-rs inotify backend fires three events for a single fs::rename:
  - Modify(Name(From)) for the source path
  - Modify(Name(To)) for the destination path
  - Modify(Name(Both)) coalesced event with both paths

The third event was falling through to the generic Modify(_) → Update
arm in create_watch_event_internal, overriding the source path's Delete
with an Update. Skip RenameMode::Both and RenameMode::Any explicitly —
the From/To pair already classifies correctly.

Restore the macOS path's Err(_) → EventType::delete classification.
FSEvents on macOS doesn't reliably surface EventKind::Remove(_) for
file removals, so the cross-platform early-return that catches Linux
removals can miss them; falling back to Delete on stat failure matches
the pre-fix behavior so real removals classify correctly there.

Adds three new fs-event tests through the public Watcher API:
  - multi_file_burst_all_appear_in_one_flush
  - rename_yields_delete_for_old_and_create_for_new
  - hardcoded_ignored_paths_never_reach_callback

The strict-type fs-classification tests are gated to non-macOS because
FSEvents has different semantics — it coalesces events and uses
st_mtime vs st_birthtime to differentiate Create from Update, so the
exact classification diverges from the inotify-derived one even when
the 'file exists vs gone' answer agrees. The platform-agnostic
concurrent_force_flush and hardcoded_ignored_paths tests still run on
all platforms.

Adds a Drop impl on Watcher that sets stop_flag so the processing
thread winds down within ~1s of the watcher going out of scope rather
than living until the host process exits — useful for tests that
spawn many short-lived watchers.
On macOS `/tmp` is a symlink to `/private/tmp`, so `tempdir()` hands
back `/tmp/.tmpXXX` but FSEvents (and notify-rs's path
canonicalization) deliver events as `/private/tmp/.tmpXXX/...`. The
WatchFilterer scopes the synthetic gitignore (built from the hardcoded
ignore patterns) by `path.starts_with(origin)`. With the symlinked
origin this check fails — events under `node_modules` etc. slipped
past the filter and reached the callback, breaking
hardcoded_ignored_paths_never_reach_callback on macOS.

Canonicalize the test's origin so both sides agree. On Linux this is
a no-op since tempdir paths are already canonical.

Removes the cfg(not(target_os = "macos")) gates on the strict-type
fs-classification tests now that they pass on macOS too.
@FrozenPandaz FrozenPandaz force-pushed the investigate/watch-test-flaky branch from ff1bcf9 to 8108cd8 Compare April 28, 2026 23:07
Refactors the watcher's threading model so the flush loop owns
everything it needs and the Watcher struct only holds the bits the
outside world talks to.

Before, the calling thread created the filterer and notify watcher,
wrapped the watcher in Arc<Mutex<>> so the spawned thread could
register new directory watches, then handed both to the thread. The
struct kept a notify_watcher field purely to keep the FSEvents channel
sender alive on macOS (a workaround for the closure not capturing the
watcher when no cfg-active references referenced it on darwin).

Now the flush loop creates the filterer and notify watcher locally on
its own stack, registers watch paths, signals readiness via a oneshot
channel, and then runs a crossbeam_channel::select! loop:

  - notify_rx (internal to the loop) carries notify events.
  - force_flush_rx (struct holds the tx) carries force-flush requests.
  - default(deadline) drives the IDLE_WINDOW + MAX_WAIT debounce.

Shutdown is now implicit. When the Watcher struct drops or stop() is
called, the only external sender (the force-flush tx) drops; the
loop's force_flush_rx select arm reports Disconnected and the loop
exits at most one debounce window later. The notify watcher drops
with the loop's stack, releasing the OS subscription. No more
stop_flag, no more SHUTDOWN_POLL, no more Drop impl, no more
Arc<Mutex<>>, no more notify_watcher field on the struct.

The struct collapses to:
  - origin / additional_globs / use_ignore (config the loop reads at
    startup)
  - force_flush_tx: Mutex<Option<Sender<...>>> (interior mutability so
    stop() can drop the sender via &self)

Other touch-ups:
- ProcMsg enum gone — separate channels carry separate concerns.
- EventIngestor no longer holds the watcher; the Linux/Windows
  new-directory backfill takes &mut watcher as a parameter on ingest.
- register_watches/register_and_backfill_new_dirs take &mut watcher
  directly; no more Mutex locking.
- Imports collapsed into multi-item use statements.

All 10 watcher tests still pass through the public Watcher API.
- Drop EventIngestor struct: it was a thin wrapper around references that
  already live in run_flush_loop's scope. Replace with a free function
  ingest_event taking the references as parameters.
- Use crate::native::logger::enable_logger() instead of inlining
  tracing_subscriber setup.
- Trim verbose comments. Each remaining one explains a non-obvious WHY:
  the merge_event rules, the force_flush_tx Mutex<Option> rationale,
  the ready-signal handshake, the run_flush_loop ownership story.
  Narrative comments that re-stated visible code structure are gone.
…dler

notify-rs blanket-implements EventHandler for FnMut, so a one-line
closure does the same job as the NotifyForwarder struct + its
EventHandler impl. The struct was just a wrapper around 'send the
event into a channel', which is the closure body now.

(notify-rs also has a built-in EventHandler impl for
crossbeam_channel::Sender, but it's gated behind the 'crossbeam-channel'
feature which we don't enable.)
WatchPipeline::new takes ownership of session setup (filterer, notify
watcher, registered paths) on the calling thread. If anything fails
it returns Result<Self, String>, so watch_inner can surface startup
errors directly via its Result return — no more inline 'match { Err(e)
=> ready_tx.send; return; }' patterns and no ready-channel handshake
needed. The pipeline is then moved into the spawned thread that runs
run_flush_loop.

run_flush_loop is now just the select-loop body + accumulator state.
It destructures the pipeline at entry so the loop body reads with
plain locals (filterer, notify_rx, etc.) rather than session.field.
…ction

The pipeline is the thing that runs. `pipeline.run(...)` reads more
naturally than a free function taking the pipeline as its first
parameter. Behavior unchanged.
…via self

The destructure was rebinding self.field to bare locals so the loop
body could read 'filterer' instead of 'self.filterer'. Drop it and
access fields directly through 'mut self' — the borrow checker handles
disjoint-field borrows, so &mut self.watcher coexists with &self.filterer
in the same call. Same behavior, one fewer let binding to read past.
ingest_event was a free function with 9 parameters, 5 of which were
WatchPipeline fields. Making it a &mut self method drops those 5
parameters from every call site — the loop body's two ingest calls
shrink from 8 lines each to 5.

The crossbeam_channel::select! recv arm releases its receiver borrow
before the match arm body runs, so calling self.ingest_event(...)
inside the Ok arm doesn't conflict with recv(self.notify_rx).
register_watches now propagates notify::ErrorKind::MaxFilesWatch (the
notify-rs name for inotify ENOSPC). At startup the error bubbles
through WatchPipeline::new's Result; at runtime ingest_event surfaces
it as a callback Err with a message containing 'inotify_add_watch' so
the daemon-side fallback at project-graph.ts:432 fires (it falls back
to non-daemon mode when this substring appears in the error).

Other watch errors keep their warn-level log: a single bad path
shouldn't disable the entire watcher.

Also moves new_directories_from_event and register_and_backfill_new_dirs
to be methods on WatchPipeline so they no longer thread filterer /
ignore_globs / origin / watcher through their parameter lists.
- accumulator, burst_start, flush_deadline are now fields on WatchPipeline
  rather than locals in run(). merge_event, snapshot_events, reset_burst,
  ingest_event become &mut self methods that no longer thread the
  accumulator etc. through their parameter lists.
- new_directories_from_event and register_and_backfill_new_dirs also
  become methods so they don't pass filterer/origin/ignore_globs/watcher
  through every call.
- Path import hoisted to the top of the file (was cfg-gated to non-macos).
- Unused tracing::trace import removed.
- Docstrings and inline comments trimmed across the file: keep the WHY
  notes for the merge_event rules and the inotify_add_watch error string,
  drop narrative that re-stated visible code structure.
Four logs visible at NX_NATIVE_LOGGING=debug:
  - 'watching started' (Watcher::watch_inner)
  - 'watching stopped' (Watcher::stop)
  - 'registering watches for new directories' (mid-flight new-dir
    backfill on Linux/Windows)
  - 'idle-window emitting events' / 'force-flush emitting events'
    (each callback invocation, with event count)
Strip the watcher-origin prefix at construction time so the internal
event struct holds an already-relative path, instead of carrying the
full origin string on every event for later stripping at the JS
boundary. Public WatchEvent shape is unchanged.
@github-actions

Copy link
Copy Markdown
Contributor

🐳 We have a release for that!

This PR has a release associated with it. You can try it out using this command:

npx create-nx-workspace@23.0.0-pr.35204.e5f1459 my-workspace

Or just copy this version and use it in your own command:

23.0.0-pr.35204.e5f1459
Release details 📑
Published version 23.0.0-pr.35204.e5f1459
Triggered by @FrozenPandaz
Branch investigate/watch-test-flaky
Commit e5f1459
Workflow run 25091711380

To request a new release for this pull request, mention someone from the Nx team or the @nrwl/nx-pipelines-reviewers.

@FrozenPandaz FrozenPandaz merged commit 5323ed4 into master Apr 29, 2026
24 checks passed
@FrozenPandaz FrozenPandaz deleted the investigate/watch-test-flaky branch April 29, 2026 16:05
@github-actions

github-actions Bot commented May 5, 2026

Copy link
Copy Markdown
Contributor

This pull request has already been merged/closed. If you experience issues related to these changes, please open a new issue referencing this pull request.

@github-actions github-actions Bot locked as resolved and limited conversation to collaborators May 5, 2026
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Workspace schematic should include code formatting Convert-to-workspace Schematic doesn't update protractor.conf

2 participants