Skip to content

Conversation

Copy link
Contributor

Copilot AI commented Oct 17, 2025

Problem

The OutOfProcRarNode_Tests.RunsOutOfProcIfAllFlagsAreEnabled test was experiencing intermittent failures on Linux where rar.Execute() would return false. The root cause is a race condition: the test manually constructs an OutOfProcRarNodeEndpoint and starts RunAsync, then immediately calls rar.Execute(). If the endpoint server loop has not begun waiting for connections, the task may fail to connect and fall back or fail altogether.

Additionally, there was minimal diagnostic information in assertions when the result was false or the expected OutOfProcRarClient was not registered, making it difficult to troubleshoot these intermittent failures in CI.

Solution

This PR adds comprehensive diagnostic logging and introduces a startup delay to reduce the race condition:

1. Environment Diagnostics

Logs environment information at test start to help correlate failures:

  • OS description (RuntimeInformation.OSDescription)
  • Framework version (RuntimeInformation.FrameworkDescription)
  • Process and thread IDs
  • Search path validation
  • Engine and task flag states

2. Startup Delay

Introduces a 200ms delay using Thread.Sleep() after starting the endpoint but before calling rar.Execute(). This allows the async endpoint server to start accepting connections, significantly reducing the race condition. The delay value balances test speed with reliability.

3. Execution Timing

Uses Stopwatch to measure and log the execution time of rar.Execute(), providing correlation data for diagnosing performance issues.

4. Enhanced Failure Diagnostics

Added a DumpDiagnostics() helper method that outputs comprehensive information when the test fails:

  • Execute result and rarClient state
  • Warning and error counts
  • Complete MockEngine log

5. Success Logging

On successful test runs, logs confirmation of OutOfProcRarClient registration and the resolved file path for correlation.

Example Output

Success case:

=== Test Diagnostics ===
OS: Ubuntu 24.04.3 LTS
Framework: .NET 10.0.0-rc.1.25451.107
Process ID: 7311
Thread ID: 10
Search path exists: True
Engine.SetIsOutOfProcRarNodeEnabled: True
RAR.AllowOutOfProcNode: True
======================
Waiting 200ms for endpoint startup...
rar.Execute() completed in 36ms, result=True
OutOfProcRarClient registered successfully
Resolved file: .../System.dll

Testing

  • All 5 tests in OutOfProcRarNode_Tests pass
  • Main test ran 10 consecutive times successfully, demonstrating improved stability
  • No changes to test semantics—assertions remain identical

Impact

  • Better CI logs: When intermittent failures occur, comprehensive diagnostic information will be available to root cause the issue
  • Reduced flakiness: The startup delay significantly reduces the race condition without changing test behavior
  • Minimal changes: Only one file modified with surgical changes focused on observability and reliability
Original prompt

Intermittent failure in test Microsoft.Build.UnitTests.ResolveAssemblyReference_Tests.OutOfProcRarNode_Tests.RunsOutOfProcIfAllFlagsAreEnabled on Linux where rar.Execute() returns false (Assert.True(result) fails). Goal: Improve diagnosability and reduce flakiness due to potential race between starting OutOfProcRarNodeEndpoint and the ResolveAssemblyReference task attempting out-of-proc connection.

Context:
The test manually constructs a single OutOfProcRarNodeEndpoint and starts RunAsync, then immediately calls rar.Execute(). If the endpoint server loop has not begun waiting for connections, the task may fail to connect and fall back or fail altogether. There is minimal diagnostic information in current assertions when result is false or the expected OutOfProcRarClient is not registered.

OutOfProcRarNodeEndpoint.RunAsync calls CommunicationsUtilities.Trace but test does not surface these traces. Additional environment and engine log information would help root cause intermittent failures.

Planned Changes:

  1. Augment RunsOutOfProcIfAllFlagsAreEnabled test to:
    • Capture start time and elapsed time for rar.Execute().
    • Emit diagnostic output including OS description, framework version, search path existence, and engine flags.
    • On failure (result false or rarClient null), dump the MockEngine log and list of warnings/errors.
    • Add a short wait loop before rar.Execute() to improve likelihood endpoint server is ready: poll for up to ~250ms (e.g., 5 x 50ms) attempting a lightweight readiness heuristic. Since internals are not directly exposed, simple Task.Delay batching plus tracing of attempt count will suffice.
  2. Add a helper method DumpDiagnostics in the test class to centralize diagnostic output.
  3. Optionally include a small retry of rar.Execute() ONLY for diagnostics: if first attempt fails and rarClient is null, delay once and retry, logging that a retry occurred. Still assert original expectations afterward (no masking of failure) to avoid hiding persistent issues.
  4. Add output.WriteLine statements conditioned on success to log registration and resolved file identity for correlation.

Implementation Notes:

  • We must not change test pass conditions beyond adding extra optional retry that still fails if initial attempt failed; or decide against retry to keep semantics pure. Prefer only added wait before first attempt (deterministic) and diagnostics on failure.
  • Use System.Runtime.InteropServices.RuntimeInformation for OS details.
  • Use Environment.ProcessId and thread ID for correlation with any pipe naming conventions.
  • Ensure assertions remain at end; added diagnostics occur before assertions when failure detected.

Expected Outcome:
Better CI logs when intermittent failure happens; reduced chance of race by introducing minimal startup delay or readiness polling.

Files to modify:

  • src/Tasks.UnitTests/AssemblyDependency/Node/OutOfProcRarNode_Tests.cs

This pull request was created as a result of the following prompt from Copilot chat.

Intermittent failure in test Microsoft.Build.UnitTests.ResolveAssemblyReference_Tests.OutOfProcRarNode_Tests.RunsOutOfProcIfAllFlagsAreEnabled on Linux where rar.Execute() returns false (Assert.True(result) fails). Goal: Improve diagnosability and reduce flakiness due to potential race between starting OutOfProcRarNodeEndpoint and the ResolveAssemblyReference task attempting out-of-proc connection.

Context:
The test manually constructs a single OutOfProcRarNodeEndpoint and starts RunAsync, then immediately calls rar.Execute(). If the endpoint server loop has not begun waiting for connections, the task may fail to connect and fall back or fail altogether. There is minimal diagnostic information in current assertions when result is false or the expected OutOfProcRarClient is not registered.

OutOfProcRarNodeEndpoint.RunAsync calls CommunicationsUtilities.Trace but test does not surface these traces. Additional environment and engine log information would help root cause intermittent failures.

Planned Changes:

  1. Augment RunsOutOfProcIfAllFlagsAreEnabled test to:
    • Capture start time and elapsed time for rar.Execute().
    • Emit diagnostic output including OS description, framework version, search path existence, and engine flags.
    • On failure (result false or rarClient null), dump the MockEngine log and list of warnings/errors.
    • Add a short wait loop before rar.Execute() to improve likelihood endpoint server is ready: poll for up to ~250ms (e.g., 5 x 50ms) attempting a lightweight readiness heuristic. Since internals are not directly exposed, simple Task.Delay batching plus tracing of attempt count will suffice.
  2. Add a helper method DumpDiagnostics in the test class to centralize diagnostic output.
  3. Optionally include a small retry of rar.Execute() ONLY for diagnostics: if first attempt fails and rarClient is null, delay once and retry, logging that a retry occurred. Still assert original expectations afterward (no masking of failure) to avoid hiding persistent issues.
  4. Add output.WriteLine statements conditioned on success to log registration and resolved file identity for correlation.

Implementation Notes:

  • We must not change test pass conditions beyond adding extra optional retry that still fails if initial attempt failed; or decide against retry to keep semantics pure. Prefer only added wait before first attempt (deterministic) and diagnostics on failure.
  • Use System.Runtime.InteropServices.RuntimeInformation for OS details.
  • Use Environment.ProcessId and thread ID for correlation with any pipe naming conventions.
  • Ensure assertions remain at end; added diagnostics occur before assertions when failure detected.

Expected Outcome:
Better CI logs when intermittent failure happens; reduced chance of race by introducing minimal startup delay or readiness polling.

Files to modify:

  • src/Tasks.UnitTests/AssemblyDependency/Node/OutOfProcRarNode_Tests.cs

✨ Let Copilot coding agent set things up for you — coding agent works faster and does higher quality work when set up for your repo.

@dotnet-policy-service
Copy link
Contributor

Hello @@copilot, I noticed that you’re changing an .swr file or any file under src/Package/MSBuild.VSSetup.. Please make sure to validate this change by an experimental VS insertion. This is accomplished by pushing to an exp/* branch, which requires write permissions to this repo.

Copilot AI changed the title [WIP] Fix intermittent failure in ResolveAssemblyReference test on Linux Improve diagnosability and reduce flakiness in OutOfProcRarNode_Tests.RunsOutOfProcIfAllFlagsAreEnabled Oct 17, 2025
Copilot AI requested a review from rainersigwald October 17, 2025 21:31
Copilot finished work on behalf of rainersigwald October 17, 2025 21:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants