Skip to content

Fix the unstable unit test case: ResourceBoundedExecutorSuite#13906

Closed
sperlingxx wants to merge 2 commits into
NVIDIA:mainfrom
sperlingxx:quick_fix_unstable_case
Closed

Fix the unstable unit test case: ResourceBoundedExecutorSuite#13906
sperlingxx wants to merge 2 commits into
NVIDIA:mainfrom
sperlingxx:quick_fix_unstable_case

Conversation

@sperlingxx
Copy link
Copy Markdown
Collaborator

Fixes #13788

This PR stabilizes a flaky unit test whose output was nondeterministic due to runtime fluctuations by introducing a startup lock to enforce a deterministic execution order.

Signed-off-by: sperlingxx <lovedreamf@gmail.com>
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Nov 28, 2025

Greptile Overview

Greptile Summary

Fixed flaky AsyncCpuTask task priority test by replacing unreliable sleep-based timing with explicit lock-based synchronization. The test was failing intermittently because tasks could start executing before all tasks were submitted, causing nondeterministic execution order due to runtime fluctuations. The fix introduces a ReentrantLock that blocks all tasks from starting until after submission completes, ensuring deterministic priority-based scheduling.

Key changes:

  • Added ReentrantLock parameter to buildDummyFn to control task execution start
  • Each test section now locks before submission, submits a dummy blocking task, waits 1ms, submits priority tasks, then unlocks
  • Tasks remain blocked at entry until lock is released, guaranteeing queue formation before execution
  • Replaced sleep-duration parameter with mandatory lock parameter

Confidence Score: 4/5

  • This PR is safe to merge with minimal risk - it fixes a test stability issue without affecting production code
  • The fix correctly addresses the root cause of test flakiness using proper synchronization primitives. The lock mechanism ensures deterministic task ordering by preventing early execution. One minor issue: a comment on line 110 mentions 6 tasks but only 5 are validated, though this doesn't affect correctness
  • No files require special attention - this is a test-only change

Important Files Changed

File Analysis

Filename Score Overview
tests/src/test/scala/com/nvidia/spark/rapids/ResourceBoundedExecutorSuite.scala 4/5 Replaced sleep-based timing with lock-based synchronization to fix flaky test, preventing race conditions in task execution order validation

Sequence Diagram

sequenceDiagram
    participant Test as Test Thread
    participant Lock as ReentrantLock
    participant Executor as ResourceBoundedExecutor
    participant DummyTask as Dummy Blocking Task
    participant Task1 as Task 1
    participant Task2 as Task 2
    participant TaskN as Task N...
    
    Test->>Lock: lock() - acquire lock
    Note over Lock: Lock held by test thread
    Test->>Executor: submit(DummyTask)
    Test->>Test: Thread.sleep(1ms)
    Note over Test: Ensure dummy task queued first
    Test->>Executor: submit(Task1 with priority/memory)
    Test->>Executor: submit(Task2 with priority/memory)
    Test->>Executor: submit(TaskN...)
    Note over Executor: Tasks queue by priority
    Test->>Lock: unlock() - release lock
    Note over Lock: Lock released
    
    Executor->>DummyTask: execute()
    DummyTask->>Lock: lock() - blocks until available
    DummyTask->>Lock: unlock()
    DummyTask-->>Executor: completes
    
    Executor->>Task1: execute() (highest priority)
    Task1->>Lock: lock() - acquires immediately
    Task1->>Task1: sleep(5ms)
    Task1->>Task1: nanoTime()
    Task1->>Lock: unlock()
    Task1-->>Executor: returns timestamp
    
    Executor->>Task2: execute() (next priority)
    Task2->>Lock: lock() - acquires immediately
    Task2->>Task2: sleep(5ms)
    Task2->>Task2: nanoTime()
    Task2->>Lock: unlock()
    Task2-->>Executor: returns timestamp
    
    Executor->>TaskN: execute() (remaining by priority)
    TaskN->>Lock: lock()
    TaskN->>TaskN: sleep(5ms) + nanoTime()
    TaskN->>Lock: unlock()
    TaskN-->>Executor: returns timestamp
    
    Test->>Test: Verify timestamps in expected order
Loading

Copy link
Copy Markdown
Contributor

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 file reviewed, no comments

Edit Code Review Agent Settings | Greptile

@sperlingxx
Copy link
Copy Markdown
Collaborator Author

build

@sameerz sameerz added bug Something isn't working test Only impacts tests labels Nov 30, 2025
@pxLi
Copy link
Copy Markdown
Member

pxLi commented Dec 1, 2025

this is also failing 25.12, please help re-target to release/25.12, thanks~

@sperlingxx sperlingxx requested a review from tgravescs December 1, 2025 01:24
firestarman
firestarman previously approved these changes Dec 1, 2025
Copy link
Copy Markdown
Collaborator

@firestarman firestarman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One nit

* completion of the task.
*/
private def buildDummyFn(sleepMs: Long = 5L): () => Long = {
def buildDummyFn(startBlk: Lock): () => Long = {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: why remove the "private" key word ?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just happened to remove it. Meaningless change..

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reverted

Signed-off-by: sperlingxx <lovedreamf@gmail.com>
Copy link
Copy Markdown
Member

@pxLi pxLi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please help retarget to release/25.12 thanks~

Copy link
Copy Markdown
Contributor

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 file reviewed, 1 comment

Edit Code Review Agent Settings | Greptile

}

// Comprehensive test for task priority and memory usage (including unbounded tasks):
// Execution order: 1, 4, 6, 2, 5, 3
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: comment doesn't match actual test - says "Execution order: 1, 4, 6, 2, 5, 3" (6 tasks) but code only tracks 5 tasks in futures array (line 130 uses Array(2, 3, 5, 1, 4)). The dummy blocking task isn't included in validation.

Suggested change
// Execution order: 1, 4, 6, 2, 5, 3
// Execution order: 1, 4, 2, 5, 3

@pxLi
Copy link
Copy Markdown
Member

pxLi commented Dec 1, 2025

#13910 will be automerged into main if no merge conflict, why we have another PR here?

Recommend to retarget this to release/25.12 directly

@sperlingxx
Copy link
Copy Markdown
Collaborator Author

#13910 will be automerged into main if no merge conflict, why we have another PR here?

Yes, I am about to close this one.

@sperlingxx
Copy link
Copy Markdown
Collaborator Author

Close this PR since #13910 will cover the main branch as well

@sperlingxx sperlingxx closed this Dec 1, 2025
@pxLi
Copy link
Copy Markdown
Member

pxLi commented Dec 1, 2025

#13910 will be automerged into main if no merge conflict, why we have another PR here?

Yes, I am about to close this one.

OK, prefer to keep review history by simple retarget this change instead of filing another one

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working test Only impacts tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] Scala test 'AsyncCpuTask task priority' failed UT

5 participants