Skip to content

Fix the unstable unit test case: ResourceBoundedExecutorSuite#13910

Merged
sperlingxx merged 3 commits into
NVIDIA:release/25.12from
sperlingxx:quick_fix_unstable_case_2512
Dec 1, 2025
Merged

Fix the unstable unit test case: ResourceBoundedExecutorSuite#13910
sperlingxx merged 3 commits into
NVIDIA:release/25.12from
sperlingxx:quick_fix_unstable_case_2512

Conversation

@sperlingxx
Copy link
Copy Markdown
Collaborator

Fixes #13788

Mirror PR of #13906, which enforces the fix on branch-25.12 as well.

This PR stabilizes a flaky unit test whose output was nondeterministic due to runtime fluctuations by introducing a startup lock to enforce a deterministic execution order.

Signed-off-by: sperlingxx <lovedreamf@gmail.com>
Signed-off-by: sperlingxx <lovedreamf@gmail.com>
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Dec 1, 2025

Greptile Overview

Greptile Summary

This PR fixes a flaky unit test (ResourceBoundedExecutorSuite) by introducing a ReentrantLock to control task execution timing and ensure deterministic ordering.

Key Changes:

  • Modified buildDummyFn to accept a Lock parameter that blocks task execution until released
  • Added lock acquisition before task submission and release after all tasks are queued
  • Removed reliance on Thread.sleep() timing assumptions for execution order
  • Updated test structure to submit a dummy unbounded task first, then lock-protected tasks
  • Adjusted comprehensive test from 6 tasks to 5 tasks (removed duplicate unbounded task)

Issues Found:

  • Minor: Comment on line 110 describing execution order appears incorrect and doesn't match the actual expected sequence

Confidence Score: 4/5

  • This PR is safe to merge with minimal risk - it only modifies test code with a solid synchronization approach
  • The fix properly addresses the flaky test by replacing timing-based assumptions with explicit synchronization using ReentrantLock. The approach is sound and only affects test code, not production logic. One minor comment error was found but doesn't impact functionality.
  • No files require special attention - the single test file change is straightforward with proper lock usage

Important Files Changed

File Analysis

Filename Score Overview
tests/src/test/scala/com/nvidia/spark/rapids/ResourceBoundedExecutorSuite.scala 4/5 Introduces startup lock mechanism to stabilize flaky test; minor comment error found

Sequence Diagram

sequenceDiagram
    participant Test as Test Thread
    participant Lock as ReentrantLock
    participant Executor as ResourceBoundedExecutor
    participant Queue as Task Queue
    participant Worker as Worker Thread

    Note over Test,Worker: Test Setup Phase
    Test->>Lock: lock() - acquire startup lock
    Test->>Executor: submit unbounded dummy task
    Executor->>Queue: enqueue dummy task
    Test->>Test: sleep(1ms) - ensure dummy task queued first
    
    Note over Test,Worker: Task Submission Phase
    loop For each priority task
        Test->>Executor: submit task with priority
        Executor->>Queue: enqueue task (sorted by priority)
    end
    
    Note over Queue: Tasks queued by priority<br/>but blocked by lock
    
    Note over Test,Worker: Execution Phase
    Test->>Lock: unlock() - release startup lock
    
    Note over Worker: Worker can now start tasks
    loop For each task in priority order
        Worker->>Queue: dequeue highest priority task
        Worker->>Lock: task attempts lock()
        Lock->>Worker: grants lock
        Worker->>Worker: sleep(5ms)
        Worker->>Worker: capture System.nanoTime()
        Worker->>Lock: unlock()
        Worker->>Test: return timestamp via future
    end
    
    Note over Test: Verify timestamps are<br/>monotonically ordered<br/>per priority
Loading

Copy link
Copy Markdown
Contributor

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 file reviewed, 1 comment

Edit Code Review Agent Settings | Greptile

}

// Comprehensive test for task priority and memory usage (including unbounded tasks):
// Execution order: 1, 4, 6, 2, 5, 3
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

syntax: Comment mentions task 6, but only 5 tasks are submitted (tasks 1-5). Should be "Execution order: 4, 1, 2, 5, 3"

Suggested change
// Execution order: 1, 4, 6, 2, 5, 3
// Execution order: 4, 1, 2, 5, 3

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

firestarman
firestarman previously approved these changes Dec 1, 2025
futures += executor.submit(AsyncRunner.newUnboundedTask(buildDummyFn()))
results = Array(1, 4, 6, 2, 5, 3).zip(futures).sortBy(_._1).map {
lck.unlock() // Allow tasks to start executing.
results = Array(2, 3, 5, 1, 4).zip(futures).sortBy(_._1).map {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: missing 6 here compared to the original array.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@firestarman , yes, since I changed the test sample as well. Currently, only 5 samples are included. Description was also updated.

pxLi
pxLi previously approved these changes Dec 1, 2025
Signed-off-by: sperlingxx <lovedreamf@gmail.com>
@sperlingxx sperlingxx dismissed stale reviews from pxLi and firestarman via 8ba09df December 1, 2025 01:49
@sperlingxx
Copy link
Copy Markdown
Collaborator Author

build

Copy link
Copy Markdown
Collaborator

@res-life res-life left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@res-life
Copy link
Copy Markdown
Collaborator

res-life commented Dec 1, 2025

LGTM, but we may improve this PR in a follow-up PR.

I have the following concerns:

Tasks with higher priority should be executed first even if they are submitted later.

But the buildDummyFn returns the complete time not the start time.

Maybe we can:
Create the priority single thread pool, then commit a long time(may be 1s) running task.
The long time running task will make sure the followng submitted tasks in the thread pool queue.
Then submit some tasks and the priority thread pool will reorder the tasks in the task queue.
And we only need to record the task start time to verify the priority thread pool works fine.
In this way, we do not need to introduce a lock.
By short: do not verify the first task, only verify the following tasks.

@sameerz sameerz added bug Something isn't working test Only impacts tests labels Dec 1, 2025
@sperlingxx
Copy link
Copy Markdown
Collaborator Author

LGTM, but we may improve this PR in a follow-up PR.

I have the following concerns:

Tasks with higher priority should be executed first even if they are submitted later.

But the buildDummyFn returns the complete time not the start time.

Maybe we can: Create the priority single thread pool, then commit a long time(may be 1s) running task. The long time running task will make sure the followng submitted tasks in the thread pool queue. Then submit some tasks and the priority thread pool will reorder the tasks in the task queue. And we only need to record the task start time to verify the priority thread pool works fine. In this way, we do not need to introduce a lock. By short: do not verify the first task, only verify the following tasks.

Good idea! Thanks!

@sperlingxx sperlingxx merged commit 3944c6a into NVIDIA:release/25.12 Dec 1, 2025
62 checks passed
@sperlingxx sperlingxx deleted the quick_fix_unstable_case_2512 branch December 1, 2025 04:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working test Only impacts tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants